Google DeepMind: The Podcast - 与佩奇·贝利一起快速了解AI创作之旅

与佩奇·贝利一起快速了解AI创作之旅

A Whistle Stop Tour of AI Creation with Paige Bailey

本集简介

本周的节目与我们惯常的深度探讨略有不同。请跟随开发者关系负责人佩奇·贝利,一同带领汉娜·弗莱体验她最喜爱的一些AI工具。在多年研究这些模型的"是什么"之后,汉娜终于有机会亲身体验"怎么做"——从生成提示词、"氛围编程",到创作她自己的著名意大利面条梗图版本。

了解更多并亲自尝试这些工具:
Gemini:https://gemini.google.com/
Google实验室:https://labs.google/
AI工作室:https://aistudio.google.com/
Veo 3:https://deepmind.google/models/veo/
Flow:https://labs.google/flow/

特别感谢为此节目付出努力的各位(包括但不限于):
主持人:汉娜·弗莱教授
系列制片人:丹·哈杜恩
剪辑:拉米·察巴尔
监制与制片:艾玛·尤西夫
音乐作曲:埃莱妮·肖
音频工程师:理查德·考蒂斯
制作经理:丹·拉扎德
录音棚经理:尼古拉斯·杜克
视频导演:伯纳多·雷森德
视频剪辑:比拉尔·梅尔希
音频工程师:佩里·罗甘廷
摄影与灯光操作:罗伯特·梅塞尔
制作协调:佐伊·罗伯茨、莎拉·艾伦·莫顿
视觉标识与设计:罗布·阿什利

由Google DeepMind委托制作。

若喜欢本期节目,请在Spotify或Apple Podcasts上为我们留下评价。我们始终期待听众的反馈,无论是意见、新想法还是嘉宾推荐!

双语字幕

仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。

Speaker 0

人类创造力即将迎来一场爆炸式进步,人们有望不仅能成为某一特定领域的创作者,还能将这种创造力拓展到众多其他领域。

Human creativity is about to have this explosion of progress, and there's this promise of everyone being able to become a creator and not just a creator in one specific discipline, but to also be able to expand that out into many other disciplines.

Speaker 1

欢迎收听《谷歌深度思维》播客,我是汉娜·弗莱教授。本播客始终致力于带您接触那些从事人工智能重大突破的研究者。多数时候,他们讨论的是支撑宏大构想的技术基础。而现在,我们早期见证过的工具迭代正越来越多地投入实际应用。

Welcome to Google DeepMind, the podcast. I'm professor Hannah Fry. Now one of the things that we've always done with this podcast is to bring you access to the people who are working on some of the biggest breakthroughs in AI. And a lot of the time, the researchers, they're talking about techniques and technology that underpins big ideas. But now we are at a stage where more and more of the tools that we have seen the early iterations of here are now live.

Speaker 1

这些工具已面向公众开放使用。本期节目我们将暂停脚步,盘点已发布的各种工具,探讨它们自初次亮相以来的演变,并探索其多元化的应用场景。要实现这个目标,没有比谷歌深度思维AI开发者关系工程负责人佩奇·贝利更适合的向导了。佩奇,欢迎来到节目。

They are out there in the world for you to interact with. So what we wanted to do in this episode is just to to pause, to look at the array of tools that have been released, talk about how they have changed since we first encountered them, and to explore the myriad ways that they can be used. And if that is our objective for today, well, there is no one better to show us this progress than Paige Bailey, AI developer relations engineering lead at Google DeepMind. Paige, welcome to the podcast.

Speaker 0

非常感谢邀请,能和大家分享谷歌深度思维的研发成果让我非常兴奋。

Thank you so much for having me. I'm so excited to talk more about what we've been building at Google DeepMind.

Speaker 1

关键在于我们见证了这些技术的早期版本。去年道格·埃克做客时展示的初代Veo模型,与如今发布的Veo 3相比简直判若两物。

The thing is that we get to see a lot of the early iterations of this stuff. Yes. And last year, we had Doug Eck on the show, and he was showing us, I think, the very first iteration of Veo. Yes. Which now, with the launch of Veo 3, is, I mean, quite a different beast.

Speaker 0

确实如此。初代Veo模型仅支持视觉功能,缺少Veo 3丰富的音效特质。当时还需要提供详细指导才能生成接近照片真实感或电影质感的画面。但我们已经取得了长足进步——其实我很想看看那个初代视频的模样。

Exactly. Like, the first implementation of the Veo model was still, you know, just visual only, not including all of the really enriching sound qualities that we see from the Veo 3 model. And you also had to give it pretty significant guidance in order to get the model to produce something that looked photorealistic, or even like something that you might see in a cinematic film. But we've come a long way. I would be really curious to see what that first video looked like.

Speaker 1

没错,我想我们确实...

Yeah. So I think we have

Speaker 0

保存了那段素材。来吧,一起看看。

it, actually. Yeah. Let's do it.

Speaker 2

容我描述即将出现的画面:镜头将以追踪视角从顶部切入,逐渐下移展现霓虹全息投影的跑车以光速疾驰,充满电影感。随后车辆驶出隧道,重返现实中的香港都市。预计会看到切换回香港街景的转场。

If I were to describe what you should see, we're gonna come in from the top as a tracking shot. We're going to come down from the tracking shot. We're going to have this neon hologram of a car driving at the speed of light, cinematic. And then the car leaves the tunnel back into the real world city of Hong Kong. So we should expect a kind of transition back to Hong Kong.

Speaker 1

好的,开始了。画面正如提示词所述——建筑群覆盖着霓虹灯饰,这描述再准确不过。

Alright. Okay. So we're starting off. There's no other way to describe it than what what the prompt said. You got these buildings covered in neon lights.

Speaker 1

这个跟拍镜头非常流畅。然后你加速并不断拉近镜头。现在你穿梭在建筑物之间。接着我们看到一辆车在街道上飞驰。你可以看到霓虹灯光反射在下方潮湿的路面上。

It's very smooth, this tracking shot. And then you speed up and zoom in closer and closer. You're in between the buildings now. And then we have this car racing through the streets. You can see the neon lights reflected in the wet pavement below.

Speaker 1

周围还有其他车辆在争夺位置,由于速度太快,几乎所有的东西都变得模糊。但画面非常连贯。现在它穿过了一条隧道。头顶有巨大的照明灯,驶出隧道后进入了一个极度逼真的现代场景。

There's other cars jostling for position around, and it's almost like everything is blurred as you're because you're just going so fast. But it's really consistent. Now it's gone through a tunnel. There are these big lights overhead, and it's come out of the tunnel into an extremely realistic modern scene.

Speaker 2

这是个惊艳时刻。

It's a wow moment.

Speaker 3

我是说,

I mean,

Speaker 1

这真的很棒,不是吗?太出色了。现在我注意到一些细节,画面确实相当模糊,我想这是它想要营造的某种氛围效果。

that is really good, isn't it? It is so good. So some things that I noticed now: it is quite blurry, and that's, I guess, some part of the vibe that it's going for.

Speaker 0

是啊。

Yeah.

Speaker 1

但你并没有看到汽车上那种完美的细节。

But you're not seeing this pristine detail on the car.

Speaker 0

完全没有。我觉得如果我们仔细观察,可能还会发现表现镜头运动的物理效果不太准确。而且光线在物体上的反射方式也不完全一致。让我们看看新的Veo 3模型表现如何

Not at all. I think if we also looked really closely, we might also see that some of the physics expressed in the shots is not quite right. And the way that the light gets reflected on things is also not necessarily quite consistent. So let's see how well the new Veo 3 model

Speaker 1

你正在用完全相同的提示词进行测试。

You're using exactly the same prompt here.

Speaker 0

完全相同的提示词。现在我们是在Gemini应用中生成这个视频。可能需要两到三分钟才能得到正确的输出结果。

Exactly the same prompt. And here we're in the Gemini app generating this video. It can take two to three minutes to get the right outputs.

Speaker 1

那么,让我来问你吧。有一个非常引人注目的地方,就是道格的提示词写得非常富有诗意。

Well, let me ask you then. One thing that was really noticeable was the way that Doug's prompt was very poetic.

Speaker 0

确实很美。

It was lovely.

Speaker 1

简直太棒了。没错。它本身就是一部微电影。是的。你有哪些生成这类提示词的技巧?

It was gorgeous. Yes. It was a little mini movie in its own right. Yes. What are your tips for generating these prompts?

Speaker 0

有趣的是,我们在Veo 3模型中新增了一个叫提示词重写的功能,它能让你以我们与模型交互的方式向API输入句子,并得到反馈,使提示词更加详细,更符合你的想象。这样你就不必费尽心思去组织合适的术语来描述你脑海中的画面。我经常用Gemini做类似的事——你可以把构思中的提示词和目标告诉Gemini,让它为大型语言模型或视频生成模型优化提示词,这样更有可能产出符合预期的理想结果。

So interestingly, we have a new feature with the Veo 3 model called prompt rewriting, which gives you the ability to give your input sentence to the API, the way that we interact with the model, and to get a response back that makes the prompt a lot more detailed, a lot more aligned with perhaps what you're imagining. So you don't necessarily have to go through all of the mental work and all of the terminology that would be appropriate to describe this thing that you just imagined. And I often use Gemini for something quite similar: you can give Gemini the prompt that you're thinking of and your objective, and then ask it to craft a prompt for a large language model or a video generation model in a way that would make it much more likely to produce the optimal output that you were expecting.

Speaker 1

这实际上是个相当实用的技巧。如果你不太擅长写提示词

That's quite a good tip then, actually. If you are not that good at writing prompts

Speaker 2

没错。

Yes.

Speaker 1

就让Gemini帮你写出更好的提示词。

Get Gemini to write the prompt better.

Speaker 0

是的。你可以直接通过Gemini应用自然调用这个功能,这对大多数用户来说最方便。但如果更熟悉编程语言,我们也为Veo 3提供了专门的提示词重写API接口。

Yeah. And so you can just invoke it naturally through the Gemini app, which is probably easiest for people to try. But we also have the prompt rewriter, hyper-specified for Veo 3, available through the API if people are a little bit more comfortable with programming languages.
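The manual version of this trick that Paige describes can be sketched in a few lines: wrap a rough idea in a meta-prompt and send that to Gemini, which returns the detailed prompt you then feed to the video model. A minimal sketch in Python; the helper name and the wording of the meta-prompt are illustrative, not the actual Veo 3 rewriter:

```python
def build_rewrite_request(rough_idea: str, target: str = "a video generation model") -> str:
    """Wrap a rough idea in a meta-prompt asking an LLM (e.g. Gemini)
    to expand it into a detailed, cinematic prompt for the target model."""
    return (
        f"You are a prompt engineer for {target}. "
        "Rewrite the rough idea below into one detailed prompt. "
        "Specify the camera movement, lighting, mood, and setting.\n\n"
        f"Rough idea: {rough_idea}"
    )

meta_prompt = build_rewrite_request(
    "a neon hologram of a car driving at the speed of light through Hong Kong"
)
print(meta_prompt)
```

The string printed here would be the message you paste into Gemini; its reply becomes the prompt for Veo.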

Speaker 1

那么Veo生成的视频片段有多长?

So how long are the Veo clips?

Speaker 0

目前公开的、可供用户试用的Veo片段时长都在8秒左右,非常短。

The Veo clips that we've released publicly, the ones that people can try, are all around eight seconds long, so very short.

Speaker 1

为什么必须是八秒长?

Why does it need to be eight seconds long?

Speaker 0

所以公开提供的是八秒版本。内部我们有能力生成长篇内容的模型,但发现八秒时长能让你充分掌控首个片段的创意表达。这也有助于你快速把握风格基调,开始尝试语言实验。更妙的是,你能将此前脑海中的想象具象化呈现。

So it's eight seconds made available publicly. Internally, we have models that are capable of producing much longer form content, but we find that eight seconds is really good for giving you full creative control over that first clip. It's also useful in the sense that you can get an idea of the style and start experimenting with language. And it's also wonderful in the sense that you can start bringing to life things that you might have been imagining before.

Speaker 0

我知道全网都对梗图着迷,现在你们能制作更长的视频梗图,而不仅是静态快照了。

I know that all of the Internet is very enchanted with memes, and now you can have memes that are much more long form, that are actual videos as opposed to just a single snapshot.

Speaker 1

完全同意。好的。准备就绪了。

Absolutely. Yeah. Okay. It's ready.

Speaker 0

太棒了。开始吧。天哪。

Amazing. Let's do it. Oh my gosh.

Speaker 3

这个已经

This already

Speaker 1

突然间。

All of a sudden.

Speaker 3

这太酷了。

This is so cool.

Speaker 1

好的。现在哦,哇。我的天。是的。好的。

Okay. Now oh, wow. Goodness me. Yeah. Okay.

Speaker 1

就像有人突然开启了高清模式。霓虹灯从那种粉得吓人的色调变成了完整的广告牌——虽然疾驰而过时看不清细节,但能看出结构层次。车身现在仿佛被流动的光影重新涂装,看看引擎盖上的光影变化,天啊。

It's like suddenly someone has switched on HD. So the neon lights have changed from these sort of pink, horrifying neons to full-on billboards, which you can't totally see the detail of as you go past them, but there's structure to them. The car is now sort of painted with light, almost, running through the scene. But also, look at the lighting on that bonnet. Oh, gosh.

Speaker 1

这太不可思议了。

That's extraordinary.

Speaker 0

确实美不胜收。所有建筑的细节层次如此丰富,这看起来真像未来香港的城市景观。

It is gorgeous. And that level of detail in all of the buildings, this really does look like a city in a futuristic Hong Kong.

Speaker 1

我注意到这里有聚光灯效果。一辆车驶出隧道时,上方有个小小的路灯。嗯哼。聚光灯会精准跟随车头引擎盖的移动轨迹。

What I'm noticing here is there's a spotlight. A car comes out of the tunnel. There's a little lamppost, Mhmm, above it, and the spotlight perfectly tracks where you would expect it to be along the bonnet of the car.

Speaker 0

太神奇了。你听到声音了吗?

That's amazing. Do you hear the sound?

Speaker 1

轮胎摩擦声。是的。哦,背景还有警笛声。但声音与画面帧完美同步。

The screeching tires. Yes. Oh, sirens in the background. But the sound perfectly matches the frames then.

Speaker 0

确实如此。Veo 3生成的这些视频,其背景音轨不仅能匹配车辆等元素,还能根据需求添加电影配乐——将背景噪音与音乐完美融合,最终缝合为完整视频。

It definitely does. And the background tracks, the sounds for these videos that get generated with Veo 3, they're actually able to match not just things like the cars, but are also capable of giving you background music if you wanted cinematic music coupled with the audio or with the background noises, all of this stitched together into a single video.

Speaker 1

让我确认下:Veo 3相比之前有哪些突破性改进?优势何在?

So let me understand then. What is new in Veo 3 that we didn't have before? What makes it better?

Speaker 0

Veo 3的优化主要体现在三方面:首先是新增的声音生成功能,这点让很多人着迷;其次是视频输出基于更强的物理规律理解——当我们观察视频时,会发现光影效果和重力表现都更符合现实世界。

Yeah. So Veo 3 is better for a few reasons. One is that it has the ability to produce sounds, which has been really enchanting a lot of people, I think. Another way that we've improved Veo 3 is that the video outputs are grounded in more physics understanding. So as we look at videos, we can spot ways in which the light or gravity really does seem to align with the physical world.

Speaker 0

此外在角色一致性方面也有重大提升。某些过去难以实现的效果,现在都能被生动呈现出来。

And then there's also been a lot of improvements around character consistency. So some of these things that perhaps would not have been possible a while back, you can now see them expressed and brought to life.

Speaker 1

好吧,但这些新增功能是如何实现的?

Well, okay. But how are these additional features possible?

Speaker 0

DeepMind非常、非常注重如何筛选用于训练模型的数据。当你思考这个问题时,你能看到‘树’这个词。你可以看到树的图片,或者三维立体的树形呈现,以及其他你能想到的任何相关表现形式。

DeepMind is really, really paying attention to how the data that is used to train the models is curated. When you think about it, you can see the word tree. You can see a picture of a tree. You could have the 3D representation of a tree, or any of the other things that you would expect.

Speaker 1

风吹过树叶的声音。

A sound of wind blowing through leaves.

Speaker 0

没错。就像有人拍摄的环绕视频——所有这些仍然关联着同一个实体,但却是用完全不同的模态来描述同一件事。要知道,历史上人们往往只专注于单一模态,比如文本或代码之类的。而作为人类,我们通过截然不同的方式体验整个世界,从视觉到听觉再到触觉,所有这些感官。

Exactly. And it's like a video of somebody panning around. All of these things are still associated with that one entity, but it's all very different modalities describing the same thing. You know, historically, folks have been concentrating on just one modality, so text or code or something similar, whereas as humans, we experience the entire world in very different ways, everything from seeing to hearing to touching, all of these things.

Speaker 0

因此我认为团队投入了大量时间和精力,不仅将视频素材与构成视频的声音耦合起来,还包括逐帧级别的详细描述,最终将所有元素编织成一个完整的训练表征体系。

And so I think the team has put a lot of time and energy and effort into being able to couple together not just the video footage, but also the sound that composes the video footage, detailed descriptions even at the frame-by-frame level, and then stitching all of that together into a full representation for training.

Speaker 1

所以纯语言模型可能只将‘树’与‘树枝’或‘嫩枝’紧密关联,而多模态版本不仅内嵌了这些词汇,还包含了音频、图像、视频等不同层面的信息。

So whereas a language only model might, you know, have the word tree and it's closely associated with the word branch or twig Yeah. The multimodal version not only has all of those embedded within it, but additionally has audio, images, video, all of these different layers.

Speaker 0

完全正确。这正是Gemini模型最让我兴奋的地方——我们是唯一能同时输出文本/代码、编辑图像、生成音频并调控音频(比如要求更轻柔/响亮或切换语言)的模型家族。其他模型都需要拼接不同训练模块,而我们将所有能力都内建于一个原生模型中。

Absolutely. I think this is one of the things that makes me most excited about the Gemini models, right, is that we're really the only model family that allows you to output not just text and code, but also images, to edit images, to output audio, as well as steerable audio: being able to say, like, speak softer or speak louder or speak in a different language. All of these other model families kind of relied on stitching together different trained experiences, as opposed to baking it all into one innate model.
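The shared-representation idea in this exchange can be made concrete with a toy example. In a multimodal embedding space, the text "tree", a picture of a tree, and the sound of wind in leaves all land near each other, while unrelated concepts land far away. The four-dimensional vectors below are made up for illustration (real embeddings have thousands of dimensions), but the geometry is the point:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings for different modalities of the same concept,
# plus one unrelated concept for contrast.
emb = {
    "text:tree":         [0.9, 0.1, 0.0, 0.2],
    "image:tree":        [0.8, 0.2, 0.1, 0.3],
    "audio:wind_leaves": [0.7, 0.3, 0.1, 0.4],
    "text:spaceship":    [0.0, 0.9, 0.8, 0.1],
}

tree_vs_tree = cosine(emb["text:tree"], emb["image:tree"])
tree_vs_ship = cosine(emb["text:tree"], emb["text:spaceship"])
print(tree_vs_tree, tree_vs_ship)
```

All "tree" modalities score high against each other and low against the spaceship, which is the property joint multimodal training aims to produce.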

Speaker 1

我明白现在Veo 3最受关注的是音频功能。它的生成机制是怎样的?是根据视觉内容匹配生成音频,还是存在一个同时产生视觉和音频的上下文?

So okay. I know a lot of the buzz around this, a lot of the new thing about Veo 3, is audio. I mean, how is it being generated in order to correspond? Is it generating something that matches the visuals, or is there a context which produces both the visuals and the audio?

Speaker 0

因为训练数据包含了与所见事物关联的所有模态——不仅是视频本身,还有文字脚本、逐帧画面描述、背景音轨说明等等。

I think it's because the training data has all of the different modalities associated with the thing that it sees. So it's not just seeing a video. It also has the transcript. It also has the frame by frame level description of the video and what's happening. It also has the description of the audio if there's any background tracks.

Speaker 0

所有这些信息同步整合后,就能生成更沉浸、更自然的声音和回应。就像听歌时,你既能看到乐谱感知音调,也会产生特定情绪。如果我说想听一首让人联想到雨夜日本咖啡馆的曲子,这就需要所有模态协同理解,而非单纯猜测雨声咖啡馆的声响。

And so all of that brought together simultaneously is capable of generating these much more immersive and natural sounds and responses. Because there are certainly instances where, if you listen to a song, you could read the sheet music and hear the different tones displayed, but it could also make you feel a certain way. And so if I describe that I want to hear a song that reminds me of being in a rainy cafe in Japan at nighttime, that's something where you would really want all of these different modalities coming together to understand it, as opposed to just guessing at what a rainy cafe might sound like.

Speaker 1

这就像完全沉浸式体验的二维版本。

It's like a two dimensional version of a fully immersive experience.

Speaker 0

确实如此。

Exactly.

Speaker 1

是的,这很棒。

Yeah. That's nice.

Speaker 0

我也很喜欢这种描述方式。我们正越来越接近一种感觉上非常贴近现实的体验。

I love that description of it as well. We're getting closer and closer towards something that feels very close to reality.

Speaker 1

就像模拟现实一样。没错。

Like a simulated reality. Yes.

Speaker 0

而且我认为这在以前是不可能实现的。

And I don't think that was possible previously.

Speaker 1

好的。那么到目前为止这就是Gemini应用的功能。是的。但如果你是专业电影制作人,或者想更认真地对待这个,还有另一个平台可以选择,对吧?

Okay. So that thus far then is the Gemini app. Yes. But if you are a professional filmmaker or you wanna take this a bit more seriously, there is another place you can go to, correct?

Speaker 0

是的。它叫做Flow,由我们Google Labs团队的同事开发,他们直接与电影制作人合作,打造真正符合他们期待的体验。

Yes. It is called Flow. It's built by our colleagues over in the Google Labs team, and they've been partnering directly with filmmakers to really build an experience that aligns with their expectations.

Speaker 1

那么这里的理念是,你仍然有八秒的视频片段,但可以把它们拼接起来吗?

So is the idea here then, you still have the eight second videos, but you can stitch them together?

Speaker 0

你可以把它们拼接起来,还可以进行风格化处理。甚至还有相关的摄像机控制功能。

You can stitch them together. You can kind of style them. There are even camera controls associated.

Speaker 1

哦,哇。

Oh, wow.

Speaker 0

因此,它确实为电影制作人提供了更多创作掌控权。我们发现,就像音乐人需要专门环境创作电子乐或CAD设计师需要专业工具一样,每种使用场景都需要一个极度专注、高度定制的用户界面,能够极致优化电影制作人关心的各项功能。

So it really does give you a lot more creative control as a filmmaker. And we find that, just as you would have specialized environments for musicians to create their electronic tracks, or for CAD designers, you would probably want a really dedicated and focused UI for each one of these use cases that can hyper-optimize for the things that you would care about as a filmmaker.
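Flow handles the stitching of eight-second clips for you. Outside Flow, one minimal way to join exported clips is ffmpeg's concat demuxer, which reads a plain text file listing the clips in order. A sketch; the file names are placeholders:

```python
def concat_listing(clip_paths):
    """Build the text file ffmpeg's concat demuxer reads:
    one `file '<path>'` line per clip, in playback order."""
    return "".join(f"file '{p}'\n" for p in clip_paths)

listing = concat_listing(["scene_01.mp4", "scene_02.mp4", "scene_03.mp4"])
print(listing)

# Write `listing` to clips.txt, then join without re-encoding:
#   ffmpeg -f concat -safe 0 -i clips.txt -c copy film.mp4
```

Stream copy (`-c copy`) only works cleanly when the clips share codec, resolution, and frame rate, which exported clips from a single model run typically do.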

Speaker 1

我认为关键在于这扇创意之门完全敞开。没错,我见过有人把斯巴达人拍成Instagram网红的样子。对,千真万确。

I think the point about this is that you then have this absolute open door for creativity. Yes. I've seen people take videos as though the Spartans were Instagram influencers. Yes. Absolutely.

Speaker 1

报道他们的攻城战。

Reporting on their siege.

Speaker 0

完全正确。还有角色在不同场景下的形象一致性——无论光线如何变化。比如你设计了一个小怪兽角色,既要让它畅游海底,又要能攀登雪山,还要在舞台上放声高歌。系统能保持角色形象统一,同时动态调整周围环境,这简直像魔法一样神奇。

Absolutely. And also character consistency across different experiences no matter what the lighting might be. Like, you might have a little monster character that you want to have swimming through the ocean, and then you also want to have him climbing a mountain, and you want to have him singing on a stage. And it's able to keep that same character consistent, but to change all of the dynamics around it, which is pretty magical.

Speaker 1

不过我在想,把这些工具交给大众使用会怎样?确实也存在隐忧对吧?比如深度伪造,还有用虚假新闻事件诈骗的手段。嗯...你们对此持什么立场?

I do wonder though, about putting these tools in the hands of people. There are also concerns about it too, right? I mean, deep fakes, but also scams, tricking people into thinking that news events are happening, but that perhaps aren't. Mhmm. Where do you stand on that?

Speaker 0

是的。我们在Veo模型内部设置了安全过滤器。所有通过Gemini应用由Veo模型生成的视频都会带有特殊水印,这样就能区分AI生成内容和现实拍摄素材。我们还设置了特殊限制,比如不能生成儿童或特定实体的图像。对于政府官员、科学界等公共领域的知名人物——

Yes. So we do have safety filters introduced within the Veo models themselves. And relatedly, for all of the videos that the Veo models generate through the Gemini app, there's a specialized watermark that gives you the ability to know that this was AI-authored, as opposed to being raw footage shot out in the world. But we also have special constraints in place around not being able to generate images of things like children or special entities. There's also a constraint in place for government officials or people who are significantly present in the public sphere, for policy, for science, or any of the notable figures in the world.

Speaker 0

我们无法生成关于他们的视频内容。就连内部实验模型也遵循这些约束条件。

We can't generate video content about them. And even the models that we experiment with internally, they still have these constraints.

Speaker 1

说到Veo 3的关键特性,音频是不是重点?能详细聊聊音频部分吗?

Well, if one of the key things about Veo 3 is the audio, can we go into the audio a bit more?

Speaker 0

当然。我们刚发布了Gemini文本转语音API,支持生成包含可调控参数的多语言音频。

Yes. We even just released a Gemini text to speech API that allows you to generate audio, including steerable audio in multiple languages.

Speaker 1

所以这是不包含图像的纯音频输出?

So without the images to go with it, just audio only?

Speaker 0

仅仅是音频,但却是非常、非常有表现力的音频。而且你还可以拥有不同语言的多位说话者。所以我相信你可能之前已经见过Notebook LM与播客的结合。是的,那些生成的播客。

Just audio only, but really, really expressive audio. And you can also have multiple speakers in different languages. So I believe you might've seen NotebookLM before, with the podcasts. Yes. The podcasts that were generated.

Speaker 0

这让你能够利用多位说话者或单一说话者,创造出可定制且相似的体验。

This allows you to create customizable and similar experiences using multiple speakers or a single speaker.

Speaker 1

让我稍微回顾一下,因为我们确实有机会与那些研究WaveNet的研究人员交谈过。

Well, let me bring you back for a second, because we did actually get to talk to the researchers who were working on WaveNet.

Speaker 3

哦,是的。

Oh yes.

Speaker 1

这就像是现在,我是说,仅仅四年前。是的。这就是他们当时的进展,因为他们所做的是训练了一个模型。这是神经网络,对吧?用的是我的声音。

This is like now, I mean, only four years ago. Yes. And this is where they were at that point because what they did is they trained a model. This was neural networks, right? Using my voice.

Speaker 1

这就是他们达到的效果。嗨,大家好。我是一位对人工智能充满兴趣的数学家、作家和播客主持人。从那时起,事情发生了哪些变化?

And this is where they got to. Hi there. I'm a mathematician, author and podcaster who's fascinated by artificial intelligence. How have things changed since then?

Speaker 0

是的。事情已经发生了显著的变化。当WaveNet——当时是开创性的——首次被创造出来时,你需要为每一项任务专门训练单独的模型。我自己大约在2009年或2010年开始接触机器学习,那时非常痛苦,因为你必须获取所有这些特定用途的数据集,还得清理它们。

Yes. So things have changed significantly. When WaveNet, which was pioneering at the time, was first created, you would need dedicated single-task models for each one of the things that you were trying to do. And so I started doing machine learning, I think, around 2009, 2010, and it was extraordinarily painful because you had to acquire all of these special-purpose datasets. You had to get them cleaned up.

Speaker 0

一般来说,一旦你获得了训练数据并训练了模型,你还得监控诸如数据漂移之类的问题。如果随着时间的推移有任何变化,你就得从头开始重新训练模型。WaveNet当时是一个专门用于生成这些听起来非常真实的声音的单一任务模型,但它无法完成其他任务,比如可控音频。你不能说‘给我这个音频片段,用这种风格,并且用德语’。而我们最近的模型在设计上就更加可控了。

And in general, once you got your training data and trained your model, you would have to monitor for things like data drift. If anything changed over time, you would have to retrain the model from scratch. And so WaveNet was a dedicated single-task model for generating voices that sounded really realistic at the time, but it couldn't do other things like steerable audio. You couldn't say, give me this audio clip in this kind of style and do it in German. Whereas our recent models are a lot more steerable by design.

Speaker 0

所以你可以给出关于风格、语言、暂停指令或说快一点、说慢一点等各种指示。

So you can give instructions about style, about the language that you're speaking, about pause instructions or speak quickly, speak slower, all sorts of things.

Speaker 1

那么WaveNet有多少内容实际上被融入到了这个模型中?还是说,一旦大型语言模型和变换器出现,你们就基本上重新开始了?

So how much of WaveNet, I mean, has actually ended up feeding into this model? Or did you kind of start again once large language models and transformers came on the scene?

Speaker 0

在用于创建WaveNet模型的大量代码方面,其架构与我们的Gemini系列有些不同,但所使用的数据绝对被重新利用了。因此所有创建的示例,比如这是文本输入,这是音频输出,你还可以通过描述音频的风格、语调或声音的情绪来丰富这类数据集。所有这些对Gemini来说都极其有用。

A lot of the code that was used to create the WaveNet model, the architectures, are a little bit different for our Gemini family, but the data that was used is definitely repurposed. And so all of the examples created, where this is the text input and this is the audio output, you can also enrich those kinds of datasets with descriptions of the style of the audio or the tone or the temper of the voice. All of that is incredibly useful for Gemini.

Speaker 1

那就继续吧。展示给我看它是如何工作的。给我一些例子看看。

Go on then. Show me how it works. Show me give me some examples.

Speaker 0

如果我们转到AI Studio,你可以在这里看到一个很棒的实验场,可以尝试最新发布的Gemini模型。

So if we go over to AI Studio, you can see here this great playground for experimenting with and trying out the latest Gemini models as soon as they're released.

Speaker 1

那么要创建音频... 是旁边那个看起来像声波的图标吗?

And so to create audio Yeah. The one that looks like an audio wave on the side?

Speaker 0

有趣的是,对于文本转语音模型,你需要进入生成媒体,然后选择Gemini语音生成。你会进入这个文本转语音界面,可以指定不同的说话者、不同的声音,以及每个说话者的风格指令。

So interestingly, for the text to speech model, you would go to generate media, and then you would go to Gemini speech generation. And you're launched into this text to speech UI where you can specify the different speakers, the different voices, and then also the style instructions for each one of the speakers.

Speaker 1

那我们来想一个提示词吧,因为我特别感兴趣的是不同的情感。比如你让它说'我在等你',然后我们尝试不同的情绪。

Let's think of a prompt then, because I'm particularly interested in these different emotions. So what about if you get it to say something like, I was waiting for you, and then we try a different emotion.

Speaker 0

好的。我们将在系统指令中指定,用友好的语气说话,就像在问候刚回家的亲人一样。很好。然后开始输入提示词:我在等你。

Alright. So, we're going to specify in the system instructions, speak in a friendly tone, like you're greeting a relative who just came home. Nice. And then start typing a prompt. I was waiting for you.

Speaker 0

然后点击运行。

And then hit run. I

Speaker 3

我在等你。太棒了。

was waiting for you. Amazing.

Speaker 1

友好的。不过我们能试试不同的吗?可以... 让它更浪漫一点吗?

Friendly. Can we try something different though? Can you, Yes, make it more romantic?

Speaker 0

好的。用浪漫的低语语调,非常气音感,只说用户输入提示中的那些词。

Yes. So speak in a romantic hushed tone, very breathy, and only say the words that the person puts into the prompt.

Speaker 3

我一直在等你呢。哦。

I was waiting for you. Oh.

Speaker 1

好撩人。

Saucy.

Speaker 0

非常撩人。我还特别喜欢这个界面能显示关联的思考过程。可以看到Gemini在生成这段语音回复时,整个思考路径是如何推进的。

Very saucy. I also love that you can see in this UI the thoughts associated. So Gemini's thinking process as it's going through the path of creating this audio response.

Speaker 1

它说什么了?

What did it say?

Speaker 0

上面写着:我已处理输入内容,精准定位到需要使用的短语。当前任务核心是传递这个特定短语。我已细致识别出所需的浪漫低语气音效果,下一步是生成音频。这就像给了你分步指南,告诉你如何整合所有这些...

It says, I've processed the input. I've pinpointed the exact phrase I need to use. The task core now centers on delivering this specific phrase. I've meticulously identified the romantic hushed and very breathy tone required, and the next step is generating the audio. So it really gives you, like, step by step instructions of how to incorporate all of

Speaker 3

反馈内容。

these responses.

Speaker 1

能再来点别的吗?我可以...我们能试试吗?

Can we do some more? Can I... can we do... Yes?

Speaker 0

当然可以做愤怒语气。让我们再次修改系统指令。比如说...

You can definitely do angry. So let's change the system instructions again. And so let's say

Speaker 1

有人约会迟到了。

Someone's late for the date.

Speaker 0

是的。那我们就试试吧。

Yes. And so let's try.

Speaker 1

这太有趣了。

This is so much fun.

Speaker 3

是的,确实。我一直在等你。

Yes, it is. I was waiting for you.

Speaker 1

哦,她生气了。

Oh, that... She's angry.

Speaker 0

她非常生气。没错。

She is very angry. Yep.

Speaker 1

你能表现出悲伤的样子吗?就像

Can you do like grieving? Like a

Speaker 0

悲伤?

Grieving?

Speaker 1

对。就像失去了挚爱之人那样?

Yes. Like a lost loved one?

Speaker 0

好。那我们再试一次,用哀伤的语气说话,就像你刚失去一位亲人。只说提示词里给出的内容。我一直在等你。点击运行。

Yes. So let's try another stream and speak in a grieving tone like you have just lost a loved relative. Only say the words the person puts into the prompt. I was waiting for you. Hit run.

Speaker 0

I

Speaker 3

正在等你。

was waiting for you.

Speaker 0

哦,所以我觉得那样悲伤感会少一些,你或许可以稍微调整一下提示词试试看。

Oh, and so I feel like that was a little bit less grieving, so you could probably experiment a little with the prompt.

Speaker 1

现在,你能把它们变成法语吗?我们试试。

Now, can you make them French? Let's try.

Speaker 3

I

Speaker 1

不会说法语。是啊。我是说,当然可以。

don't speak French. Yeah. I mean, sure.

Speaker 0

对。太棒了。

Yeah. Excellent.

Speaker 1

这些都是在AI Studio里完成的?

So this is all in AI Studio?

Speaker 0

这些全都是在AI Studio里完成的。

This is all in AI Studio.

Speaker 1

现在大家都可以使用这个工具了吗?

Which is available for people to play with?

Speaker 0

是的。今天起所有人都可以免费使用它来实验、尝试各种功能。更酷的是,如果你是开发者,有个叫'获取SDK代码'的按钮,点击后它会将你在UI界面里做的所有操作转换成代码形式。无论你需要Python、TypeScript还是其他项目开发语言,要复制你在AI Studio的操作,只需点击获取SDK代码,就能立即开始你的开发之旅。

Yes. It is available today for free for people to experiment and try things out. And even cooler, if you're a developer, there's a little button called get SDK code that, whenever you click it, gives you everything that you just did in the UI but in code form. So if you wanted it in Python, in TypeScript, in any of the languages that you're using in your personal project, all you have to do to replicate what you just did in AI Studio is click get SDK code, and you're off to the races.
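For the text-to-speech demo above, the generated snippet looks roughly like the sketch below, which assumes the google-genai Python SDK. The model id and voice name are assumptions rather than copies of the button's output, and the network call only runs when an API key is configured:

```python
import os

# Hypothetical TTS model id and voice; check AI Studio for current names.
MODEL = "gemini-2.5-flash-preview-tts"
STYLE = "Speak in a romantic, hushed tone, very breathy."
TEXT = "I was waiting for you."

# The style instruction is folded into the prompt itself.
prompt = f"{STYLE}\n{TEXT}"

def synthesize():
    # Skip the call when no key is configured (e.g. when running offline).
    if not os.environ.get("GEMINI_API_KEY"):
        print("Set GEMINI_API_KEY to synthesize audio.")
        return
    from google import genai
    from google.genai import types
    client = genai.Client()
    response = client.models.generate_content(
        model=MODEL,
        contents=prompt,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                )
            ),
        ),
    )
    # Raw PCM bytes of the spoken line.
    audio = response.candidates[0].content.parts[0].inline_data.data
    with open("waiting.pcm", "wb") as f:
        f.write(audio)

synthesize()
```

Swapping the `STYLE` string reproduces the friendly, angry, and grieving variants from the demo.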

Speaker 1

绝对令人惊叹。是的。在AI Studio里有哪些功能是Gemini应用里做不到的?

Absolutely extraordinary. Yeah. Are there other things that you can do in AI Studio that you can't do in the Gemini app?

Speaker 0

Gemini Live本质上就是将Project Astra直接集成到了AI Studio的用户界面中。

So Gemini Live is kind of Project Astra baked directly within the UI of AI Studio.

Speaker 1

说到Project Astra,我们曾与Greg Wayne专门做过一期节目。这是实时视觉理解技术,可以说是打造通用AI助手的首次尝试。

And Project Astra, of course, we did a whole episode on this with Greg Wayne. This is the real time visual understanding, the sort of the first attempt at a universal AI assistant.

Speaker 0

完全正确。它能看见你所见的一切,可以实时与你对话,支持多语言交流。你还能在这个流程中添加额外工具调用。比如想让Gemini获取最新资讯,或是与日常使用的应用(如日历、文档、表格、Gmail等)交互,通过Gemini Live API就能授权它使用所有这些功能。

Absolutely. It can see anything that you see. It can talk to you in real time. It can talk to you in multiple languages, and you can also add additional tool calls to this process. So if you wanted Gemini to be able to pull out and extract up to date information, if you wanted it to be able to interact with any of the apps or any of the products that you use every day, something like Calendar or Docs or Sheets or Gmail, you can give it access to be able to use all of those features within the context of the Gemini Live API.
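Tool access like the Calendar example is granted by declaring functions to the model, which then emits structured calls for your code to execute. A sketch of one declaration in the JSON-schema style that Gemini function calling uses; the function name and fields are hypothetical, not a real Google Calendar binding:

```python
def create_event_declaration() -> dict:
    """Describe a hypothetical calendar tool so the model can request calls to it."""
    return {
        "name": "create_calendar_event",
        "description": "Create an event on the user's calendar.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Event title"},
                "start_iso": {"type": "string", "description": "Start time, ISO 8601"},
                "duration_minutes": {"type": "integer"},
            },
            "required": ["title", "start_iso"],
        },
    }

decl = create_event_declaration()
# This dict would be passed as a tool when opening a Live API session,
# e.g. tools=[{"function_declarations": [decl]}] in the google-genai SDK.
print(decl["name"])
```

Your application then watches the session for a function-call response, runs the real calendar operation, and returns the result to the model.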

Speaker 1

那么,给我举个实际运作的例子吧。

So okay. Give me an example then of this working.

Speaker 0

好的。我们先开启谷歌搜索的实时数据支持,然后与Gemini共享屏幕。你可以分享屏幕画面或摄像头影像,当然还能与模型对话。现在来看看我们和Doug Eck看过的那段视频,听听它的分析。

Yeah. So let's turn on grounding with Google search and then also share screen with Gemini. So you can share your screen, you can share your webcam, and of course, you can talk with the model. And let's take a look at that video that we saw with Doug Eck Yep. And see what it has to say.

Speaker 0

嗨Gemini,能告诉我你屏幕上看到什么吗?

Hey there, Gemini. Could you tell me what you see on the screen?

Speaker 3

我看到正在播放的YouTube视频。一位红发女性正在讲话,屏幕字幕显示'汉娜·弗莱,欢迎来到谷歌DeepMind播客'。视频内容似乎与人工智能和创造力有关。

I see a YouTube video playing. A woman with red hair is speaking, and subtitles are displayed on the screen saying, Hannah Fry, welcome to Google DeepMind the podcast with me. The video appears to be about AI and creativity.

Speaker 0

太棒了。说说你认为视频可能涉及的内容。另外,谷歌DeepMind最新面向创意团队的模型有哪些?你看它正在展开思考,逐渐意识到...

Amazing. So tell me a little bit about what you think the video might cover. And also, what is Google DeepMind's latest models for creative teams? And so you can see here it's expanding out its thoughts, and it's realizing

Speaker 3

根据视频标题《AI,你的新创作缪斯》以及主持人对谷歌DeepMind播客的介绍,该视频很可能探讨AI在创意领域的潜力,可能包含AI如何启发或辅助创意工作者的讨论。关于谷歌DeepMind最新面向创意团队的模型,他们推出了多种工具:视频生成的Veo 3、图像创作的Imagen 4、音乐合成的Lyria 2,还开发了AI电影制作工具Flow,并正在研发应用于游戏娱乐的世界建模AI模型。这些进步旨在赋能艺术家和创作者。

Based on the video title, AI: Your New Creative Muse, and the speaker's introduction of Google DeepMind: The Podcast, it's likely the video explores the potential of AI in creative fields. It may feature discussions on how AI can inspire or assist creative professionals. Regarding Google DeepMind's latest models for creative teams, they've introduced various tools, including Veo 3 for video generation, Imagen 4 for image creation, and Lyria 2 for music composition. They've also developed Flow, an AI filmmaking tool, and are working on world modeling AI models with applications in gaming and entertainment. These advancements aim to empower artists and creators.

Speaker 1

所以这个Gemini Live功能,安卓手机上也有的对吧?是的。其实我前段时间在手机上发现了这个功能,一直在频繁使用。咱们来试试看。稍等。

So this stuff, this Gemini Live stuff, you also have it on Android phones, right? Yes. So actually, I spotted this on my phone a little while ago and I've been using it extensively. Let's give it a go. Hang on.

Speaker 1

Gemini,你现在能看到什么?

What can you see right now, Gemini?

Speaker 4

好的。我看到一位金发女士坐在桌前,面前放着麦克风。她双臂交叉面带微笑。桌上有电脑键盘鼠标,背景处有一盏台灯。

Okay. I see a woman with blonde hair sitting at a desk with a microphone in front of her. She has her arms crossed and is smiling. There's a computer keyboard and mouse on the desk and a lamp in the background.

Speaker 0

太神奇了。

It's amazing.

Speaker 4

还能告诉我些什么?

What else can you tell me about?

Speaker 1

这个功能诊断我生病的家养植物特别有用,真的难以形容。

That has been so useful in diagnosing my sick house plants. I can't tell you.

Speaker 0

哇哦。就像是...它就像有个实时解说员跟着你日常生活一样酷。还有件特别特别有趣的事是把它当作编程搭档。我经常用谷歌的Colab工具——这是个基于Python代码的笔记本环境,可以运行代码。

Oh, wow. It's also just been so cool to have something that feels like a real-life commentator as you go about your daily life. There's been something really, really fascinating too about using it as kind of a pair programmer. One thing that I commonly use this for is we have a tool at Google called Google Colab, which is a notebook-based environment with Python code. You can run code.

Speaker 0

可以执行代码。而且Gemini还理解Colab功能。比如想分享这个标签页时,你甚至可以说'Gemini,现在屏幕上看到什么?如果看到代码请解释其功能'。

You can execute it. But Gemini also understands Colab features. So if we want to share this tab instead, you could even say, hey, Gemini, what do you see on the screen now? And if you see any code, please explain what it does.

Speaker 1

哦不错,是的。

Oh, nice. Yeah.

Speaker 3

我看到屏幕上有个Google Colab笔记本,标题是《充分利用您的Colab订阅》。内容似乎是关于优化Colab使用的技巧,重点在更快GPU和更大内存。可见两段代码片段:第一段检查是否连接GPU并显示GPU信息;第二段用psutil库检测运行时可用内存总量,并以GB为单位显示。

I see a Google Colab notebook on the screen titled Making the most of your Colab subscription. It seems to provide tips for optimizing Colab usage, focusing on faster GPUs and more memory. There are a couple of code snippets visible. The first one appears to check if a GPU is connected and display the GPU information. The second snippet uses the psutil library to determine the total amount of RAM available in the runtime and presents that in gigabytes.
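The two snippets Gemini describes can be sketched in a few lines of Python. This is a reconstruction based only on the transcript's description, not the actual notebook code: the GPU check is assumed to shell out to `nvidia-smi`, and the RAM check uses psutil as named in the transcript, with a stdlib fallback in case psutil is not installed.

```python
import subprocess

# GPU check: print nvidia-smi output, or a notice if no GPU is attached.
try:
    gpu_info = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    print(gpu_info if gpu_info else "Not connected to a GPU")
except FileNotFoundError:
    print("Not connected to a GPU")

# RAM check: the transcript names the psutil library; fall back to
# os.sysconf if psutil is unavailable in the current environment.
try:
    import psutil
    total_bytes = psutil.virtual_memory().total
except ImportError:
    import os
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

ram_gb = total_bytes / 1e9
print(f"Your runtime has {ram_gb:.1f} gigabytes of available RAM")
```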

Speaker 0

这就像拥有一个能理解所有事情的得力助手。

So it's like having a helpful assistant that somehow understands every single thing.

Speaker 1

任何你可能正在查看的内容。

That you can possibly be looking at.

Speaker 0

完全正确。而且它还能以同理心回答你可能提出的所有问题。

Absolutely. And that also is capable of answering in an empathetic way all of the questions that you might have.

Speaker 1

这很棒。我已经和学生们在用Google Colab了。哦,太棒了。所以谢谢你。这会非常有用。

So this is good. I mean, I already use Google Colab with my students. Oh, amazing. So thank you for that. That's gonna be very helpful.

Speaker 1

是的。但对于不一定能自己编码的人,AI Studio里还有另一个选择。

Yeah. But for people who can't necessarily code themselves Yes. There's another offering within AI Studio.

Speaker 0

没错。除了能将Gemini API嵌入到像Cursor、Windsurf、Copilot或其他许多编码IDE中——如果你尝试写代码的话,我们还推出了一个叫'新构建功能'的东西,让你可以直接用Gemini构建应用。'用Gemini构建应用'这部分独特之处在于,所有在这个用户界面中生成的代码都针对最新的SDK和我们用于与Gemini模型交互的最新代码进行了超优化。

Absolutely. So in addition to being able to take the Gemini APIs and embed them within tools like Cursor or Windsurf or Copilot or any of the other many, like, coding IDEs that you might have access to if you're attempting to write code, we also have something called the new build feature that allows you to build apps directly with Gemini. The build apps with Gemini section is unique in that all of the code that gets generated within this UI is hyper optimized for the latest SDKs, the latest kind of code that we have for interfacing with the Gemini models.

Speaker 1

换句话说,即使你这辈子从未写过一行代码,只要你在这里提示Gemini为你构建一个应用,你就能得到漂亮的代码。

So just to put that another way. Yeah. Even if you have never written a line of code in your life. Yes. You are gonna have slick code if you just prompt Gemini here to build you an app.

Speaker 0

完全正确。而且它会使用最新的模型和功能,生成非常健壮的TypeScript代码,并在过程中解决任何错误。所以如果模型遇到问题,它能循环回溯,修复错误,让模型实现的应用保持清晰连贯的状态。

Absolutely. And it will be using the latest models, the latest features from the models. It will be generating really, really robust TypeScript code, and it will also be resolving any errors along the way. So if the model hits any problems, it's able to cycle back through, fix the error to get the model's kind of implementation of an app in a clear and coherent state.

Speaker 1

自修复代码。

Self healing code.

Speaker 0

自修复代码。而你只需要描述你想要什么。

Self healing code. And all with you just having to describe what you would like to see.
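The "self-healing" behaviour Paige describes (generate code, run it, feed any error back to the model, retry) can be sketched as a simple control loop. Everything below is hypothetical: `call_model` is a stub standing in for a real Gemini API call, and the loop only illustrates the generate-run-repair cycle, not the actual product implementation.

```python
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call. It returns buggy code
    # on the first attempt, then a fix once it "sees" the error message.
    if "NameError" in prompt:
        return "result = 2 + 2\nprint(result)"
    return "print(result)"  # bug: `result` is never defined


def self_healing(task: str, max_attempts: int = 3) -> str:
    prompt = task
    for _ in range(max_attempts):
        code = call_model(prompt)
        try:
            exec(code, {})  # run the generated code
            return code     # success: the code executed cleanly
        except Exception as e:
            # Feed the error back so the next attempt can repair it.
            prompt = f"{task}\nPrevious code failed with: {type(e).__name__}: {e}"
    raise RuntimeError("could not produce working code")


working = self_healing("compute 2 + 2 and print it")
```

The key design choice is that the model never needs to be right on the first try; the runtime error itself becomes part of the next prompt.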

Speaker 1

好的。我过去确实建过一些网站,没错。说实话,即便用那些号称能提速的工具,以前也得花上好几周、好几周、好几周的时间。

Okay. So I am someone who has built websites in the past, Yes. And honestly, even with the tools that are supposed to make it quicker, this would take weeks and weeks and weeks.

Speaker 0

完全同意。

Absolutely.

Speaker 1

而且我做的网站还很糟糕。不过这对开发者意味着什么?

And my websites were rubbish. What does this mean for developers though?

Speaker 0

我认为对开发者来说,这意味着他们能更专注于构建、构思和产品体验,而不是像现在这样把大量时间花在那些并不令人兴奋的事情上——比如升级代码库版本、给仓库添加类型注释、定期检查安全漏洞确保代码不受威胁。这些工作就像打扫公寓保持应用可持续性一样缺乏乐趣。这类AI模型的潜力在于,它们将为开发者创造更多机会去构建更具野心的系统。

I do feel like for developers, this means that they can focus more on building and ideation and kind of the product experience, as opposed to the daily life of the developer right now, which is a lot of things that aren't necessarily the most exciting in the world. So you might be upgrading a code base from one version to another, or you might be adding typing to a repository, or you might need to periodically check for security vulnerabilities and make sure your code base isn't susceptible to them. And all of these things, you know, they aren't joyful. It's kind of like tidying up an apartment to make sure that your app is sustainable and maintained. I think one of the promises of these models and these kinds of capabilities is that there are so many more opportunities for software developers to really build more ambitious systems.

Speaker 0

更重要的是,它为更多人打开了学习大门,让更多人能受到创作过程的启发,把自己的想法变成现实。

And even more importantly, it opens up the door for more people to kind of learn and get inspired by this whole process of creating and getting something out into the world.

Speaker 1

那你觉得这些工具组合起来,会从根本上改变我们对人类创造力的认知吗?

Do you think all of these tools together then, do you think that they will sort of fundamentally change the way we think about human creativity?

Speaker 0

毫无疑问。人类创造力即将迎来爆发式进步,未来人人都能成为创作者——不仅是单一领域的创作者,还能跨学科拓展。想想看,物理学家、化学家、生物学家、历史学家或音乐家突然都能将自己的想法或项目转化为可分享的数字形态,这简直充满魔力。

Absolutely. Like, human creativity is about to have this explosion of progress, and there's this promise of everyone being able to become a creator and not just a creator in one specific discipline, but to also be able to expand that out into many other disciplines. So I think that there's real magic in having people who are scientists, maybe in the physical sciences or chemistry or biology, or people who are historians or musicians, being able to suddenly, like, get all of their ideas or their projects into a digital form that could be shared with others.

Speaker 1

佩吉,这太令人着迷了。非常感谢你的分享。

Paige, absolutely fascinating. Thank you so much for joining me.

Speaker 0

很荣幸参与,谢谢邀请。

It was awesome to be here. Thank you so much.

Speaker 1

有了这些新工具,突然感觉所有碎片都拼合了。要知道,我们之前采访的研究人员都是分别探讨音频、视频、语言等单项技术。但当这些元素被无缝整合时,产生的化学反应截然不同。我来这里采访多年,见证了整个发展历程,可即便现在看着这些工具落地、想象着可能性,我依然感到无比震撼。

With these new tools, it feels like suddenly everything has clicked together. You know, until now, we've spoken to researchers about every individual element of these new releases, audio, video, language. But there is something so different about having all of those elements integrated and working seamlessly together. And I have been coming here for years, right? I have spoken to the people all along the way, but even still seeing these tools come to life and imagining the possibilities, I still feel completely wowed.

Speaker 1

老实说,我现在迫不及待想跳上回家的火车,去尝试那些我一直想构建但从未有时间实现的东西。您正在收听的是由我——汉娜·弗莱教授主持的《谷歌DeepMind播客》。如果您喜欢本期节目,请订阅我们的YouTube频道或在您喜爱的播客平台上留下评论。当然,我们还将推出更多涵盖各类主题的节目,敬请关注。下次见。

And to be honest with you, I'm now itching to get on the train home and just try out all of the things that I have always wanted to build but never had time to. You have been listening to Google DeepMind: The Podcast with me, Professor Hannah Fry. If you enjoyed this episode then do subscribe to our YouTube channel or leave a review on your favourite podcast platform. And of course, we have plenty more episodes on a whole range of topics to come, so check those out. See you next time.
