本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
我认为在Genie三代的研发中,我们确实尝试在所有维度上突破极限,对吧?我们看到模型的生成质量更强大。如果你关注分辨率、交互时长、下一帧的生成速度这些指标,会发现显著提升。若将这些维度综合考量,整体改进幅度可达百倍。
I think in Genie three, we really tried to push it to the limit across all of the dimensions, right? So we see that we have models that are more capable in terms of the quality of their generation. And if you look at the resolution, the duration of the interaction, how fast the next frame can be generated — if you multiply all of those dimensions together, you get quite a significant, 100x improvement.
大家好,欢迎收听 TWIML AI 播客新一期节目。我是主持人Sam Charrington。今天邀请到谷歌DeepMind的研究员Shlomi Fruchter和Jack Parker-Holder,讨论最新发布的Genie三代模型——这个令人印象深刻的世界模型最初正是通过一年前与Ashley Edwards的访谈在本播客首次亮相。我非常期待深入探讨Genie三代。
Alright, everyone. Welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Shlomi Fruchter and Jack Parker-Holder, researchers at Google DeepMind, to discuss the recent release of the Genie three model, an impressive world model that we first introduced to you here on the podcast in our conversation with Ashley Edwards almost exactly a year ago. I'm super excited to dig into Genie three.
Jack和Shlomi,欢迎来到播客。
Jack and Shlomi, welcome to the podcast.
谢谢邀请,很荣幸参与。
Thank you. Thanks for having us. Excited to be here.
这次访谈很有挑战性,因为我们不久前才讨论过Genie,但我不想预设听众都听过那期节目或了解这个项目。所以我们会从基础概念开始,探讨这个项目的起源与价值所在。不过在深入之前——
This is an interesting interview to try to dig into because we covered, you know, Genie relatively recently, but I don't want to assume that people, you know, have listened to that interview or even know about Genie at all. So, we're gonna start a little bit from the beginning and dig into the project and where it comes from and why it's exciting. I'd love
想请两位先做个自我介绍,分享
to have each of you just introduce yourselves before we dig in though and share
你们进入机器学习研究领域的历程,以及当前最感兴趣的研究方向。
kind of the highlights of your path to ML research and what you're most excited about, you know, what you research, that kind of thing.
Jack,不如你先开始?
Jack, why don't you get us started?
好的。说来可能有点奉承——大约十年前我在金融行业工作,利用晚间业余时间攻读硕士学位,2017年前后经常收听你们的播客。那时我开始接触ML研究,主要研究强化学习中的进化方法,因在纽约办公还曾与谷歌Brain团队合作。后来我攻读博士学位,专注开放式学习方向的强化学习,逐渐涉足世界模型领域。
Awesome. Thanks. So without being too much of a suck up, my path was that I was working in finance just around ten years ago, and I did a master's part time in the evenings after work, and was actually listening to your podcast quite a lot around 2017. I got into ML research at the time, actually doing evolutionary methods for reinforcement learning, working with some folks from Google Brain as well, because I was in New York and there was an office there. And then I decided to do a PhD where I focused on open-ended learning — reinforcement learning still — and then got a bit into world models.
博士毕业时,我愈发确信这些理念的结合将产生重大突破。加入谷歌DeepMind后,我先参与了Xland环境下的Adaptive Agents项目,之后便主导Genie项目至今。我所在的开放式研究团队虽涉及多个领域,但通过Genie项目,我们始终探索以世界模型实现开放式学习的路径。
And then by the time I finished my PhD, I was increasingly convinced that the combination of these ideas would be the really powerful big thing to do. So after my PhD, I joined Google DeepMind and worked a little bit on a project called Adaptive Agents, which was in the XLand environment. And then after that, pretty much started Genie. So I've been doing that for a few years. I'm in the open-endedness team, which encompasses some other areas too, but we've been focusing on this idea of using world models as a path to open-endedness as part of the Genie project.
太棒了。太棒了。你呢,Shlomi?
Awesome. Awesome. How about you, Shlomi?
其实我最初的编程经历始于青少年时期,那时我在开发游戏引擎。具体来说是3D引擎,更多是源于尝试模拟各种效果的世界,比如光照效果、液体效果等等。我在这个领域工作了一段时间,非常喜欢这种视觉领域。后来我加入了谷歌,成为Google Duplex团队的一员。我们在Duplex项目中所做的,基本上与视觉内容大相径庭,因为它主要是通过电话完成各种任务。
So my first programming experience, actually, was in my teen years, developing game engines — 3D engines — coming more from this world of trying to simulate effects: lighting effects, liquid effects, etcetera. So I'd been working in this space for some time, and I really like this visual domain. Then I joined Google and was on the Google Duplex team. What we did in the Duplex project was pretty much very different from visual stuff, because it was mostly getting stuff done over the phone.
Duplex项目,如果人们还记得2018年Google I/O大会的话,当时大家的反应是‘哇,这简直像是AGI(通用人工智能)的时刻’。虽然我不认为它达到了那个程度,但确实是个了不起的成就。
So the Duplex project — if people remember it from Google I/O 2018 — people thought, okay, wow, it was one of those "is it AGI?" moments. I don't think it was, probably not, but I think it was definitely a nice thing.
具体来说,这个项目就是谷歌会代表用户打电话给餐厅和发廊预约,对吧?
And specifically, this was the project where Google was going to call restaurants and hairdressers to make appointments on your behalf, right?
是的。当我们启动Duplex时,目标其实是探讨一个问题:我们能否在今天就打造一个通过电话与人交流的机器人,且让对方察觉不到这是机器?某种程度上可以说是电话版的图灵测试。我们发现,虽然当时要实现完全通用的电话对话还不可能(直到最近大语言模型出现才改变),那时还是RNN、LSTM的时代,远没有Transformer。但我们成功开发出了至少能高效完成特定任务的产品。
Yeah. So when we started Duplex, the goal was actually a question of: can we build, already today, a bot that talks over the phone with people without them feeling this is actually a machine, right? Hitting this kind of over-the-phone Turing test, if you want. And what we found is that although having a completely general conversation over the phone was not achievable until very recently, with LLMs — that was an era of RNNs, LSTMs, and definitely not transformers — we were able to develop something that, at least for this particular setting, accomplished a lot of tasks.
这算是我第一次接触机器学习,非常有趣,因为它既深入——具有很强研究性质,又实际部署在现实世界。我们最终将其扩展到15个国家,处理了数亿通电话。
So that was kinda like my first touch with machine learning, and it was very interesting because it was also deep — it was very research oriented — but it also had the deployment in the real world. We ended up scaling that to hundreds of millions of calls overall, across like 15 countries.
哇哦。
Oh wow.
确实做了很多,虽然知道的人不多,因为这些电话主要是为更新谷歌地图而打给商家的。后来我也持续关注LLM领域的发展,比如GPT,谷歌内部早期就使用Meena和LaMDA等模型。我们将这些技术整合到Google Duplex中,但某天我感觉自己的视觉基因又开始呼唤我——图像扩散模型的革命实在太吸引人了,感觉这是个巨大的机会,技术正趋于成熟。
We did quite a lot — fewer people are aware of it, because it was mostly calling businesses to update Google Maps. And of course I also followed the things that happened with LLMs, from GPT onward, and internally at Google we had Meena and LaMDA and other models very early on, and we integrated this technology into Google Duplex. But at some point I just felt that my visual roots came into play again, and the revolution in image diffusion models was really appealing to me. I just felt like this is a huge opportunity — it's hitting some kind of point of maturity.
之后我开始研究视频模型。其中一个产物是GameNGen这个副项目,我和几个朋友(其中一位是Duplex联合创始人)提出了一个问题:能否完全用神经网络实时模拟一个现有游戏?此外我也参与了Veo、Veo 2和Veo 3的开发。
Since then I've been working on video models. And one incarnation of that was GameNGen, which was a bit of a side project. We were very excited — me and a few friends, one of whom was on Duplex with me, actually a founder of Duplex — and together we basically asked the question: is it possible to simulate an existing game, but in real time, completely by a neural network, right? And then I've also been working on Veo, Veo 2, and Veo 3.
在GameNGen时期,我开始与Jack交流,对Genie这条研究线印象深刻,这个话题我们后续可以详谈。
And around the time of GameNGen, I started talking to Jack. I was very impressed with the Genie line of work, and we can talk more about that, of course.
关于Genie最令人兴奋的一点,就是这个世界模型的概念。杰克,或许你可以深入谈谈这对你意味着什么,以及你如何看待世界模型融入更广泛的人工智能模型发展轨迹,比如基于Transformer的模型,或者你对此的任何思考。对你而言,世界模型这一理念究竟包含了哪些内容?
One of the things that people are most excited about in looking at Genie is this concept of a world model. Maybe, Jack, you can dig into what that means for you and how you see the world model fitting into the broader trajectory of AI models, transformer-based models — however you think about that. What all is captured in this idea of a world model for you?
当然。这是个很好的问题,从某个层面来说很容易回答,但也可能涉及一整本不同哲学观点的书。就我个人而言,世界模型的定义最近其实略有变化。但大约一年前我会说,世界模型本质上是强化学习范式中对MDP(马尔可夫决策过程)进行建模的模型。
Sure. Yeah. That's a great question, which is, I guess, on one level quite easy to answer, but could also fill a whole book of different philosophies. So for myself, the definition of a world model has actually slightly changed recently, I think. But what I would have said until about a year ago is that a world model is essentially a model from the reinforcement learning paradigm that models an MDP.
它通过当前状态和行动来预测下一状态。这个概念由来已久,世界模型其实就是基于模型的强化学习中的环境模型——它在模拟环境本身,对吧?
So it models taking the state and action, and it predicts the next state. And this idea has been around for a while. A world model essentially is the model in model based reinforcement learning. It's modeling the environment. Right?
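That definition — a learned stand-in for the environment's (state, action) → next-state transition function — can be sketched in a few lines. Everything below (the toy corridor, the class name) is illustrative, not anything from the work discussed here:

```python
# A world model in the classic RL sense: a learned substitute for the
# environment's transition function (state, action) -> next state.
# Toy sketch: a 1-D corridor whose dynamics we "learn" by tabulating
# observed transitions from logged data.

from collections import defaultdict

class TabularWorldModel:
    """Predicts the next state from (state, action) pairs seen in data."""
    def __init__(self):
        self.table = defaultdict(int)

    def fit(self, transitions):
        # transitions: iterable of (state, action, next_state)
        for s, a, s_next in transitions:
            self.table[(s, a)] = s_next

    def predict(self, state, action):
        return self.table[(state, action)]

# True dynamics of a 5-cell corridor: +1 moves right, -1 moves left.
def true_step(s, a):
    return max(0, min(4, s + a))

# Collect offline experience, then query the model instead of the env.
data = [(s, a, true_step(s, a)) for s in range(5) for a in (-1, +1)]
model = TabularWorldModel()
model.fit(data)

print(model.predict(2, +1))  # -> 3, matching true_step(2, +1)
```

Once the model matches the true dynamics well enough, anything downstream (planning, policy learning) can query `predict` instead of stepping the real environment.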
早在九十年代初,Jürgen Schmidhuber就有关于循环世界模型的论文,Rich Sutton的Dyna论文也差不多同期发表。这算是基于模型的强化学习研究的起点。而真正让我产生共鸣的是2018年Ha与Schmidhuber的世界模型工作,那时我开始对这个领域产生兴趣。当时我正在研究各种强化学习任务。
And so in the early nineties, I think, Jürgen Schmidhuber had a paper about recurrent world models. And the Dyna paper from Rich Sutton came out around that time as well. That was kind of the starting point for this direction of research, model-based reinforcement learning. And then for me, it was really the Ha and Schmidhuber world models work in 2018 — around the time I mentioned I was getting interested in this field in general — that really resonated. And I was looking at sort of reinforcement learning tasks.
那时候都是些MuJoCo任务,比如HalfCheetah这类。读到世界模型论文时,它基本是说用少量离线样本就能模拟环境——只要足够准确地预测下一状态,就能在这个模型里训练策略,再迁移回真实环境。而这些策略从未在真实环境中训练过。
Right? And they were at the time these MuJoCo tasks, like little HalfCheetah and these kinds of ones. And then looking at the world models paper, it was basically saying that with a few offline examples, we could actually simulate an environment — predicting the next state sufficiently well that we could then train policies in that model and then transfer them back to a real environment. And the policy had never trained in the real environment.
这显然非常酷,但他们当时的设定是你已拥有真实环境的数据。所以尽管能在模型里训练很厉害,理论上你直接在真实环境训练也能得到相同结果,对吧?
And obviously, that's super cool, but the setting that they had was basically that you had data from the real environment. So although it was really cool that you could do this, you could also just train in the real environment and get the same result in theory. Right?
但首先需要收集那些数据。
But you first needed to collect that data.
没错。但你是能做到的,对吧?
Exactly. Yes. But but you could. Right? You had the ability to do it.
这就是世界模型前几年的范式——证明我们能在日益复杂但已知的环境里做到这一点。如果我们不想用基于模型的强化学习或世界模型,完全可以用分布式RL算法通过其他方式获得很好效果。虽然世界模型方法仍有优势(比如样本效率更高),但并没有解决原本无法解决的问题。
And that was kind of the paradigm that world models were in for a few years: showing that we could do this at all for increasingly complex environments where we actually had the environment. So if we didn't want to do model-based reinforcement learning and didn't want to use world models, we could have just used a distributed RL algorithm and probably got really good performance a different way. There were still benefits to doing the world model approach — maybe it was more sample efficient — but it wasn't solving something you couldn't otherwise solve. Right?
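The loop Jack describes — learn a model of an environment you already have, train a policy purely inside the model, then deploy it in the real environment — can be sketched in miniature. The toy corridor and the brute-force planner below are my own illustrative stand-ins, not anything from the papers discussed:

```python
# Model-based RL in miniature: train entirely inside a learned model,
# deploy in the real environment the policy never touched during training.

# Real environment: a corridor 0..4; reaching cell 4 gives reward 1.
def env_step(s, a):
    s_next = max(0, min(4, s + a))
    return s_next, 1.0 if s_next == 4 else 0.0

# "Learn" the model by tabulating offline transitions (reward function
# assumed known here, a common simplification in toy examples).
model = {(s, a): max(0, min(4, s + a)) for s in range(5) for a in (-1, +1)}

# Brute-force search inside the model only: for each start state, find the
# first action of the action sequence that reaches the goal fastest.
def plan_in_model(s, horizon=6):
    best = None
    for bits in range(2 ** horizon):
        actions = [(-1, +1)[(bits >> i) & 1] for i in range(horizon)]
        state = s
        for t, a in enumerate(actions):
            state = model[(state, a)]
            if state == 4:
                if best is None or t < best[0]:
                    best = (t, actions[0])
                break
    return best[1] if best else +1

policy = {s: plan_in_model(s) for s in range(5)}

# Deploy in the *real* environment, which the policy never trained in.
s, total = 0, 0.0
for _ in range(6):
    s, r = env_step(s, policy[s])
    total += r
print(total >= 1.0)  # -> True: the policy reaches the goal
```

The catch Jack points out applies here too: since we needed real-environment data to fit `model` in the first place, we could have trained in the real environment directly — the sample-efficiency benefit, not new capability, is the win.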
不过回到Shlomi关于文生图模型的评论——我们当时从不同角度为此兴奋。对我来说这意味着:既然文生图能做到这么好(大概是四年前Imagine模型问世时),那么视频生成迟早也会实现。
But then, going back a little bit to Shlomi's comment about text-to-image models — I think we were both excited about that at the same time, from different angles. Right? And for me, that was sort of saying, look, if we can do text to image this well — and I think it was, what, like four years ago maybe that Imagen came out — then, okay, video will happen at some point.
那么在视频之后,也许我们就能构建世界模型了,对吧?我们将能够用大型数据集模拟任何事物。所以基本思路是,能否将同样的概念应用到我眼中的世界模型——即模拟一个环境,从而我们可以模拟任何环境。
And then after video, maybe we'll be able to get to world models. Right? And we'll be able to simulate anything with large datasets. And so the idea was basically: can we take the same concept and apply it to what I would view as a world model — simulating an environment — and therefore simulate any environment.
对吧?因此我们提出了基础世界模型的概念。它类似于世界模型,但不仅限于单一环境,而是像基础模型那样适用于任何可能或新的环境。这是我们坚持的严格定义:基础模型就是这样的存在——给定行动时预测下一状态,这是我们采用的定义方式。但最近,其他类型的模型也逐渐被视为世界模型。
Right? And so that's why we came up with this idea of foundation world models. So it was like a world model, but rather than just being for one single environment, it was any possible environment or new environments, like a foundation model. And this was kind of a hard line we stuck to: a foundation world model is this one thing — it predicts the next state given actions — and that's the definition that we took. But then increasingly, other kinds of models have been considered to be world models recently.
例如文本生成视频模型。起初我觉得这与我对世界模型的理解不符,但实际上它在不同抽象层面上确实契合。现在我对世界模型有了更宽泛的看法:这些模型根据过去和某种形式的行动来模拟未来,模拟世界的动态变化,对吧?
So for example, text-to-video models. And at first, that didn't really fit in with how I saw world models, but actually, I think it kind of does, at a different level of abstraction. Right? So I think now I have a slightly broader view of world models, which is models that simulate the future given the past and some form of actions, and simulate the dynamics of a world. Right?
所以它并非显式建模MDP的每个状态转移(这是我之前的观点),而是以某种方式模拟世界动态,使你能在其中行动干预并获取反事实信息。这支持了规划、策略学习等仿真中的行为。
So it's not explicitly modeling every transition of an MDP, which is what I would have said before, but maybe just simulating the dynamics of the world in some way that you can act and intervene in it and then get counterfactual information. And that enables things like planning or learning policies in simulation.
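The "act and intervene and get counterfactual information" point lends itself to a tiny example: with a dynamics model (hand-written below as a stand-in for a learned one), you can compare alternative futures from the same state without ever touching the real world:

```python
# Counterfactual queries against a dynamics model: "what would happen if I
# did A instead of B?" — answered entirely in simulation. The corridor
# dynamics here are a hand-written, illustrative stand-in for a learned model.

def step(s, a):
    return max(0, min(4, s + a))

def rollout(s, actions):
    """Simulate a sequence of actions from state s, returning all states."""
    states = [s]
    for a in actions:
        s = step(s, a)
        states.append(s)
    return states

start = 2
# Two counterfactual futures branching from the same state:
go_right = rollout(start, [+1, +1])
go_left = rollout(start, [-1, -1])
print(go_right, go_left)  # -> [2, 3, 4] [2, 1, 0]
```

A planner is then just a loop over such rollouts, scoring each imagined future and picking the best first action.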
Shlomi,你对世界模型的认知是否也经历过类似的转变?
And Shlomi, did you have the same kind of shift in the way you thought about world models?
是的,我更多是从仿真或视觉角度思考。我认为当前所谓的世界模型非常视觉导向,这也是现有模型的局限之一。但回顾Jack提到的世界模型论文,其定义非常清晰,我认为这是个很好的起点和锚点——毕竟这个术语确实被广泛混用。不过其中也存在某种直觉:当我看到视频模型或图像模型时,当你输入文本提示后,那些像素背后仿佛存在着一个世界,对吧?
So yeah, I'm coming more from, I think, maybe the simulation side, or I think more about the visual. I think what we call world models today are very visually specific, which I think is also one of the limitations of current world models at the moment. But if we go back to the world models paper that Jack mentioned, the definition was pretty clear, and I think it's a good definition to start from and to anchor to, because the term is definitely used in many, many ways. But I think there is also a bit of an intuition to it — which of course can be formalized — but at least from my perspective: when I started seeing video models, but also image models, where you provide a text prompt and then you see the output, it feels like there is a world behind those pixels. Right?
这就是直觉部分——要生成逼真的图像或视频,模型内部必然需要对世界运行方式、物理规律等有一定表征。用外行话说,这就是世界模型的直觉:模型对世界有所理解,而且是视觉化的。
I think that's, if you want, the intuitive part. For things to really look like a realistic image or video, the model probably has to have some internal representation of what's going on, how the world behaves — of physics, to some extent. Right? And that's, maybe in layman's terms, the intuition behind the world model: the model probably has some understanding of the world, and it's visual.
但关键是,世界模型不必局限于视觉,不必是生成像素的。在文献中,世界模型也可以存在于某个潜空间。
And I think that's key. But the world model doesn't have to be visual. It doesn't have to be something that generates pixels — that's the broader view. So I think in some of the literature, the world model can be in some latent space.
世界模型应该是我们能用来决策、预测后续发展,并可能在强化学习框架中进行规划以优化环境交互的工具。回到视觉领域——我认为突破在于扩散模型对图像、视频和音频的卓越适配性,使得这些领域的交叉变得极其显著。我们现在正处于这个阶段:拥有能生成高度真实环境的模型,并持续推动其发展。
It just should be something that we can use to make decisions and predict what's going to happen next, and maybe perform planning — in the RL sense, learn basically how to operate in a more optimal way in an environment. But again, going back to the visual domain, what I think happened is that the visual domain really worked well because diffusion models just happened to work really well for images and videos, and audio as well. And then this intersection between those fields became very obvious. I think that's where we are right now: we have models that are capable of generating very realistic environments, and we're trying to push along this axis.
Jack,你也参与了Genie一代和二代的研发。能否谈谈项目的发展轨迹?关于Genie三代,最让我惊叹的是——我记得和Ashley交谈时,下意识以为这个世界是可实时游玩的,结果发现每帧渲染需要约20分钟(具体记不清了),所以并非实时可玩。
Jack, you worked on Genie one and two as well. I'd love to have you talk a little bit about the trajectory of the projects. And maybe for a little bit of context: what's most amazing for me about Genie three is that I remember the conversation with Ashley, and in my mind, I made the assumption that this world was real-time playable. And it was like, oh no, it takes — I don't remember what the specifics were — like twenty minutes between frames. So it's not real-time playable.
你看,现在我在看这个东西,短短一年内就实现了完全实时可玩。不过,能否谈谈这个项目在不同迭代版本中,那些重大的功能性和研究里程碑是什么?
And now I'm looking at this thing that, in just a year, is fully real-time playable. But talk a little bit about what the big milestones, both functional and research, have been between these iterations of the project?
当然。我们确实刻意强调过前两代作品并非实时系统。挑战在于——当你发布新研究时,总不能开场就说"欢迎来到我们的非实时突破"对吧?这样不够吸引人,没人会以局限性作为开场白。
Sure. Yeah. I definitely think we did try to emphasize that it wasn't real time in both of the previous works. The challenge is that if you're announcing a new piece of research, you're not going to say "welcome to our non-real-time breakthrough," because that's not the best way to make it sound exciting. You don't want to just open with a limitation.
但如果你把局限性藏在演示视频后面,观众根本注意不到。所以我们当时确实面临这个难题。不过实际延迟其实不到二十分钟。
But then if you put the limitation halfway down and you've got a bunch of cool videos, people don't really get that far. Right? So yeah, we definitely had this challenge. But I will say it was slightly less than twenty minutes.
我记得只有几秒钟。但想想文本生成图像模型...
I think it was only a few seconds. But if you think about like text to image models
我当时在保守估计上确实过度谨慎了。
I totally over rotated on the conservative side there.
我欣赏这种态度,毕竟降低预期超额交付更好。不过我们可能反其道而行了。本质上,文本图像模型需要达到每帧约0.05秒的速度才能媲美Genie三代——我们在这方面确实取得了惊人突破。话说回来,我强烈建议听众也去听听与Ashley Edwards的那期节目。
I like that, because it's better to set the bar a bit lower and under promise and over deliver. But I think we did it the other way, maybe. Essentially, text-to-image models would need to operate at, like, a twentieth of a second to be at the same kind of speed as Genie three. So I think it is quite remarkable what we did there. But going back in time a little bit — I thoroughly recommend all your listeners listen to the Ashley Edwards episode as well.
初代Genie与第三代截然不同,它更像是基础模型概念的首次验证——一个能生成新世界的模型。要知道此前世界模型的研究都集中在单一领域。
Essentially, Genie one was quite a different beast to Genie three. It was really the first kind of proof of concept of this foundation world model idea. Right? A model that could generate new worlds. Right?
从Ha与Schmidhuber的开创性工作,到Dreamer系列三代的演进,这些模型在单个环境建模方面突飞猛进,证明了其能让智能体学会解决复杂任务的能力——这是单领域复杂度的重大突破。而Genie一代的核心命题是:我们不关心环境复杂度,只问能否训练出生成全新世界的模型?关键难点在于数据——生成新环境时,我们根本没有目标环境的动作标注数据。
So there'd obviously been amazing progress in world models for single domains, starting with the Ha and Schmidhuber world models work and then things like Dreamer, DreamerV2, and DreamerV3, which could model increasingly complex individual environments and show that you could use them for agents to learn amazing behaviors that could solve really complex tasks. And that's really important work in one axis, right — single-domain complexity. Whereas Genie one was basically saying: we don't care about the complexity of the environments per se; can we train a model that can generate new environments and new worlds at all? And the challenge with that was on the data side, because unlike training from existing environments, we don't have action-labeled data from your target environment if you're going to be trying to generate new things.
于是我们收集了无标注视频数据集,创新性地采用潜在动作学习法。这要归功于我职业生涯中多次遇到的幸运——正好遇到技能完美互补的合作伙伴。Ashley研究潜在动作学习超过五年,是该领域的先驱,只不过她原先的研究方向是...
So we basically collected a dataset of unlabeled videos that had no action labels, and we had this kind of neat approach where we learned latent actions. And the really nice thing about that was that it was an example of — and I've been very fortunate with this a few times in my career — bumping into someone who had almost a perfect skill set for the thing I was excited about doing and the things I didn't know about. So Ashley had worked on latent action learning for maybe more than five years. Right? She is basically one of the ones who really pioneered that direction, but in a different context.
从视频中提取行为进行模仿学习,需要无监督的潜在动作识别。而我们的应用场景恰好与之相反。
So she was working on it to learn behaviors from videos. Right? So you take videos and you want to extract the actions so you can behavior clone or do imitation learning from those videos. You need to learn, like, unsupervised latent actions so you can do that. Whereas we had the opposite kind of setup.
我们并非试图从视频中学习行为,而是尝试从视频中学习一个世界模型,然后利用这个世界模型来学习策略。对吧?这是一种不同的框架。CVPR曾有一篇名为Playable Environments(可玩环境)的论文做了类似工作,作者是Menapace等人。
We, instead of trying to learn behaviors from the videos, were trying to learn a world model from the videos, and then we could use the world model to learn policies. Right? So it's kind of a different framing. And there had been a paper at CVPR called Playable Environments, which did a similar thing. It was Menapace et al.
这篇论文发表时我正在撰写博士论文。我当时思考着,如何从网络视频中构建这些世界模型?或者说,我们怎样才能实现这个目标?看到那篇论文后,我想这至少是一条可行路径。后来与Ashley交流时,她也认为这个思路确实合理。
And that came out as I was writing my PhD thesis. And I was thinking, like, how am I gonna do how are we gonna do these world models from Internet videos? Or, like, how are we gonna make this possible? And then I saw that paper, and I thought, well, that's at least one path. And then basically got chatting to Ashley, and and she was thinking, well, actually, that does make sense.
我们完全可以采用这种方法,这将是个激动人心的方式,能让后续动作在比之前尝试过的更大规模上运作。对于Genie一代,这算是个相当新颖的想法——我们甚至不确定能否实现,且其范围有限。我们实际训练了两个模型(这点常被忽视):一个基于2D平台游戏数据集,另一个基于RT-1论文中的机器人数据集(那个日常机械臂研究)。两种情况下,都能通过无监督学习获得动作空间,使模型能处理训练数据外的全新图像。
We could definitely go for this approach, and it'd be a really exciting way of getting the latent actions to work at a larger scale than she'd previously tried. So for Genie one, it was kind of this quite new idea — could we do it at all — but it was limited in scope. We actually trained two models, which people don't really notice. We trained one model on a 2D platformer game dataset, and we trained another model on the robotics dataset from, I think, the RT-1 paper, which was this Everyday Robots arm. And in both cases, you could learn an unsupervised action space, which meant that you could give the model a new image that wasn't in the training data.
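The latent-action idea can be caricatured in a few lines: infer a small discrete action vocabulary purely from how consecutive observations change, with no action labels anywhere. (Genie 1's actual method is a learned VQ-style model over video frames; the toy below only illustrates the principle.)

```python
# Toy unsupervised latent-action learning: given only consecutive
# observations from an unlabeled "video" (no action labels), recover a
# discrete action vocabulary and use it to control a dynamics model.

# Unlabeled "video": an agent walking back and forth on a 1-D track.
frames = [0, 1, 2, 3, 2, 1, 2, 3, 4, 3]

# Step 1: the only signal is the change between consecutive frames.
deltas = [b - a for a, b in zip(frames, frames[1:])]

# Step 2: quantize transitions into discrete latent actions. A trivial
# codebook of distinct deltas stands in for a learned VQ codebook.
codebook = sorted(set(deltas))            # here: [-1, 1]
latent_actions = [codebook.index(d) for d in deltas]

# Step 3: a dynamics model conditioned on the latent action can now be
# driven like a gamepad, even though no action labels ever existed.
def apply_latent(state, z):
    return state + codebook[z]

print(codebook)            # -> [-1, 1]
print(apply_latent(4, 0))  # -> 3
```

The same trick is what lets an unseen prompt image become "playable": the latent codes were learned from videos, so they transfer to any frame the model can continue.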
可能是文本生成图像模型创建的图片,也可能是你自己拍摄的照片。然后你就能利用潜在动作将其当作世界来交互。听起来很神奇,但有几点限制:分辨率仅90p。
Maybe it was generated by a text-to-image model. Maybe it was a photograph you took yourself. And then you could, from that point, play it like a world using the latent actions. Now, that sounds amazing, but there were some caveats. So it was 90p.
规模非常小,且仅能维持几秒就会失真。此外当移出画面时,它不会真正生成新内容。我们采用掩码获取方法,但偏移时容易出现模式坍塌——比如平台游戏示例中向右移动时,只会延续平坦平台而非生成有趣的新场景。
So it's very small, and it lasted maybe a couple of seconds before it degraded. And then also, it didn't really generate anything new when you went off screen. Right? We used this MaskGIT approach, and it tended to have mode collapse when you moved away. So if you look in the platformer examples, when you move to the right, it just kind of continued a flat platform rather than generating exciting new content.
确实有人注意到了这点,我们心知肚明。但当时更觉得‘这居然能运行就很酷了’。这还是个非常早期的项目——说实话刚开始时,几乎没人觉得值得投入。
So people did actually notice that, but we were definitely aware of it. It was more just like, this is kind of cool that this remotely works. And it was very early stage project. Right? Like, essentially, when we started this, no one really thought it was worth doing.
幸运的是我们在Google DeepMind这样的地方,鼓励探索性项目。不过资源投入并不大,就几个人凑合着干。Genie一代我们写了论文,投稿去年夏天的ICML。
And luckily we're in a place like Google DeepMind where people encourage these more exploratory projects, but it wasn't a heavily resourced effort. Right? It was kind of a few of us scrapping together. So that was Genie one. For that one, we actually wrote a paper, and we submitted it to ICML, which came out last summer.
所以感觉上很新,因为从完成研究到论文发表有个周期。Genie二代的情况是:一代论文2024年2-3月发表时,我们已持有结果并开始规划未来,当时视频模型领域整体突飞猛进。
And so that's why I guess it feels very recent — there's a cycle between finishing the work and the paper being published, right? So it feels like a much shorter time horizon. For Genie two, the setting was basically: the Genie one paper came out around February, March 2024, we'd had the results for a while, and we'd already started thinking about the future plans and actually got to work on it.
就像几年前文本生成图像的爆发期,我们清楚看到Genie下一阶段可以追求更大规模——因为视频模型已证明可扩展性。于是从2D游戏转向任意3D世界数据集,升级到360p,能根据图像提示生成持续10-20秒的环境。
But then around that time, there was a lot of progress on video models in general — they sort of had their moment, like text-to-image had a few years before. And so it became very clear to us that for the next phase of Genie, we could go for something a bit larger scale and it would probably work, right, because we'd seen that the video models had scaled pretty effectively. So we decided to go from a 2D games dataset to an any-3D-worlds dataset. And we scaled it up to 360p, and it had the ability to generate environments, again from image prompts, that lasted maybe ten, twenty seconds before they degraded.
虽然仍非实时(每帧需等待几秒),但能交互几分钟。对3D世界而言这很突破——毕竟采用自回归生成方式,连维持10-20秒都是不确定的。所以这确实是令人振奋的成果。
So it, again, wasn't real time — you could play with it for a couple of minutes, waiting a few seconds for each frame — and it kind of worked for 3D worlds. And that was also, at the time, not completely obviously going to work, I think, because it was doing this autoregressive generation, and it still wasn't clear that it would last even ten, twenty seconds. So I think it was quite an exciting result.
那么Genie二代是否如你所言克服了模式坍塌问题?
And did Genie two overcome the mode collapse issue as you described?
某种程度上是的。Genie二代是扩散模型,具有略微不同的特性,但仍仅支持图像提示。它的表达能力远不及视频模型。
Yeah, to an extent. So Genie two was a diffusion model, so it had slightly different properties, but it still was only supporting image prompting. So it wasn't anywhere near as expressive as video models.
明白吗?它仍需要你选择一张适用的图片,但这并非总能奏效。它无法自主生成世界,需要你提供特定格式的图像——比如画面中要有位置恰当的可控主体。
Right? It still required you to select an image that worked for it, which didn't always happen. And it couldn't generate its own worlds. Right? It required you to generate an image in a certain kind of format — where there's a clear agent in the right location, for example.
然后它才会基于此进行模拟。而我希望在Genie三代的讨论中重点强调的是:它完全能做到这点。总的来说,Genie二代当时验证了该方法的可行性,但非实时运行。我们从游戏引擎中看到,实现实时性会带来巨大影响。
And then it would simulate from there. Whereas what I'm hoping we spend a lot of time talking about with Genie three is that it can definitely do that. So yeah, to kind of bookend this part: Genie two, at the time, was a good sign of life for this general approach, but it wasn't real time. And we'd seen with GameNGen that real time was really impactful if you get it right.
虽然画质不错,但仍不支持文本输入,表现力有限,远不及同期发布的Veo 2等顶尖视频模型的视觉效果——那些成果当时让我们所有人都震撼不已。这正是我在Genie三代的联合负责人Shlomi展现专长的领域,我善于借力专业人才,而当时我们团队已拥有这些领域的专家。
And the visual quality was good, but it still wasn't using text as input. It wasn't as expressive, and it was nowhere near the visual quality of state-of-the-art video models like Veo 2, which came out at the same time and which was absolutely mind blowing for all of us. So I think that's where my co-lead on Genie three, Shlomi, came in and really had expertise. I like to borrow people's expertise, and at that point, it was pretty clear we had an expert in those areas.
能否谈谈如何整合其他技术方案?比如GameNGen、Veo是如何融入Genie研究体系的?容我思考片刻——Jack提得很好:我们一直在讨论研究脉络,从Genie一代二代到现在,还没深入探讨三代演示中最震撼的部分。Shlomi或许可以接着分享,这些过往项目如何影响了整体研究。
Yeah, talk a little bit about how you incorporated these other approaches — GameNGen, Veo — how is that incorporated into the Genie line of research? And let me think about this, because Jack, you make a good point: we've been talking about this trajectory, about Genie one and two, and we haven't really talked about Genie three and what's super impressive about it when you pull up the demo. Maybe we'll let you do that as well, Shlomi, and then talk about how these other projects you've worked on influenced the overall research.
好的。本质上,我们在Genie三代中尝试将所有维度推向极限。沿着Jack描述的演进路径:生成质量不断提升,Genie一二代能构建越来越连贯的世界——但仍存在局限。
Sure. So basically, I think in Genie three, we really tried to push it to the limits across all of the dimensions, right? If we look at the progression that Jack just talked about, we see that we have models that are more capable in terms of the quality of their generation. We see that Genie one and Genie two are able to generate worlds that are more and more consistent — but still to a limit, right?
我们着力突破轨迹连贯性的持续时间边界,这仅是改进维度之一。还有分辨率、交互时长、帧生成速度等——当这些指标相乘时,整体提升可达百倍量级,单从算力角度考量就非常可观。
So we wanted to push on how long this trajectory can remain consistent. That was one of the dimensions we wanted to improve. And of course the resolution, right? So if you look at the resolution, the duration of the interaction, how fast the next frame can be generated — if you multiply all of those dimensions, you get to quite, I don't know, like a 100x improvement. And if you just think about it as raw compute, right?
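The multiplicative arithmetic Shlomi is gesturing at can be made concrete with illustrative numbers — the per-axis factors below are my assumptions for the sake of the example, not quoted figures:

```python
# Per-axis improvements compound multiplicatively: improving resolution,
# interaction duration, and generation speed at once multiplies into a
# far larger overall factor. The numbers are illustrative placeholders.

factors = {
    "resolution (pixels per frame)": 4.0,  # e.g. 360p -> 720p quadruples pixels
    "interaction duration": 6.0,           # e.g. tens of seconds -> minutes
    "frame generation speed": 5.0,         # e.g. seconds per frame -> real time
}

overall = 1.0
for axis, f in factors.items():
    overall *= f
    print(f"{axis}: x{f:g}")

print(f"combined: x{overall:g}")  # -> combined: x120, i.e. roughly 100x
```

The point is that no single axis needs a 100x jump: a handful of modest per-axis gains is enough, which is also why the combined number reads as raw compute.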
必须承认这个项目存在风险,成功并非必然。但当下正是全力推进的时机。在Veo发布并获得良好反响后,我们明显感受到质量飞跃——尽管它既非实时也非交互式。
And I think it was very clear that this was not something that was obviously going to work, so there were some risks to this project. But we also felt that this was the time to go for it. So what happened, basically, is that after we launched Veo, it was very well received. We felt like the quality really improved — and that was definitely not real time and not interactive.
随着Genie二代在另一方向取得突破,我们决定融合这些改进向量,向更高层次进发。这就是Genie三代的使命——或许可以双关地说——试图集诸界之大成。
And then Genie two came out and definitely also pushed the envelope in this different direction. We just said, okay, let's try and combine those vectors of improvement and go to the next level. And that's pretty much what Genie three is about: trying to bring the best of all worlds, pun intended maybe.
而且,你知道,当你谈论这个话题时,比如将这两个领域的最佳部分结合起来,是理念的融合吗?是架构的结合吗?还是数据集的整合,或者某种以上所有元素的组合?
And, you know, when you talk about this, bringing the best of these worlds together, is it bringing together ideas? Is it bringing together architectures? Is it bringing together datasets, or some combination of all of the above?
首先,我认为这涉及方方面面——虽然可能有点老生常谈——但人绝对是核心因素。我们让来自不同团队的成员带着他们的经验、动力和能量加入这个项目。技术上确实存在共通的挑战:我们最终输出的都是像素,技术以文本为输入,而我们希望能生成具有一致性的内容。
First, I think, you know, this may be a cliche, but it's definitely about the people as well. We had people from different teams bringing their experience and their motivation and energy into this project. So I think that was a big thing. And then in terms of the technique, there are definitely shared technical challenges, right? Basically, the output we generate is eventually pixels, we take text as input, and we want to be able to generate something that feels consistent.
即使是像Veo生成的八秒视频,你仍希望感受到连贯性。如果镜头移动,画面应该保持协调,让人感觉这可能是真实世界拍摄的。交互场景也是如此——虽然要根据用户输入生成下一帧,但这种一致性至关重要。早期某些模型能生成下一帧,某种程度上像GameNGen,但缺乏长上下文记忆。
So even if it's a video of eight seconds like Veo generates, you still want to feel that it's consistent, right? If the camera moves around, things should look consistent. You want to get a feel that this was maybe taken in the real world. Right? And the same goes once it's interactive: we want to be able to generate the next frame based on the user's input, but this consistency is still really important. There were some early models that were able to generate the next frame; GameNGen, in a way, was like that, but it didn't have very long context.
它能运行是因为学习了《毁灭战士》这款游戏的特定属性,记住了关卡样貌,并非真正按我们理想的方式生成。我认为真正的突破在于从文本生成内容——这是从图像模型到视频模型的核心能力,也是整个研究路线的主要创新点。
And the reason it worked is because it kind of learned specific properties of this game, of Doom, basically. It kind of remembered how the level looks. So it wasn't really generating it as we would want. Right? So I think the real breakthrough was actually being able to generate things from text.
文本作为高度压缩的表征,是学习概念的强大方式。这次我们从文本出发描述世界,再将用户或智能体投放其中自由探索。基础设施方面,Google DeepMind一贯致力于理解模型扩展的核心机制,这些概念可跨模态应用,当然在不同延迟和内存需求间需要权衡。
This is the core capability that you see from image models to video models. I think that was the main innovation breakthrough for all of this line of research: that you could start with text. Text is such a compressed representation, and it's such a strong way to learn concepts. So it was obvious that we start from text this time. We want to describe the world, and then we just want to drop you into it and let the user or agents go around and explore it. So those, I think, are the similarities between the projects. And of course the infrastructure, the hardware: in general, the approach we take at Google DeepMind is to try and understand the core mechanics of how those models scale. Those concepts we could pretty much leverage across different modalities, with different trade-offs of latency and memory, for example.
Genie 3中有你特别喜欢的案例吗?
Do you have a favorite example from Genie three?
我特别喜欢那只蜥蜴,虽然不够逼真。那只跳跃的折纸蜥蜴撞上折纸河流时溅起水花的设计很棒。还有团队成员在X上发布的案例——当用户低头看到鞋子和水坑倒影时,效果非常真实。
I really like the lizard, although it's not photorealistic. The origami lizard that jumps, you know; I really like that it splashes a little bit of water once it hits the origami river. And of course the puddles. Some of the examples were posted on X by team members, and I really like the ones where people played with the model. There's one where you walk around, and the user looks down at their shoes and then sees them in a puddle, and it's very realistic, I think.
Jack,你最喜欢的例子是?
Jack, your favorite example?
有很多展现不同能力的酷案例。但最令人惊讶的是那个"盗梦空间"式的案例,可能需要一分钟来解释。
There's loads of really cool examples, I think, that show different capabilities. Right? But the one that I think was the most surprising was the inception sample. It probably takes a minute to explain this, but essentially, we can prompt our model with videos. And this is a really exciting capability because, obviously, with an amazing video model like Veo 3, you can generate really cool videos and then actually prompt Genie three with the video and continue from there. Right?
我们团队在尝试用视频提示模型时,Jacob不小心输错了字幕。结果发现当文本提示与视频潜在空间不对齐时,模型会自行调和——如果你背对视频提示的世界转头,另一个奇幻世界就会出现。最神奇的是我们用模型演示视频去提示模型本身。
And that's something really exciting that we were playing around with. But then one person on the team, Jacob, by mistake, actually didn't put the right caption. What you realized was that if you don't align the caption, or text prompt, with the video latents, then the model kind of makes it work. Basically, what happens is you're facing the world from the video prompt, and then you look away and this other kind of magical world will be there. And so what he tried was actually prompting the model with a video of people playing the model demo.
我们在社交媒体上看到一段视频,是谷歌DeepMind的几位员工在办公室里试用实时演示。他们输入的提示词类似‘有霸王龙的丛林’之类的。在Genie 3生成过程中,显示他们实际操作的屏幕突然切换成了这个丛林世界,笔记本电脑屏幕也同步变化。这说明Genie 3能同时更新两个界面,我觉得非常不可思议。更神奇的是,当你转身时,窗外景象也变成了提示词中的丛林。
So we have this video, which was also posted on social media, of a couple of folks at Google DeepMind playing with the live demo in an office room. And the prompt was, like, a jungle with a T. rex, or whatever it was. And then during the Genie three generation, the screen that was showing what they were actually playing switches to this jungle world, and so does the laptop. So Genie three knows that it updates both, which I think is pretty incredible. But then it's also the case that when you turn away, you see that outside is actually a jungle, just as in the prompt.
当你走进丛林再回头时,又会看到他们所在的办公室和正在操作的场景。这简直需要亲眼所见才能相信,但确实惊人地展现了模型对事物的理解能力——它知道需要同步更新两个屏幕,也明白从办公室外出再回望时应该看到建筑物。这种逻辑理解非常酷。
And when you go into the jungle and turn back around, you see the office that they're in and you see them playing. You kind of have to see it to believe it, but I think it quite incredibly represents how the model actually does have some understanding of things, because it understands that when it updates the screens, it should do both. Right? And it also understands that if you're in an office and you go outside and you look back, then you should see some kind of building that they should be in. So I think that's really cool.
虽然项目初衷并非要实现这种效果,但追求有趣目标时往往会有意外收获。这个案例就是很好的例证。
It's definitely not what the goal of the project was, to be able to do that, per se. But I think sometimes when you pursue interesting objectives, then unexpected things can arise. Right? So I think that's a really nice example of one.
让我特别兴奋的不是某个具体演示,而是白板功能所展现的能力。演示中有块写着"苹果"、"Genie 3"和画着树的白板,这个例子完美体现了模型的记忆能力——正是这种能力让它成为真正的世界模型,让你感觉身处一个真实世界。
I think, you know, it's not necessarily a demo, but a capability that I'm very excited about. One of the demos we have is the whiteboard that has, you know, an apple and "Genie 3" written on it, and a tree. And I think the nice thing is that this really demonstrated one of the capabilities of memory. Right? And that's what, to me, basically makes it a world model: that you actually feel you're in a world.
你看着白板,望向窗外,再回头时所有东西都原封不动地保持着,这种一致性非常强大。
Right? You look at the whiteboard, you look through the window, you come back and it's there. It looks exactly the same. Everything is in place, and it's just, you know, really strong. Yeah.
确实。
Yeah.
对我来说,滚筒刷演示同样震撼。虽然场景很简单——就是有人在粉刷房间墙壁,但当视角移开再转回时,之前随机涂抹的笔触完全保持原样。自回归模型能如此精确地记忆多帧前的画面,实在令人惊叹。
I was going to say, for me, for the same reason, the painting roller one is super impressive. Maybe it's the simplest world of all of the demos; you're in a room and someone's painting the wall. But you see the viewport pan away from the wall that has been painted with these random strokes, and then pan back, and the strokes are perfect, you know, perfect memory of what was painted frames before. For an autoregressive model to capture that so precisely? Yeah. Super impressive.
团队第一次看到这个生成效果时,很多人都不敢相信。因为模型不仅维持了初始视觉环境,还完整保留了你的操作痕迹及其后果。这同时表明模型也能用于职业技能培训等场景,是我们未曾预想的应用方向。
I think when we saw that generated by someone, there was a bit of disbelief across some people on the team, because we didn't even know the model was capable of doing something like that. Right? It's not just that the original visual world is maintained, but actually the actions you took in it, and the consequences of those actions, are maintained as well. It's also pretty cool because it shows you could use the model for sort of more vocational things as well. I think that's quite interesting as a use case that we didn't really think of.
看来你对Genie 3的样本选择很有眼光。
So I really think you have great taste in Genie three samples.
现在我们来聊聊模型本身。如我们提到的,高层次的挑战包括一致性、延迟问题,还有生成画面的分辨率与丰富度。之前也讨论过这个本质上是自回归的模型,涉及Transformer和扩散模型等技术。
Yeah. Let's talk a little bit about the model itself. We mentioned, I think at a high level, challenges like consistency; latency is clearly a challenge. And, you know, you've talked about the resolution, or the richness, of the produced visuals. We've alluded to the model being autoregressive in nature, and we've talked a little bit about transformers and diffusion.
我们应该如何思考模型架构、模型本身,以及你们如何利用建模过程的各个方面来克服这些挑战?
Like how should we think about the, model architecture, the model and how you've used aspects of, the modeling process to overcome these challenges?
是的,是的,别担心。我认为模型的一个关键方面在于它是自回归的,这意味着在这个上下文中,下一帧是基于之前发生的所有事情的长序列(可能是很长的序列)生成的,对吧?
Yeah. Yeah. No worries. So, you know, I think one of the key aspects of the model is that it is basically autoregressive, which means, in this context, that the next frame is generated based on the potentially long sequence of everything that happened before. Right?
因此,模型必须观察特定帧之前发生的事情,对过去的这类信息进行推理,并决定哪些信息与下一帧相关。关键在于这一过程必须非常快速,每秒要发生多次,因为我们永远无法预知用户的下一个动作是什么。我认为这正是让它成为实时交互式的原因,不仅仅是实时,而是实时交互式,它能对发生的事情做出响应。这个术语实际上引导了我们系统和架构的设计,即在实时性的同时保持交互性。
So the model has to look at what happened before the particular frame, reason over this past, and decide which information is relevant to the next frame. And the key is that this has to happen very quickly, multiple times per second, because we can never know what the user's next action is going to be. Right? I think this is really what makes it real time interactive, not just real time but real time interactive: it responds to what happens. And that term is really what led our design of the system and architecture, the interactivity while being real time.
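To make the shape of that loop concrete, here is a minimal sketch in Python. Everything in it is a made-up stand-in, not the actual Genie 3 architecture: `ToyWorldModel` just hashes the visible history, and the context length of 1440 frames only illustrates the idea of a bounded "about one minute" memory. The point is the control flow: at every tick, read an action that could not be known in advance, then predict one frame from the bounded past.

```python
from collections import deque

class ToyWorldModel:
    """Hypothetical stand-in for the real frame model: it just hashes
    the visible frame history plus the latest action into a fake frame."""
    def next_frame(self, frames, action):
        return hash((tuple(frames), action)) % 1000

def interactive_loop(model, get_action, steps, context_len=1440):
    # Bounded context: the model sees only roughly the last minute of
    # frames (e.g. 24 fps * 60 s = 1440) when predicting the next one.
    history = deque([0], maxlen=context_len)  # frame 0 comes from the text prompt
    for t in range(steps):
        action = get_action(t)                     # user input, unknown in advance
        frame = model.next_frame(history, action)  # must run many times per second
        history.append(frame)
    return list(history)

frames = interactive_loop(ToyWorldModel(), lambda t: "forward", steps=10)
```

Because `history` has a fixed `maxlen`, older frames silently fall out of the window, which is exactly the one-minute-memory limitation discussed later in the conversation.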
基本上,一切都归结于这种设计决策。有趣的是,为了实现这种极低延迟,并能够回顾之前发生的事情,我们不得不深入研究如何选择和利用合适的架构和规模,既能实现高质量的模型,又能利用我们拥有的顶级硬件来构建一个真正可用的系统,而不是仅仅停留在理论上。我们希望最终能够与更多人分享。
And basically everything boils down to that kind of design decision. And the interesting thing is that getting to this very low latency, while being able to look back into what happened before, made us look into how we can pick and leverage the right architecture and scale, one that enables both a very high quality model and also leverages the best-in-class hardware we have, to actually build something that works. So we don't end up with a theoretical system, or something that might just be a paper, but actually something that we hope eventually we'll be able to share with more people in the future.
顺着Shlomi的思路,我认为我们团队必须设定一个目标,即在所有这些维度上都要有雄心壮志。如果你一开始不坚定地承诺这一点,那么一次性实现所有这些目标会非常困难。这正是模型的挑战和魔力所在——它能够具备记忆、高分辨率、多样化的世界,同时还能实时运行。在每个维度上,我们都有非常出色的人才。从团队的致谢部分可以看出,这个团队比我们之前在Genie系列中的团队要大一些。我认为这几乎是一个全新的模型,只是继承了名字。我们在所有这些不同领域都有非常优秀的人才,他们为每个单独的组件付出了巨大努力。
To go in the same direction as Shlomi, I think we had to really set ourselves, as a team, this goal of trying to be ambitious in all these dimensions, right? That's something that, if you don't commit to it right at the beginning, will be very hard to achieve all in one go, which is really the challenge, and really the magic, of the model: that it can have memory, high resolution, diversity of worlds, and also be real time. And in each of those dimensions, we had really amazing people on the team. Obviously, you can see from the acknowledgments that it's a slightly bigger team than we had before in the Genie series. I would consider it almost a new model, but with an inherited name. And we really had great people in all these different areas who worked really hard on each individual component.
对吧?但同时也要对其他部分有清晰的认识。我认为,看到大家能够将这些部分整合在一起,真是令人难以置信。但每一个部分都是一个挑战,对吧?
Right? But also with awareness of the other parts. And I think it really is incredible to see the kind of things that people were able to put together. But each one was a challenge. Right?
所以,这些部分中没有一个是容易实现的。
So it wasn't the case that any one of those parts was easy to achieve.
当你们谈到挑战时,没有特别提到一致性,而博客中确实将一致性作为一种涌现属性提到,暗示它可能并不是你们设计时的主要目标。是这样吗?
And when you talked about the challenges, you didn't specifically mention consistency, and the blog post does specifically mention consistency as an emergent property, suggesting that it wasn't necessarily something that you were designing towards. Is that the case?
我认为可以这样理解:这绝对是我们的目标,并且我们确实在设计时考虑到了这一点。比如,当我们列出模型的规格时,明确提到了一分钟的记忆能力,这是我们的目标之一。但关键在于,这并不是一种显式的世界表示。
I think one way to think of it is: that was definitely our goal, and we definitely designed it in a way to achieve this goal. When we listed the spec of the model, what we wanted, a memory of about one minute was definitely in it. Right, that was our goal. But I think the key thing is that there is no explicit representation of the world.
举个例子,有很多方法可以实现显式表示,比如你可以实现一个3D引擎,它有明确的网格,然后进行渲染,你可以随意移动。这种方法有很多限制,但也有惊人的效果。如果我们更多地进入机器学习领域,你会看到神经辐射场(NeRF)和高斯泼溅(Gaussian splats),它们基本上推导出世界几何的某种表示,基于此你可以四处走动并渲染一切。这些都是显式表示,而我们不想这样做。我们认为,尽管这些方法有很多应用,但它们也有局限性,尤其是在动态环境方面更难实现。
For example, there are a lot of approaches with explicit representations. You can implement a 3D engine that has a very explicit mesh, and then this gets rendered and you can go anywhere you want, right? It has a lot of limitations, but also amazing implications. And if we go more into the machine learning world, you have NeRFs and then Gaussian splats, which basically derive some representation of the geometry of the world, and based on that you can just walk around and everything gets rendered, right? These are all explicit representations, and we didn't want to do that. We think that while they do have a lot of applications, they're also limited; it's much harder to have dynamic environments.
我们某种程度上希望模型能自行领悟这一点。这就像是,如果你想要那样,我认为在我们实际操作时虽以此为目标,但我们觉得不应在系统中内置任何机制来实现它。我想这正是我们——你知道的——如果我们愿意,我认为我们是'苦涩教训'的好学生,我们相信(至少我个人坚信)许多这类能力仅通过数据就能习得。只要你以某种方式设计系统,使其具备学习这些能力的架构。
And we kind of wanted the model to learn that on its own. So while we did aim for that, we didn't think we should build anything into the system to achieve it. I think we're good students of the bitter lesson, and we believe, at least I believe, that many of those things can be learned from data alone, if you just design the system in a way that's set up to learn those capabilities.
对。
Yeah.
这当然也意味着每个环节都必须极其谨慎地处理,对吧?因为模型会学习数据中的内容。所以你需要一个真正强大的模型来实现这点,但同时也必须用正确的数据训练它,确保它学到正确的东西。
Which also, of course, means that every part has to be done really carefully. Right? Because the model will learn what's in the data. So you have to have a really capable model that can do that, but you also have to train it on the right data, so that it does learn the right things.
因此要实现这点而不借助其他方法,确实需要多方因素协同作用。
And so it really has to be a lot of things coming together to get that without adding those other methods in.
我们讨论的这个模型有个重要特性就是它的可提示性——从文本开始生成世界。博客里还有个例子是关于通过提示影响世界行为的。这是指Genie本身,还是说我们现在讨论的是Genie环境中的智能体?这两者是否存在区别?你如何看待当前这种区分,以及这种智能体交互范式的未来发展方向?
So one of the big features of the model that we've talked about is its promptability: it starts with text that generates the world. There's also an example in the blog post where you're prompting, like, the behavior in the world. Is that Genie, or are we talking now about an agent within the Genie environment? Is there a distinction between these two? How do you see that today, and where do you see that whole agentic interaction paradigm going?
我想可能需要先解释下所谓的'可提示世界事件'——你指的应该是这个功能。这个能力并不直接关联智能体,可以理解为上帝模式:你可以任意改变世界,比如召唤沙尘暴、让箱子从天而降,我们尝试过各种操作——基本上可以随心所欲地改变世界中的任何元素。
So maybe just to elaborate a bit on what we call promptable world events; I think that's what you're referring to. This capability is not directly tied to the agent. You can think about it as God mode, if you want: you can just change anything in the world. You want a sandstorm coming, you want to drop a box; we tried a bunch of stuff, like dropping objects from the sky, or changing anything. You can just change anything in the world that you want.
除了可提示的世界事件,还有像走到某个位置这样的——
Well, there's promptable world events, but there's also walk to the, you know, the-
哦,红色架子。
Oh, red rack.
走到下一个位置,这不同于上下左右那种
Walk to the next, which is different from like the up, down, left, right type
操作。所以或许,或许我可以,我,我
of thing. So maybe, maybe I can, I, I
两个都谈?你可以先开始,比如继续讨论世界事件,然后我们再聊其他内容。
Talk about both? You can start, you know, continue with the world events and then we'll get to the other stuff.
是的。我们确实拥有这种可提示的世界事件功能,能够直接改变世界状态并注入新信息,就像你说的那样。这本质上实现了对世界的控制,不再局限于初始提供的提示。对吧?所以这更像是临时性的...
Yeah. So we do have these promptable world events that allow you to just make a change in the world and inject some new information, like you said. That allows control of the world beyond just the prompt provided at the beginning. Right? So this is more of, like, a temporal...
就像在生成过程中间临时插入一个提示。
Like you're injecting a prompt in the middle of your generation in time.
这个能力其实相当深层,因为并不总是显而易见的——有些提示本身就不合理。比如你说'一扇门在沙漠中打开',模型可能会困惑'该开什么门?'。所以我们发现有时会产生奇怪效果,但当情境合理时效果惊人。比如我们有些精彩样本:巨龙从天而降落在隧道中央——这种场景下功能就展现出强大潜力。
And it's quite a deep capability, I think, because it's not obvious; the prompt doesn't always make sense, right? For example, you say, okay, a door opens, and you're in the middle of the desert: what door should open? And the model is like, you know, I don't know. So we see that sometimes it can make weird stuff, because the model is trying, but when it makes sense, we often see that it does work, and we have some very nice samples, like the dragon that appears out of the sky and lands in the middle of the tunnels. So there are definitely cases where it works really well, and it's a very powerful capability.
暂停一下,我能想到两种实现方式:一种是先生成沙漠场景的预期帧,再以此为基础替换后续生成;另一种更可能是直接整合到模型架构中。能具体说说技术实现吗?
And now, if I can hit pause on that: say the door opens in the desert. I can think of a model where you generate your expected next frame with the desert, and then you use that as the input frame to generate a frame that will replace it in the continued generation. But I could also imagine that being a crude way of doing something that's more integrated into the model architecture. Can you talk a little bit about how that is done?
我们本质上想定义什么是'事件'。现实中发生的事情不全是主体行为驱动的——这与你们提到的SIMA模型(可以请Jack详述)不同,那是通过外部模型控制主体行动。回到可提示世界事件,关键在于它不基于单帧判断...
I think what we basically wanted is to think about what an event is, right? We walk around the world and things happen around us that are not necessarily done by us; they're not agent-centric. Right? I think that's basically the distinction from what you mentioned about the agent acting in the world, maybe walking somewhere, which in the videos you mentioned was done by an external model, the SIMA model, which we can talk about. I think it's very interesting; maybe Jack can tell us a bit more. It's something that was also tried with Genie two and worked, so it's really cool, and then we built on top of that. But if we go back to promptable world events: the ability to trigger them is not just based on a single frame, right?
比如你想看到某物不会立即出现,但当你左转时会发现有人穿着Genie 3 T恤。我们有些滑雪案例就是这样——事物会自然融入环境,而非突兀出现。重点是要让生成内容与世界观有机整合。
So it can be that you want to see something in the world, and it doesn't actually happen immediately, but then you look to the left and you see, for example, a person. We have some of these examples where you ski down the slopes, and then you look to the left and there's a person wearing a Genie 3 t-shirt. So it could just materialize things in the world, but that doesn't mean it just pops in front of you. We want it to be, ideally, something that's integrated and makes sense in the world, right? Because it's easy to just drop something in, and it looks very artificial.
我们追求的是无缝融合的真实感。模型最终要生成接近训练数据的效果,而训练数据本质上是写实的。
We want it to actually be integrated, to look real, right? The model eventually wants to make things that look like the training data, which ultimately should be realistic.
所以这是将附加条件信息深度整合到下一帧生成流程中,而非简单粗暴地插入画面。非常有意思。
So it's somehow additional conditioning information that's integral to the next-frame generation process, as opposed to, we're just going to drop this thing in the middle of the view. Super interesting.
就像你在命令模型执行,而它回应'等我准备好'——虽不是技术术语,但确实实现了自然流畅的效果。
It's like you're telling the model to do it, and it's like, I'll do it when I'm ready, kind of thing. That's not the technical term, but yeah, it does it in a way that feels natural.
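One way to read that answer is that the event text joins the conditioning stream rather than being pasted into the pixels of a single frame. The sketch below is speculative: `ToyModel`, `generate_with_events`, and the string "frames" are all invented for illustration (the real system presumably conditions a diffusion/transformer model on text embeddings). It only shows the mechanism that, from the injected step onward, every future frame is generated under the extra condition, so the model can realize the event when it fits.

```python
class ToyModel:
    """Hypothetical frame model: a 'frame' is just a string recording
    which conditioning was active when it was generated."""
    def first_frame(self, prompt):
        return f"frame0[{prompt}]"
    def next_frame(self, frames, conditioning):
        return f"frame{len(frames)}[{'+'.join(conditioning)}]"

def generate_with_events(model, prompt, events, num_frames):
    # events: frame index -> text such as "a dragon lands in the tunnel"
    conditioning = [prompt]
    frames = [model.first_frame(prompt)]
    for t in range(1, num_frames):
        if t in events:
            conditioning.append(events[t])  # injected mid-generation, kept thereafter
        frames.append(model.next_frame(frames, conditioning))
    return frames

frames = generate_with_events(ToyModel(), "a desert", {3: "a dragon appears"}, 5)
# frames 0-2 carry only the desert prompt; frames 3-4 also carry the dragon event.
```

Note the contrast with the host's "replace one frame" idea: here nothing overwrites an individual frame; the event simply persists as a condition on all subsequent generation.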
那么,Jack,简单谈谈SIMA智能体吧。
So, Jack, talk a little bit about the SIMA agents.
好的。正如我之前所说,回顾项目历史时,我们设计这个环境就是为了智能体。在Google DeepMind,我们当然有许多研究智能体的项目,而真正专注于三维世界的正是SIMA智能体。他们致力于训练能在三维模拟环境中实现语言目标的智能体。
Sure. Yes. So, as I said, going back to the history of the project, we designed this to be an environment for agents. And at Google DeepMind, obviously, we have lots of projects working on agents, and the one that's really focused on 3D worlds is the SIMA agent. They're trying to train agents that can achieve language goals in sort of 3D simulated environments.
他们在2024年2月左右发布了公告或博客文章,大致展示了他们的思路。目前他们正在现有游戏中进行训练——已经开发出能在不同游戏世界里完成多样化任务的强大智能体,但最终受限于只能访问这些特定游戏世界。明白吗?它无法在任意想象的游戏世界或现实世界中进行训练,因为可训练的环境集合是有限的。
And they have an announcement, or a blog post, from probably around February 2024, where they showed a bit about how they're thinking about this. What they're doing right now is training in existing games. They've got a really capable agent that can do quite diverse things in different game worlds, but ultimately it's limited by only having access to those game worlds. Right? It can't train in any imaginable game world, or in the real world, because it's got access to only a finite set of environments to train in.
而这正是Genie试图解决的核心问题——生成新环境。但有趣的是,SIMA智能体展现出惊人的泛化能力。尽管只在较小规模的世界集合中训练过,你却能直接把它放入从未见过的Genie生成环境中。
And this is kind of the exact problem that Genie is trying to solve, right: to generate new environments. But, also, the SIMA agent was surprisingly general. Even though it's been trained on a smaller set of worlds, you can kind of drop it into a Genie environment it's never seen before. Right?
比如用文本描述生成一个Genie三维环境——可以是工厂车间之类的场景,背景里有辆叉车。生成这个世界后,你就可以对SIMA智能体说‘去叉车那里’。甚至可以说‘去找能搬运东西的那个物体’。从那一刻起,SIMA智能体就会把Genie生成的世界当作普通环境来交互。
So you use text to create a Genie three environment, or world; say you describe a scene that could be a factory floor or something like that. And you could say, in the background there's a forklift truck. You generate this world, and then you say to the SIMA agent, go to the forklift truck. Right? Or you could even say, go to the thing that can lift things, or something like that. And the SIMA agent, from that point onwards, treats the Genie-generated world as if it's any other environment.
它并不知道自己身处一个模型之中。它一无所知。它只是看到像素,然后说,我要按下这个键来实现这个目标。而Genie看到的只有按键动作。对吧?
It doesn't know that it's in a model. It doesn't know anything. It just sees the pixels, and it says, I'm going to press this key to achieve this goal. And then all Genie sees is the key press. Right?
它并不知道SIMA智能体想要做什么,因为如果知道的话,它可能会促成这件事。对吧?它只知道要向前进。然后它模拟下一帧画面,SIMA智能体看到新画面后说:好的,我要继续前进。
Genie doesn't know what the SIMA agent is trying to do, because if it did know that, it might make it happen. Right? All it knows is the agent wants to go forward. So then it simulates the next frame, the SIMA agent sees the next frame, and it says, okay, I'm going to keep going forward.
这些过程会同步发生,像打乒乓球一样来回交替。关键的是,如果SIMA智能体做了错误动作,目标就无法实现。对吧?如果动作正确,就能达成目标。这样SIMA智能体就能从经验中学习,更频繁地实现目标。
And then these happen in tandem, back and forth. And, critically, if the SIMA agent does the wrong actions, it won't achieve the goal. Right? If it does the right actions, it will achieve the goal. So then, of course, you can see that the SIMA agent can learn from this experience to achieve the goal more often.
对吧?可能有些事它现在还做不到,但可以在这些虚拟世界里学会。所以我们本质上看到了生命迹象——一个智能体通过与另一个互动,在这些具身化世界里教会它新技能。对吧?而且是以前所未有的规模实现的。
Right? And there may be some things it can't do yet, but it could learn to do them in these worlds. So we essentially have signs of life that we have one agent interacting with another, essentially to teach it new skills in these more embodied worlds. Right? And at a scale that hasn't really been done before.
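The interface Jack describes, where the agent sees only pixels plus a language goal and the world model sees only the key press, is a simple closed loop. The classes below are invented stand-ins (SIMA and Genie are obviously not one-line Python objects, and the "frame" here is just a position counter standing in for pixels); the sketch only shows the back-and-forth protocol and the success signal the agent could learn from.

```python
class ToyWorld:
    """Stand-in world model: 'frames' are just a position along a path."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, key):
        # The world model only ever sees the key press, never the goal.
        if key == "forward":
            self.pos += 1
        return self.pos

class ToyAgent:
    """Stand-in agent: maps the observed frame and a goal to a key press."""
    def act(self, frame, goal):
        return "forward"  # a real agent would decide from pixels + language goal
    def goal_reached(self, frame, goal):
        return frame >= goal

def rollout(world, agent, goal, max_steps):
    frame = world.reset()
    trajectory = []
    for _ in range(max_steps):
        key = agent.act(frame, goal)   # agent: pixels + goal -> action
        frame = world.step(key)        # world model: action -> next frame
        trajectory.append((key, frame))
        if agent.goal_reached(frame, goal):
            break                      # success signal the agent can learn from
    return trajectory

traj = rollout(ToyWorld(), ToyAgent(), goal=5, max_steps=20)
```

The separation of information is the point: neither loop partner knows the other's internals, which is why wrong actions simply fail to reach the goal instead of the world "helping".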
为了形成闭环,很酷的一点是可以将这与世界事件相结合。对吧?因为即便是走在街上这样平淡的环境,如果突然加入比如一只猫跳出来之类的事件,就会变得有趣得多。这样我们就能训练智能体在简单环境里应对各种突发情况,因为我们掌握了环境调控这个额外手段,能让环境对智能体更具挑战性。
And then, to kind of close the loop on this, a really cool thing could also be integrating this with the world events. Right? Because even a kind of benign environment, like walking down the street, might become much more interesting if you then injected, say, I don't know, a cat jumping out or something like that. So, actually, we can teach our agents to be robust to all these different kinds of things, even in simple environments, because we have this additional lever to pull on the environment side, to make it more interesting and challenging for the agents.
有意思,真有意思。杰克,我们之前稍微聊到了Genie One的局限性,你知道的,就是在传达艺术、局限性和如何呈现它们之间的平衡——是直接摆在前面还是像你在Genie三篇博文中那样放在较后的位置。不过我很想听听你即兴谈谈这些局限性,然后我会把话题交给你,由你来开启关于下一步研究方向的讨论,或者至少先谈谈你看到的未来研究方向。那么杰克,我们就先聊聊局限性吧。
Interesting. Interesting. Jack, earlier we were talking a little bit about the limitations of Genie one, and this kind of balance in the art of communicating the limitations: leading with them versus having them further down the page, and they are further down the page in your Genie three blog post. But I'd love to have you riff a little bit about the limitations, and then I'll turn it over to you to talk about next steps and where you see the research going. So let's talk about limitations.
你知道,你如何
You know, how do you
我记不全我们列出的所有内容,但对我而言最突出的一点是:我们讨论过模拟其他智能体,而目前系统并未实现这一点
So I can't remember all of the things we listed, but the one that stands out for me is that we talked about simulating other agents, and we don't do this-
任何世界中的多智能体交互
Any multi agent interactions within the world.
没错。我认为目前这方面确实非常有限。本质上这个模型只是在预测下一帧画面,最多能实现一些非常基础的其他智能体模拟。
Exactly. Yeah. I think we've mentioned that this is quite limited at this point. Ultimately, the model is predicting the next frames. It's able to get some kind of very basic simulation of other agents.
比如当行人走路时你挡在前面他们可能会停下,或者汽车行驶时你横穿马路它可能会刹车。但这远称不上是复杂的交互行为。所以这显然是个局限,而且可能在像Veo这样的系统中已经先进得多。
Like, if someone is walking and you stand in their way, they might stop. Or if a car is driving and you walk in front of it, it might stop. But it's not the case that this is very complex interaction. So I think that's definitely a limitation. And it's also something that is maybe much more advanced in something like Veo already.
对吧?这是我们目前绝对不具备的。另外明显还有一分钟的时长限制——说起来有点滑稽,因为对Genie一代二代而言,一分钟已经堪称惊人。但咱们这个领域的发展速度实在太疯狂了。
Right? So it's something that we don't have, for sure. I think there's clearly the one-minute limitation as well. It's kind of funny to say it, because for Genie one and two, a minute would have been seen as incredible. But the pace of our field is absolutely crazy.
所以这种限制在未来肯定会显得尴尬地短暂。目前我们说视觉记忆维持约一分钟,我想这...
Right? So this is the kind of thing that will seem embarrassingly short, I'm sure, in the future. Right now, we say we have visual memory for a minute. And I guess that is-
单次Genie交互或游戏可以跨越数分钟,本质上这只是上下文长度的限制
An individual interaction or play of Genie can span multiple minutes. So it's just the context length essentially.
对,这个区分很重要。即使连续玩几分钟也不会像前几代模型那样出现画面劣化或严重模糊,但记忆确实维持在一分钟左右。
Yeah, and this is an important distinction. You can play for multiple minutes, even, and it doesn't degrade or become very blurry like previous generations of those models did. But, again, the memory is around one minute.
确实如此。此外,我认为现实世界的物理精确度并不完美。比如,如果你要求生成伦敦某条具体街道,它可能无法识别那条街。因此,我觉得这方面有些元素是可以改进的——如果你用文字描述一个非常抽象的世界,它几乎能准确呈现;但若描述具体地理位置,可能会发现结果与预期存在差距。
Exactly. And then also, I guess, real-world physical accuracy is not perfect. If you say, I want my exact street in London, it doesn't know my street. So there are some elements of that that could be improved, if you wanted to. If, in text, you describe a very abstract world, it will almost certainly get it on the money.
当你描述特定地理坐标时,可能会发现它某种程度上不符合你的期望。所以我认为这是另一个局限性所在。
If you describe a specific geographic location, you might notice that it's not what you hoped it would be in some way. So that's another one that I think is a limitation. Yeah.
那么Shlomi,你有没有注意到什么明显的局限性?
So, Shlomi, is there anything that jumps out at you in terms of limitations?
是的。这和你之前提到的智能体行动能力问题有些相似——目前行动空间相对受限。虽然智能体可以导航、跳跃或开门,但这些动作的语义层面还比较基础。可提示的世界事件让我们能控制环境,但这些并非以智能体为核心的动作。
Yeah. I think a bit similarly to what you asked before about the agents being able to take actions. Currently, the action space is relatively constrained, right? So while the agent can navigate, and we have some actions like maybe jumping or opening doors, they are relatively basic in terms of the semantics of the action the agent is taking. So promptable world events give us control over the world, but they're not necessarily agent-centric actions.
这绝对是我们希望在未来改进扩展的方向。因为当前限制确实明显——尤其当我们需要智能体执行更复杂的动作时(不仅是走动,还包括拾取物品、输入代码或与其他智能体交互等)。这是极具挑战性的问题:人类通过肢体与物理世界互动(用手操作、用脚行走),但剥离这些实体要素仅保留视觉像素后,如何定义合理动作就变得异常困难。
So I think this is definitely something that we hope to improve and expand in the future, because it's a real limitation, especially if we want the agent to take more complex actions: not just walk around, but actually be able to, for example, pick things up, maybe type in some code, or talk to a different agent. Many things can happen in the world. And it's quite a challenging problem, because we operate in the world as people in a very physical way, right? We use our hands, we use our feet to walk; we have an embodied kind of presence. And when we take all of this away and are left with the visual, with pixels only, it's much harder to define what actions should actually happen. Right?
以开门为例——这不仅是「开门」的指令,而是一系列微观动作:走近门、握住把手、拉动门扇。如何建模这种动作空间是个难题,但突破这个限制也将极大扩展系统能力。
When you open a door, for example, it's not just "open the door," right? You go and grab the knob of the door, and you pull it towards you. There is a sequence of micro-actions being taken. And I think there is a challenging question of how to model this space of actions, but it's definitely a limitation, and I think an opportunity to expand our capabilities.
关于未来方向,我感觉你们对智能体特性都很兴奋(毕竟来自DeepMind)。Shlomi,你认为项目最令人期待的发展方向是什么?
So digging into next steps, I get the sense that you're both pretty excited about the agentic aspects of this. No surprise coming from DeepMind. But what's most exciting to you, or most obvious or present for you, in terms of where the project goes? We could start with you, Shlomi.
对我来说,关键在于能踏入自己或他人创造的世界,并实现感知、观察与交互。这种能力应用场景极其广泛——从娱乐(如交互式街景)到教育。这让我想起早年参与的创业项目:我们在旧金山市区建模,开发基于真实地点的增强现实游戏。
So to me, the ability to step into a world, right, that you created or someone else created, and then actually perceive it, see it, interact with it, I think is huge. It can really be applied to so many things, from entertainment, which is very obvious, right? You mentioned Street View; interactive Street View, for example, could be somewhat anchored in the real world but take you somewhere else. And it actually reminds me of a startup I worked for a long time ago, where we had this kind of game placed in Downtown San Francisco.
这只是众多可能性之一。非娱乐领域也充满潜力——比如教育场景,或帮助人们通过视觉化达成心理突破。当你能在酷似自家环境里与虚拟蜘蛛共处时,这种具身认知会产生强大的心理暗示效应。
Of course we had to model the entire Downtown San Francisco, and it was a lot of work. But the key idea of this startup was to actually have games happening in real-world locations, right? So that's one example: you could have any interaction, not necessarily a realistic one, placed in a realistic location. Just one example among many. And I think there are other really interesting applications, not necessarily entertainment. It can be education, for example. It can be helping people see themselves accomplishing something they wouldn't have expected of themselves.
这种视觉化体验蕴含着巨大力量——看到自己完成意想不到之事会产生深刻心理影响。个性化环境(比如还原你家布局)能强化这种效果。虽然听起来有些超出...
Like, there is something very strong about that vision, seeing yourself accomplishing something. It's a bit of a psychological perspective, but I think it's very powerful. And there is the personalization aspect of being able to go into the environment, walk around, maybe prompt it in a way that looks very similar to your house, for example. So if you're afraid of, I don't know, spiders, maybe you can see yourself walk next to a spider at your home, and then maybe your brain says, okay, I can do it, right? So, I mean, I think it's a bit out there...
关于它在现实世界中的潜在应用等等
In terms of its potential real-world applications and so on.
对,对,没错。关键在于我们并不一定知道人们会如何利用这项技术,或者它未来会走向何方,目前还处于非常早期的阶段。正因如此,我们邀请了一些可信的测试者或学者与模型互动。最初我们想收集反馈,并希望随着时间推移,能更了解其能力边界以及人们感兴趣的应用场景。
Yeah, yeah, exactly. The point is that we don't necessarily know how people will use it, or where this technology will go; it's very early days for that. And that's why we had some trusted testers and academics interact with the model. Initially, we wanted to get some feedback, and we hope over time to learn more about the capabilities and the applications that people are excited about.
杰克,你接下来的计划是?
Jack, next steps for you?
是的。虽然已经提到很多激动人心的想法,但最让我兴奋的是教会智能体在视觉逼真的具身化世界中与人类互动。我认为现有智能体最欠缺的正是这种在物理世界中与人类交互的能力,而像G3这样的模型可能实现这一点。
Yeah. So, I mean, so many exciting ideas were already mentioned. The one that really excites me is teaching agents to interact in visually realistic, embodied worlds with people. I think that's a really missing capability for any of our current agents: to interact in the physical world with humans as well. And I think that models like Genie 3 could enable that.
而且我认为几乎没有其他途径能达成这个目标。结合当下世界局势,这种技术尤其令人振奋——它能生成我们通过其他方式无法获得数据的多样化场景。虽然目前仍处于探索初期,但这是将开启大量应用场景的重要一步。
And I also don't really think there's any other way to achieve that. So I think that's something really exciting, especially with the world events, right, which could enable generating really diverse scenarios that we wouldn't be able to get the data for any other way. So I think this is still fairly early in the journey, but this is a big step that will really open up a lot of use cases there.
好的,什洛米、杰克,非常感谢你们抽空详细介绍G3的进展和所有研究工作。深入探讨这些内容真的很有收获。
Well, Shlomi, Jack, thank you guys so much for jumping on and updating us on Genie 3 and everything that you're working on. It's been really great to dig into it.
太棒了。非常感谢您的时间。
Awesome. Thanks so much for your time.
是的。谢谢,萨姆。
Yeah. Thanks, Sam.
好的。谢谢二位。再见。
Alright. Thank you both. Cheers.