本集简介
双语字幕
欢迎收听《谷歌DeepMind播客》。我是主持人汉娜·弗莱教授。2025年即将到来,智能体AI时代也随之来临。当然,如果您一直在收听我们的节目,几个月前就已经知道这一点了。那么现在请仔细聆听,我们将为您介绍下一个重大进展。
Welcome to Google DeepMind, the podcast. With me, your host, professor Hannah Fry. Now 2025 is coming, and so is the era of agentic AI. Although, of course, you'll have known about that months ago if you've been listening to us. So now listen in carefully as we tell you about the next thing.
它被称为Project Astra,这是一个研究原型,正在突破通用AI助手可能实现的功能边界。这个智能体在设计上并不必然绑定于特定设备、屏幕或键盘。这正处于技术最前沿,而今天我们将亲身体验。Project Astra汇聚了我们本系列讨论过的所有技术:记忆、视觉、上下文理解、推理、实时交互。而能为我们深入解读这些技术,并讲述他与团队如何实现协同工作的是谷歌DeepMind研究总监格雷格·韦恩。值得一提的是,格雷格正是我在本播客第一季最初期采访的首批嘉宾之一。
It's called Project Astra, and it's a research prototype that is pushing the boundaries of what might be possible with a universal AI assistant. It's an agent that is, by design, not necessarily tethered to a particular device or a screen or a keyboard. This is right at the brink of the cutting edge, and today, we get to play with it. Now Project Astra brings together all of the things that we've spoken about in this series: memory, vision, context, reasoning, real-time interaction. And someone who can tell us a lot about all of those, and how he and his team have managed to get them to work together, is Greg Wayne, director in research at Google DeepMind. Now Greg also happens to be one of the very first people I interviewed right at the very, very beginning of series one of this podcast.
格雷格,欢迎回来。
Greg, welcome back.
你好,汉娜。
Hello, Hannah.
那么让我们从基础开始。什么是Project Astra?
Let's start the beginning then. What is Project Astra?
Project Astra是一个团队、一个项目也是一个原型,旨在开发具备视觉、听觉和语音能力的AI助手,它能与你实时共存。无论您身处何地——通过智能眼镜、手机或电脑——它都能看到您的行动并与您讨论。
Project Astra is a team and a project and a prototype aiming to build an AI assistant with eyes, ears, and a voice that's co-present with you. So it is with you in any place you are. So either through smart glasses or your phone or your computer, it can see what you're doing. It can talk to you about it.
就像坐在你肩膀上的小助手。
Like a little assistant that sits on your shoulder.
是的,是的。就像一只小鹦鹉站在你肩膀上,陪着你,和你谈论这个世界。
Yeah. Yeah. Like a little parrot on your shoulder that's hanging out with you and talking to you about the world.
那只是比你更聪明。
That's just smarter than you.
在某些方面是的。没错。
That in some ways. Yeah.
我我我想,那么在某种程度上,这与谷歌助手或Gemini不同。是的。它更深入地融入了世界。
I I I suppose then in a way then that is different to, like, Google Assistant or Gemini. Yeah. It's more embedded in the world.
是的。所以,谷歌助手的旧版本有点像是一个命令和控制系统,用于控制你的家或信息。你知道的?比如在YouTube上播放这首歌之类的。而这个则是关于能够讨论世界。
Yeah. So the the sort of older version of a Google Assistant was kind of like a command and control system for, you know, your your home or your information. You know? Turn on this song on YouTube or something like that. And this is about being able to talk about the world.
它建立在Gemini之上。Gemini是底层智能的核心,还有其他一些系统。我认为它在某种程度上是对Gemini愿景的补充,与其协同工作,可能有助于塑造Gemini,同时也被Gemini塑造。
It builds on Gemini. Gemini is the the intelligence underneath the hood along with some other systems. I think it's complementary to the vision of Gemini in a way and works with it and maybe helps shape Gemini and is also shaped by it.
那我们能试试吗?因为我觉得这种东西需要亲身体验才能理解。
Can we try it out then? Because I think this is something sort of have to experience to understand.
完全正确。是的。好的。我要重新打开它。
Totally. Yeah. Okay. I'm gonna turn it back on.
嘿,格雷格。
Hey, Greg.
它认识你。
It knows you.
确实。是的。是的。我之前和你进行过很多次对话。
It does. Yeah. Yeah. I've had many conversations with you before.
确实,我们之前交谈过。很高兴再次与你交谈。
Indeed, we have spoken previously. It is good to talk to you again.
我要把你转给我的朋友汉娜一下。
I'm gonna hand you to my friend Hannah for one second.
好的。很高兴认识汉娜。我随时都在,你需要我的时候我就在。
Alright. It's nice to meet Hannah. I'm here whenever you need me.
嗨,Astra项目。好的。
Hi, Project Astra. Okay.
你好,Hannah。
Hello, Hannah.
你能猜猜我们今天要做什么吗?
Can you guess what we're doing today?
根据设备和布置来看,你们似乎正在录音室录制视频或播客。对吗?
Based on the equipment and setup, it appears you are recording a video or podcast in a studio. Is this correct?
完全正确。好的。我要测试你几件事,因为我带了几样东西来。可以吗?
It absolutely is. Okay. I'm gonna test you on a few things, alright, because I brought in a few objects. Is that okay?
是的。我准备好了,随时可以开始。我期待看到这些物品。
Yeah. I am ready when you are. I look forward to seeing the objects.
好的。那么开始吧。这是我最喜欢的书,Astra项目。你能看到这本书吗?
Okay. Alright. Let's start off. This is my favorite book, Project Astra. Can you see the book?
是的。
Yes.
我能看到这本书。它展示了一个带有重复图案的图形,这是M.C.埃舍尔的《圆极限IV》的著名图案。
I can see the book. It displays a graphic with a repeating pattern, a motif known as M.C. Escher's Circle Limit IV.
你能猜出这是什么书吗?
Can you guess what the book is?
根据打开页面的图像和上面的艺术作品来看,这似乎是《费马大定理》。
Based on the image of the open page and the artwork on it, it appears to be Fermat's Last Theorem.
我太...我太佩服了。我太佩服了。好吧。我带来了那是
I'm so I'm so impressed. I'm so impressed. Okay. I brought in That's
那是一本一本精巧的老版本。
a that's a a crafty old copy.
我知道这个我知道。它是...你能看出来这是我最喜欢的书。它被珍爱过。
I know this I know. It's it's you can tell it's my favorite book. It's been loved.
它已经
It's been
多年来深受喜爱。好的。我可以——我打算尝试几样别的东西,就是想看看能不能真正考验你?好吧,我们来试试这个。
very loved over many years. Okay. Can I I'm gonna try a couple of other things just to just to see if I can really test you? All right, let's try this.
它是
It's
很高兴看到你拿出了漂亮的家具来搭配这本书。还需要我协助其他事情吗?
good to see you have brought out the nice furniture to complement the book. Do you require my assistance with anything else?
我现在已经拿出了漂亮的家具,所以是的,谢谢。好吧,这个怎么样?你知道那是什么吗?
I have brought out the nice furniture now, so yeah, thank you. Okay. What about this? Do you know what that is?
看起来是一个大脑的模型。
It appears to be a model of a brain.
哪个半球?
Which hemisphere?
这是大脑模型的左半球。哇哦。
It is the left hemisphere of the brain model. Woah.
为什么,为什么你要
Why why did you
我不确定什么
I'm unsure what
嗯,我的意思是,为此我不得不做了很多心理旋转。
Well, I mean, I had to do a lot of mental rotation for that.
我也是。它是哪个方向的?哦,是的。但它答对了。太棒了。
Me too. Which way round is it? Oh, yeah. But it got it right. Amazing.
我想再试一个,因为你身后有一块白板,它一直都在那里,我基本上已经看了它好几个月了。所以,Astra,如果我给你看那边那个,告诉我白板上的一些图画是什么。
I might just try, because there's a whiteboard behind you that has been there all along. I've basically been looking at this for many months now. So, Astra, if I show you that there, tell me what some of the drawings are on the whiteboard.
白板上包含各种图画,包括一棵树、建筑物,以及一系列相连的线条和形状。还有一些音符和文字气泡。
The whiteboard contains a variety of drawings, including a tree, buildings, and a series of connected lines and shapes. There are also some musical notes and text bubbles.
我印象非常深刻。我印象非常深刻。让我把这个小家伙放回它应该在的位置。所以Project Astra目前还是一个研究原型。它还没有作为产品供大家下载使用。
I'm very impressed. I'm very impressed. Let me just put this little guy back where it belongs over here. So Project Astra is still this research prototype. It's it's not available as a product that everyone can just download.
为什么现在展示它?为什么不等到它准备好再发布?
Why demo it now? Why wouldn't you wait until it was ready?
我认为在一定程度上让公众参与这个过程是很好的,对吧?我觉得人们应该了解实验室里正在开发什么样的技术,我们让更多人开始试用并适应它,或者提供反馈。现在这是一个共同创造的过程,不仅仅是实验室里闭门造车的东西,而是在与全球用户群体——谷歌外部的人们——合作开发。这一点也很重要。
I think it's nice to bring the public along on the journey in a way. Right? I think people should get to know what sort of being developed inside the labs, and we're giving it to more people to start playing with and start adjusting to or giving feedback about. Now it's a co creation process where it's not only sort of some kind of thing that's being cooked up in a lab, it's also being cooked up in collaboration with a group of users, you know, around the world, people outside of Google. That's important too.
所以,如果我们真的想让它成为对人类有帮助的东西,那么人们就需要开始使用它,并告诉我们他们的感受。
So if we're gonna make this really a helpful thing for humanity, then people need to start using it and telling us how they feel about it.
那么人们有没有把它带到外面去,在现实世界中试用呢?
So have people been taking this out and about and trying it out in the real world?
是的。我们有一些可信的测试者,他们正在使用它,这些人报名成为了早期的采用者。
Yeah. We've had these trusted testers, people who are using it, who were just signed up to be kind of early adopters.
人们都在用
What are people using
用来做什么?人们正在使用
it for? People are using
它来获取,比如说,从Astra那里获得时尚建议。
it for things like getting, like, fashion advice from from Astra.
哦,真的吗?以什么方式?
Oh, really? In what way?
比如,这个该搭配什么?所以,是的,Astra有点像
Like, what would match with this? So, yeah, Astra is is kind
就像一个伙伴一样。比如,哦,你觉得
of just like a like a partner. Like, oh, what do you think about
我怎样才能在这里看起来更清新?你知道吗?哦,哇。
how could I how could I have a fresher look here? You know? Oh, wow.
我的意思是,那是一只非常聪明的鹦鹉。
I mean, that's a very clever parrot.
这是一只非常聪明的鹦鹉。
It's a very clever parrot.
但硬件方面呢?我的意思是,目前如你所说,它还只是在智能手机上运行。但我们是否在讨论最终会应用到眼镜上?
But then what about hardware? I mean, at the moment, as you say, it's just it's it's it's on your smartphone. But are we talking about, like, eventually in glasses?
是的,但不仅如此。你知道吗?这个项目的早期版本开始时,其实是想探索如果人工智能搭载在智能眼镜上会有多实用。在智能眼镜上,这是一种最亲密且在某种程度上令人惊叹的体验。
Yeah. But not only. You know? So I think when an earlier version of this project started, it was really trying to to tease out how useful smart glasses would be if an AI was on them. So in smart glasses, it's the most kind of intimate and, in some ways, amazing experience.
你会感觉自己被增强了,就像在与一个更聪明的自己对话,它就在那里,告诉你任何你想知道的事情。但软件栈实际上是与设备无关的——我的意思是,虽然每种设备有专门优化,但你可以把它装在手机、电脑或VR头显上。
As you've got this, you feel personally augmented. Like you're having a conversation with a smart version of yourself that's just sitting there and telling you whatever you wanna know. But the software stack effectively is agnostic to the device. I mean, there's specializations for each device, but you can have it on phones or computers or VR headsets.
我刚才也在想,实际上,当我们试用它的时候,这对视力受损或失明的人也可能有很大好处,对吧?
I was thinking as well, actually, as we were just playing around with it, there's a potential benefit for people who are partially sighted here or blind too, right?
是的,这是我一直关注的方向。我们讨论过AI作为一种共同存在或共享你视角的技术。有时候你需要另一个能看能听的智能体陪伴,但并非总是需要。那么什么时候你会需要一个能与你一同看见的系统呢?
Yeah. That's an obsession of mine. We've talked about the sort of AI is being co present or sharing your perspective. And sometimes you want another seeing and hearing, you know, kind of intelligence with you, but you don't always need one. So when are the cases when you want kind of a system that that can can see alongside you?
当你能看见但不理解,或者当你无法看见的时候。这是一个完整的应用类别。全球有数亿视力受损的人群,对他们来说,最好的帮助标准是什么?那就是有一个能在身边协助他们应对世界的人。
If you see but don't understand or if you can't see. And so that's a whole category. And there's a lot of people out there, hundreds of millions of people who have vision impairment. And what's the gold standard of help for for that population? Well, it's having someone by their side who can help them out in the world.
这项技术在很大程度上能够复制这一点。我们也有一些关于其他类型残疾的更初期的想法。所以你可以想象帮助那些难以扫描面部情绪、在特定情境下理解情绪的人。
And this technology is able to replicate that to a large extent. We have kind of more nascent ideas too about other kinds of disabilities. So you could imagine helping people who have difficulty scanning emotions at faces, understanding that in certain circumstances.
那么潜在的自闭症患者可以使用这个来帮助自己吗?
So people potentially with autism could use this to help?
是的,是的。我目前不会推荐它作为处方药物。嗯。但我认为随着进一步的发展,它绝对可以成为。
Yeah. Yeah. I I wouldn't recommend it as a prescribed drug right at the moment. Mhmm. But I think with further development, it could definitely be.
也可以用于自我训练,因为你可以练习理解短语,并让Astra给你反馈,你知道,比如,"嘿,告诉我这个"。我记得,这有点跑题,但当我做寄宿家庭时,有一个夏天我在学法语,我发不准某些词的音。比如,"街道"和"轮子"这两个词的区别。嗯。比如,la rue 和 la roue。或者我……我还是发不准。
Also for training yourself, sort of, since you could work on understanding phrases and have Astra give you feedback, you know, like, hey, tell me about this. I remember, this is sort of a separate topic, but when I was doing a homestay, I was learning French one summer, and I couldn't pronounce certain words. Like, the difference between the word for street and the word for wheel. Mhmm. Like, la rue and la roue. Or I don't I still can't do it.
对吧?但我当时和我的寄宿家庭哥哥坐在一起,我就是试着模仿他一会儿。他几分钟后就不耐烦了。他说,"我不会一直坐在这里陪你"。而Astra会对你无限耐心,是的。
Right? But I sat there with my, like, homestay brother, and I was just, like, trying to copy him for a while. He just, like, blew me off after, like, a few minutes. He's like, I'm not sitting here with you. Like, Astra would be infinitely patient with you and Yeah.
可以帮助你处理那种事情。显然,还有记忆。所以我们有一个系统,拥有我们称之为完美的会话记忆。所以当摄像头运行时,基本上,它能像照片一样记住最后十分钟的内容,但它也会记住你过去谈论过的事情。这就是为什么它记得我是Greg。
Could help you with that kind of thing. Obviously, memory. So we have, you you know, a system that has perfect in session memory, we call it. So when the camera's rolling, basically, it remembers the last ten minutes photographically, but it will also remember what you've talked about in the past. That's why it remembers I'm Greg.
而且很可能如果我们重新打开它,然后问,"Astra,上次除了Greg还有谁在和你说话?"它会记得Hannah。所以这在某种程度上也可以用于有认知障碍的人。我认为我们也感到兴奋的一点是关于主动性的想法。它会自行判断你有某种需求,然后在不需要你实际引导的情况下,将回应导向那个需求。
And probably if we turn it back on and we say, Astra, who was talking to you besides Greg the last time? It'll remember Hannah. So this could be used for people with, you know, some cognitive impairments too at some point. I think one of the things that we're excited about too is this idea of proactiveness. So it deciding on its own that you have a need and then, you know, kind of channeling the response to that need without you actually needing to give it a steer.
例如,它可以是一个有用的提醒系统,通过回顾记忆并提示说,哦,别忘了,你需要在回家路上买这个或别的什么东西。
So for example, it could be a useful system for reminding you of things, going through the memories and saying, oh, don't forget, you need to pick this up on your way home or whatever.
所以你并不一定只在想和它交谈时才主动开启它,而是它可以一直在后台运行,然后在它认为合适的时候提出一些事情。
So you're not necessarily just proactively switching it on when you want to talk to it, but it could be there in the background and then it bring up something when it thought it was appropriate.
是的,是的。所以这个想法就像是,你知道,你正在回家路上,然后它就会说,嘿,别忘了你需要买一些橙汁,因为早上已经用完了之类的。
Yeah. Yeah. So the idea is like, you know, you are going home and it's like, hey. You know, don't forget that you need to, like, pick up some orange juice because it ran out this morning or whatever.
哦,哇。因为它记得早上看到过这个。
Oh, wow. Because it remembers having seen that in the morning.
是的,没错。是的。
Yeah. Exactly. Yeah.
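The memory behavior Greg describes, a rolling in-session window plus summaries written offline between sessions, can be sketched in a few lines. Everything here (the class, the window length, the join-based "summarization") is an illustrative assumption, not Astra's actual design:

```python
import time
from collections import deque

SESSION_WINDOW_S = 10 * 60  # the "last ten minutes" of in-session recall mentioned above


class MemorySystem:
    """Toy model: a rolling in-session buffer plus a long-term store
    filled by offline summarization between sessions. Purely illustrative."""

    def __init__(self):
        self.session = deque()   # (timestamp, event) pairs
        self.long_term = []      # summaries that survive across sessions

    def observe(self, event, now=None):
        now = time.time() if now is None else now
        self.session.append((now, event))
        # Evict anything older than the session window.
        while self.session and now - self.session[0][0] > SESSION_WINDOW_S:
            self.session.popleft()

    def end_session(self):
        # Offline step: compress the session into one summary line.
        # A real system would call an LLM here; we just join the events.
        if self.session:
            self.long_term.append("; ".join(e for _, e in self.session))
            self.session.clear()

    def recall(self, keyword):
        return [s for s in self.long_term if keyword in s]


mem = MemorySystem()
mem.observe("fridge is out of orange juice", now=0.0)
mem.observe("user left for work", now=120.0)
mem.end_session()
print(mem.recall("orange juice"))
```

A proactive reminder would then just be a query against `long_term` triggered by context (for example, detecting that the user is heading home).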
所以,我的意思是,我猜在这个阶段,这更像是在描绘可能性的想法,而不是
So, I mean, I guess at this stage, this is like painting ideas of what's possible rather
我们还没有那个功能。没有。但那是我们接下来可以构建的那种东西。是的。
than We don't have that yet. No. But that's that's the kind of thing that we could build next. Yeah.
但你可以从这里看到它的雏形。
But you can see the the beginnings of it in this.
是的。我的意思是,我可以轻松地说,这是我的冰箱,哦,不,橙汁不多了。然后我会说,嘿,你觉得我晚点去超市该买点什么?
Yeah. I mean so I could easily say, here's my fridge and, oh, no. There's not much orange juice. And then I'd say, hey. What do you think I should get at the supermarket later?
我知道它会记住的。是的,太神奇了。必须,你知道,给它多一点上下文,
I know it'd remember that. Yeah. Amazing. You have to, you know, sort of give it a bit more of a context, to
可以这么说,再多牵它一点手。是的,是的。你发现自己经常需要纠正它吗?我是说,你注意到故障了吗?
Hold its hand a little bit more as it were. Yeah. Yeah. Do you find yourself having to correct it a lot? I mean, do you notice glitches?
是的,是的。你知道,它偶尔会做的一件事是,它说它看不到它明明能看到的东西。比如,你知道,你看着一个书架,然后你问,你能读一下书架上的书名吗?然后它会说,哦,不行。
Yeah. Yeah. You know, so one thing it does once in a while is it says it can't really see something that it can clearly see. And so, like, you know, you'll be looking at a bookshelf, and you'll say, like, can you read the titles on the bookshelf? And it'll say, like, oh, no.
书名我看不清。然后你会说一些话,有点像对它施展绝地武士的心灵控制术,比如"不对,你能看到"。然后它就会
I can't make out the titles. And then you'll say something, like you'll kind of do a Jedi mind trick on it: yes, you can see it. And it'll be
说,是的,我能。然后它就会。对。这有点像
like, yes, I can. And then it'll then Right. This is sort of
一个奇怪的限制,是的。宜人性是你可以影响的东西。
a weird limitation of the of the yeah. The agreeableness is something that you can influence.
那么它是否容易受到鼓励的影响呢?是的。真的吗?我是说,嘿,这对人类也有效。一点鼓励就能让你突然做到以前认为不可能的事情。
Is it, like, susceptible to encouragement then? Yes. Really? I mean, hey, it works for humans too. A little bit of encouragement and suddenly you you can do things you didn't think were possible.
那么它在其他什么样的环境中会遇到困难?比如这里相当安静,光线也很好,没有太多繁忙的干扰。它在这些环境中也能同样正常工作吗?
So what other kinds of environments does it struggle with? I mean, it's quite quiet in here. It's quite well lit. You know, there's not lots of busyness going on. Does it work just as well in those kinds of environments?
繁忙、嘈杂、或许黑暗的环境?
Busy, noisy, dark perhaps?
在某种程度上,在更多环境中运行是我们需要发展的重要能力,特别是在嘈杂的条件下。就像我之前跟你说的,Astra是真的在"听"。它直接接收音频:神经网络接收声音并将其编码为某种信息包,由语言模型Gemini直接处理,但系统并未真正训练识别不同的说话人。所以当我们交谈时,它会难以区分你的声音和我的声音。如果有旁观者在交谈,Astra可能会将其当作用户的语音;它实际上有一个系统,当有人以足够的音量说话时会被唤醒并倾听一会儿,然后开始听取无关的语音,如果其中没有针对它的指令,它会感到困惑。
In some ways, operating in more environments is an important thing that we need to develop, in particular noisy conditions. So as I said to you, Astra really does hear. It actually takes in the audio directly: the neural networks take in sound and encode it as some kind of a package of information that is processed by the language model, Gemini, directly. But the system isn't really trained to identify different voices. So it will have trouble understanding your voice versus my voice when we're talking. So if there are other bystanders who are having conversations, Astra will pick that up as potentially the user's speech. Or it actually has a system that kind of wakes up and listens for a bit when there is somebody speaking with enough intensity, and it will just sort of start listening to, you know, errant speech and kind of be confused if there's nothing directed at it.
所以,是的,嘈杂的环境会让它困惑。
So, yeah, noisy environments will confuse it.
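The intensity gate Greg mentions, a listener that wakes when someone speaks loudly enough, can be approximated with a crude energy threshold over audio frames. This is a toy stand-in invented for illustration; real systems use trained voice-activity-detection models, and the frame size and threshold below are arbitrary:

```python
import math

FRAME = 160          # samples per frame, e.g. 10 ms at 16 kHz
THRESHOLD = 0.05     # RMS energy above which we treat a frame as speech


def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def wake_frames(samples, frame=FRAME, threshold=THRESHOLD):
    """Return indices of frames loud enough to wake the listener."""
    out = []
    for i in range(0, len(samples) - frame + 1, frame):
        if rms(samples[i:i + frame]) > threshold:
            out.append(i // frame)
    return out


# Quiet background noise, then a louder "speech" burst, then quiet again.
quiet = [0.01] * FRAME
loud = [0.2] * FRAME
signal = quiet + loud + quiet
print(wake_frames(signal))  # only the middle frame crosses the threshold
```

This also makes the failure mode obvious: a loud bystander crosses the same threshold as the user, which is exactly the confusion described above; separating the two needs source separation, not just an energy gate.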
当你说区分不同的声音时,是指像波形本身那样的区分吗?
When you say distinguish between the different voices, as in, like, in the kind of the waveform itself
所以有一个古老的问题叫做鸡尾酒会问题,这在技术上更常被称为源分离问题。它的核心是从多个声音源中分辨出单个声源。比如,如果有吉他和人声同时存在,你可以将其分离成两个音轨:吉他音轨和人声音轨。同样地,你可能想要区分一个说话人的音轨和另一个说话人的音轨。因此,这在单一模态或听觉感知中可能是可行的。
So there's an old problem called the cocktail party problem, which is what's more technically known as source separation. So it's understanding one sound source from another. So if there's, like, a guitar and someone singing, you could isolate that into two tracks, the guitar track and the singing track. Likewise, you might wanna be able to distinguish one speaker's track from another speaker's track. So that might be possible to do within the single modality or sense of audio.
在多模态意义上,通过整合跨感官信息,这也是可能的。例如,当我知道是你在说话时,我还能看到你嘴唇的运动,而不是别人的嘴唇运动。所以最终,你可以想象系统会利用各种线索,甚至改变它们感知声音的方式。
It would also be possible to do that in a multimodal sense integrating across senses. So for example, when I know that it's you speaking, I can also see the movement of your lips rather than movement of someone else's lips. So ultimately, you could imagine the systems would use all sorts of cues even to change the way they perceive a sound.
因为我觉得这在某种程度上正是让Project Astra如此困难的原因,但也是它潜力所在,因为正如你所说,人类非常擅长鸡尾酒会问题。你在鸡尾酒会上,尽管周围有很多声音,你仍然能听清旁边的人说什么。其实我不太行。说实话,我也是。
Because I guess this is in some ways the thing that makes Project Astra so difficult, but also the thing that gives it the potential, because the cocktail party problem, as you say, is something humans are extremely good at. You're in a cocktail party, and you can hear exactly what the person next to you is saying despite loads of voices going on all around. I have trouble, actually. Honestly, so do I.
这就是为什么
That's why
这是个难题。但总的来说,人类在这些方面相当擅长。而当你只有音频问题时,解决起来真的很难。但因为是 multimodal(多模态)的,你有视频、音频,还有后台运行的语言文本模型,你确实有更多的杠杆可以操作。
it's the problem. But broadly speaking, humans are quite good at these things. And when you have the audio only problem, it's really hard to solve. But because this is multimodal, because you have video, you have audio, because you have the text language model running in the background, you do have more levers to potentially pull here.
是的,是的。我认为它应该能够通过更多上下文来消除歧义。
Yeah. Yeah. I think it it should be able to resolve ambiguity with with more context.
那不同语言呢?目前它只支持英语吗?还是只适用于非常清晰的口音?
How about different languages? Is it only in English at the moment and only with the, like, a very clear accent?
对我来说主要是英语,但也不全是。它是非常多语言的。作为原生音频功能,它精通大约20种语言,并且你甚至可以在同一对话中切换语言。
It's mostly in English for me, but no, it's very multilingual. So as a function of being native audio, it knows about 20 languages with pretty high proficiency, and you can switch between languages even in the same conversation.
那么继续吧。给我演示一下不同的语言。
So go on. Give give me a little demo of different languages.
你好,Gregory。很高兴再次与你交谈。
Hello, Gregory. It's nice to speak with you again.
小红发。
Little redhead.
明白了。那么
Got that. So
你能等一下吗?我在学校学过俄语。但我一个俄语短语都记不起来了。你能在不特意提醒的情况下中途切换语言吗?比如我说,你
can you just hang on? I did Russian in school. I can't remember one remaining Russian phrase. Can you switch languages in the middle without necessarily warning it? So I say, for example... You
当然可以使用那个短语,但你想问关于这个短语的什么呢?
certainly could use that phrase, but what did you wish to ask about this phrase?
这是什么意思?
What does it mean?
意思是,"它什么时候开门?"我刚才在问商店的营业时间。好的。
It means, at what time does it open? I was asking about the opening time of the shop. Okay.
我的意思是,它是在问商店什么时候开门,我想。但值得注意的是,你现在不是在用英语说,也不是用法语或俄语说。
I mean, it is what what time is the shop open, I think. But notable then that you're not saying now in English, now in French, now in Russian.
我认为它是针对你开始使用的语言来回应。所以实际上,你说了些英语,然后给出了俄语。我想如果你一开始就说俄语,它就会用俄语回应,但当时的情况是,它认为我在说英语,但我听到的是俄语。所以你不需要改变这一点,但如果你一开始就说俄语,可能会更好一些。
I think it's targeted to respond in the language that you started with. So actually, you said something in English and then gave the Russian. I think if you'd started speaking in Russian, it would have responded in Russian, but as it was, it was thinking, I'm speaking English, but I am hearing Russian. So you didn't have to change that, but if you'd just started speaking in Russian, it would have been maybe a little bit better.
我的意思是,但这还是不同的,对吧?我是说,与我们目前拥有的聊天机器人相比,这就像是一种额外的能力。
I mean, this is different though. Right? I mean, from from the chatbots that we have at the moment, this is like this this is an additional capability.
实际上,我对用这个系统学习语言感到非常兴奋,比如四处走动时问‘那是什么?’,让它以我在学校被教导的方式来教你,就像我们在法语课上带来物品并讨论那些物品来学习东西一样,你知道,大家一起学习语言。
I'm actually really excited about language learning with this system, like walking around and being like, what is that? Having it teach you the same way that I was taught in in school where we would bring in objects and talk about those objects in French class to, like, learn about stuff, you know, being together and learning language.
我可以想象在一个陌生的城市迷路时,这会是一个非常有用的助手。
I can imagine being lost in a foreign city and that being quite a helpful aid.
没错。是的。而且它也应该能够相当自然地理解别人对你说话。
Exactly. Yeah. And it should be able to understand other people speaking to you, like, quite naturally too.
所以,如果你正在与这个东西互动,底层实际上发生了什么?有哪些不同的组件?
So So if this is the thing that you're interacting with, what's actually going on underneath the hood? What are all the different components?
是的。首先有一个应用程序,它实际上是通过麦克风等设备收集你的视频和音频。然后它会连接到一个服务器,上面运行着几种不同类型的神经网络模型。
Yeah. So the first thing is there's an app, and that is actually gathering your video and taking in your audio through the mic and so forth. And that's connecting to a server on which there are several different kinds of neural network models.
比如哪些?
Like what?
有视觉编码器和音频编码器。还有专门的音频系统,负责判断你可能何时停止说话。这些系统与大型语言模型Gemini并列运行,它们将这些感官编码器的信息直接发送给Gemini,由Gemini做出响应。我们还与Gemini的一些团队合作,改进了Gemini模型,使其更擅长对话和音频处理。因此,我们提升了它处理音频、接收音频并说话的能力。
So there's a vision encoder and an audio encoder. There's also specialized audio systems that are just responsible for understanding when you've probably stopped speaking. Those are sitting next to the large language model, Gemini, and they are sending information from these sensory encoders directly into Gemini, which is responding. We worked together with some of the teams in Gemini also to change the Gemini model to be better at dialogue and audio processing. So we've kind of improved its ability to take in audio and speak.
当我们开始使用这些模型时,它们犯了很多事实性错误。所以我们必须找出方法,在保持对话自然的同时提高它们的准确性。这是我们改进Gemini的一个方面。除此之外,还有一个叫做代理(agent)的部分。代理负责接收视频和音频,并将其发送给模型。
When we started working with the models, they were making lots of factual errors. So we had to kind of identify ways in which we could improve their factuality while also being kind of conversational. That was one aspect of our work on Gemini. On top of all of that though is something called an agent. The agent is taking the video and audio in, sending it to the model.
它还会调用搜索工具,比如Google Lens、Google搜索或Google地图,以响应查询。例如,如果你询问某物的价格,它会调用搜索功能。代理还包括一个记忆系统。在会话之间的离线时段,记忆系统会总结本次会话中关于你以及讨论内容的相关信息。这些就是其中的一些组成部分。
It's also calling search tools, so either Google Lens or Google Search or Google Maps when needed to respond to a query. So if you ask about the price of something, it will call search. There's also a memory system that is part of the agent. And offline in between sessions, the memory system will summarize relevant information from this session about you and about what you've talked about in that session. So those those are some of the ingredients.
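As a rough mental model of the pipeline just described (sensory encoders feeding Gemini, with an agent layer routing tool calls), here is a toy sketch. Every function and name is a stub invented for illustration; none of this corresponds to real Google APIs or to Astra's actual implementation:

```python
# Illustrative sketch only: encoders -> model, with an agent routing tool calls.

def vision_encoder(frame):
    # Stand-in for a neural image encoder producing a packet for the model.
    return {"modality": "vision", "content": frame}


def audio_encoder(chunk):
    # Stand-in for a neural audio encoder.
    return {"modality": "audio", "content": chunk}


def search_tool(query):
    # Stand-in for Lens / Search / Maps tool calls.
    return f"results for '{query}'"


def model_stub(packets, tool_result=None):
    # A real system would call the Gemini model here; we fake a reply.
    text = " ".join(p["content"] for p in packets)
    if tool_result:
        return f"Saw/heard: {text}. Search says: {tool_result}"
    return f"Saw/heard: {text}"


class Agent:
    """Takes video and audio in, sends it to the model, and decides
    when to call a tool, mirroring the description above."""

    def __init__(self):
        self.transcript = []

    def step(self, frame, chunk):
        packets = [vision_encoder(frame), audio_encoder(chunk)]
        # Crude routing rule: price questions go to search first.
        tool_result = search_tool(chunk) if "price" in chunk else None
        reply = model_stub(packets, tool_result)
        self.transcript.append(reply)
        return reply


agent = Agent()
print(agent.step("a book cover", "what is the price of this book"))
```

The point of the sketch is the separation of concerns: encoders turn raw sensors into packets, the model produces language, and the agent layer around them owns tool routing and state.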
我是说,我在试着想象。我们只是用某种东西来识别一本书,对吧?我在思考这里涉及了多少不同的元素,因为你既有计算机视觉,又有语音识别,还有大型语言模型,还有谷歌搜索作为底层支持。还有代理层,你实际上在那里做决策,而且所有这些几乎在它给出的答案中没有任何延迟。
I mean, I'm trying to imagine. That's something we just used to recognize a book. Right? I'm trying to think of, like, the number of different elements that are coming into play here, because you've got the computer vision, you've got the voice recognition, you've got the large language models, you've got the Google search sort of sitting underneath it. You've got the agent layer where you're actually making decisions, and you're doing all of that with, like, almost no latency at all in the answers that it's giving you.
我的意思是,这就像是一个极其复杂的事情。
I mean, it's like a phenomenally complicated thing.
是的。我是说,极其复杂。当然,你知道,作为工程师,我们提出了抽象层,这样我们就不必一次考虑所有层次的复杂性。但总的来说,我认为它是非常复杂的。输入模型的数据只有极少数人理解,而它为什么会产生这样的结果,可能没有人真正理解,因为这只是基于基准测试。
Yeah. I mean, phenomenally complicated. Of course, you know, as engineers, we come up with abstraction layers so that we don't have to think about all the levels of complexity at one time. But I think overall, it's hugely complicated. The data that's going into the models is understood by very few people, and exactly why it produces the results it does is probably understood by no one in a sense, since it's just based on benchmarks.
好吧,让我稍微谈谈这个的历史,因为在第一季播客中,你是第一集的嘉宾,当时你从动物界汲取灵感来研究智能。具体来说,你告诉我们有一种鸟,西方灌丛鸦,作为启发AI更复杂记忆的方式。实际上,让我给你播放一小段录音。拥有一个你做过和见过的事情的大型数据库,你可以访问它,并用它来指导你后来的目标导向行为。你知道,我饿了。
Well, let me talk a little bit about the history of this because so so back in the first series of this podcast, you were a guest on the very first episode, and then you were drawing on inspiration from the animal kingdom for your research on intelligence. And specifically, there was a bird, the Western scrub jay, that you were telling us about, as a way to inspire more sophisticated memory for AI. Let me just play you a little clip of it, actually. Having a kind of large database of things that you've done and seen that you can access and that you can use to then guide your your goal directed behavior later. You know, I'm hungry.
我很想
I would love
现在吃一些蛆。我应该去哪里找它们?你知道?这就是我们想要复制的那种事情。
to have some maggots right now. Where should I go find those? You know? That's that's the kind of thing we would like to replicate.
你成功了吗?
Have you managed to?
你好,Astra项目。你能帮我找一些蛆虫吗?
Hello, project Astra. Can you find some maggots for me?
我的意思是,这听起来很像你的橙汁例子,不是吗?
I mean, that sounds quite a lot like your orange juice example, doesn't it?
这是一个主动记忆的例子。是的。
It is a proactive memory example. Yeah.
是的。这就是你在Astra项目中实现的。
Yeah. And that's what you've done with Project Astra.
是的。我认为,在某种意义上,智能真的是一体的,对吧?一个人有职业生涯,研究智能是什么,以各种方式尝试去理解它,就像以不同方式与之切磋一样。
Yeah. I think that there's a sense in which, like, intelligence is really one thing. Right? And one has a career and one is studying what intelligence is. One is taking kind of glancing hits at it, kind of like trying to spar with it in one way or another.
而这个项目可能是我一生中所有研究线索的最强统一。不过,实际上它缺少一个主要方面,那就是它没有实体化,无法在物理世界中行动。是的。暂时还不能,也许吧。
And this project maybe is the sort of strongest unification of all of the strands of research I've had in my life. Although, actually, it's missing a major one, which is that it's not embodied in a physical sense. It can't act in the world. Yeah. Yet, perhaps.
是的,是的。所以,我认为记忆、感知这些一直是我长期关注的兴趣点,而这是一种将它们结合起来的方式,似乎也能激发人们的共鸣。你知道,他们感觉与之相连。
Yeah. Yeah. So, yeah, I think memory, perception, these have been longstanding interests, and I think this is a way of bringing them together that people seem to also find stimulates them. You know, they they feel connected to it.
那么,你的神经科学背景最终在多大程度上启发了Astra项目?
So how much of your neuroscience background did end up inspiring Project Astra?
神经科学在两个方面被应用。一是我们利用神经科学来判断何时做得足够好,比如思考记忆的真正含义是什么,以及我们是否已经实现了它。它也是一种推动力,比如说,如果我们想要一个与我们兼容、某种程度上像我们一样的东西,也许更倾向于具身化,让它更像我们,而不是一个简单的文本界面。例如,我一直对迈克尔·托马塞洛的研究很感兴趣,他通过与大猩猩比较来研究人类交流。他可能是这种对话理念背后的主要思想家,对我而言,他谈到交流的基本前提是两个个体处于同一地点,将注意力集中在同一处,从而共同推断目标并能够协作。
So neuroscience is used in two ways. One is that there's a sense in which we're using neuroscience to know when we've done a good enough job, to think about, like, what does memory really mean, and have we achieved it yet? And it's also just a bit of a propulsion. Like, say, you know, if we want something that is compatible with us, human compatible and in some ways like us, maybe we go towards something embodied that's a little bit more like us rather than a kind of straightforward text interface. For example, I have been interested in the work of Michael Tomasello, who studies human communication by comparison to the great apes. And he's really maybe the main thinker, for me, behind this idea of dialogue, where he talks about the basic premise of communication as being about two individuals who are in the same place, who are directing attention in the same place, and therefore inferring goals together and then able to collaborate.
这有点像我们在这项技术中所建模的内容。
And that was kind of like what we modeled in this technology.
所以它更像是灵感层面的启发,而不是在理论层面上,也不是直接复制设计。
So it's like the inspiration rather than necessarily at the kind of the theoretical level, rather than actually like directly copying the design of
设计本身。不是为了解决问题或工程本身,那么我认为你需要根据技术本身提出不同的解决方案。
the design. Not for the problem solving or the engineerings per se, then then I think you need to come up with different solutions that are dependent on the technology itself.
如果Astra项目与我们几年前谈论的内容相关,那么这个项目的最初火花是从哪里来的?它实际上是什么时候开始的?
If project Astra links to things that we were talking about, I mean, literally years ago, where did the first spark for this project come from? Like, when did it actually begin?
是的。我想,我知道DeepMind的首席执行官戴密斯·哈萨比斯以某种方式向公司提出了一个挑战,即让我们思考什么是原型通用人工智能。这意味着什么?原型通用人工智能是这样一个系统:如果我们创造了它,技术背景的人能够审查、研究、使用和体验它,他们会得出结论,认为真正的通用智能,即具有通用智能的计算设备,最终将会到来。这只是时间问题,而不是是否的问题。
Yeah. So I think, I know that Demis Hassabis, the CEO of DeepMind, kind of threw down a challenge to the company in a way, which was for us to think about what a proto artificial general intelligence was. What does that mean? A proto artificial general intelligence is a system such that, if we created it and technically minded people were able to scrutinize it, investigate it, use it, experience it, they would conclude that the real deal, something that is generally intelligent in a computational device, was ultimately going to arrive. It was a matter of when, not if.
但那一点当时并未明确说明。所以当时有很多创造性的思考,比如,也许是这个,也许是那个,等等。有些人曾有过类似的想法,认为智能会像Alpha Zero那样通过与世界互动而自然产生。其他人可能有其他想法,但我的想法非常侧重于智能的社会性。你看,作为人类,我们如果不向他人学习或从书本中学习(这等同于向他人学习),我们就不会很聪明。
But that was left unspecified. So there was a lot of creative thinking at the time, like, well, maybe it's this, maybe it's that, you know, and so forth. And some people had kind of ideas of an intelligence arising the same way that AlphaZero arose, just by interacting with the world. Other people maybe had other ideas, but my idea was very much about the sociality of intelligence. So, you know, we are not very smart as human beings unless we learn from others, or we learn from books, which is the same as learning from others.
这就是我对原始AGI(通用人工智能)的设想。然后我还想到,我们可以将原始AGI与一个以造福人类为主要目标的助手概念统一起来。所以也许这两者结合在一起给了我某种方向上的启发。接着,当我试图思考如何让它最终变得非常自然时,我逐渐倾向于将视频视为系统最终的联系纽带。
And that was the kind of idea I had for what proto AGI would be. Then I thought also we could unify proto AGI with the idea of a helpful assistant whose main goal is the benefit of the humans it interacts with. So maybe those two things together gave me something of a direction to look in. And then, when I tried to think about making it ultimately very natural, I sort of moved towards thinking about video as the ultimate connective fiber of the system.
在这个过程中,有没有一些重大的突破时刻?
Were there big moments along the way where you had these big breakthroughs?
我们有过重大突破?是的。我认为这个项目经历了几个阶段。项目的第一个阶段基本上是一个黑客马拉松,我们用两周时间做出了第一个版本。我们还有那时的视频,相当粗糙。
Where we had big breakthroughs? Yeah. You know, I think there were these phases of the project. The first phase of the project was basically a hackathon where we had two weeks of making the first version. And we have a video from that time, and it was quite crude.
但我记得马尔科姆·雷诺兹,他是这里的朋友兼工程师,当时在摆弄Astra。他在一个办公室里转悠,问,这是什么?系统会说,一株植物。他会问,这是什么植物?然后系统会说,一株植物。
But I remember Malcolm Reynolds, who's a friend and engineer here, was playing around with Astra. And he was going around an office room and saying, what is this? And the system would say, a plant. And he'd say, what kind of a plant is this? And it would say, a plant.
它并不是非常灵活。我记得我第一次看到的演示有七秒的延迟。所以
It wasn't super flexible. I remember the first demo I ever saw had a seven second latency. So
你会说,嗨,Project Astra或者当时它叫什么名字来着?它叫什么?
you would say, hi, Project Astra or whatever it was called then. What's it called?
是的。没错。
Yeah. Right.
然后七秒钟后,它才会回应。它非常难用,
And then seven seconds later, it would respond. It was very difficult to use at all
因为你可能会以为它已经消失了。对吧?但它只是七秒钟后就会回到你身边。我认为当时的一个主要发现基本上是,你知道,有一个提示的概念。提示是你给系统进行操作所需的指令。
because you would kind of think it had gone away. Right? But it was just seven seconds later, it would come back to you. I think one of the main discoveries of the time was basically that, you know, there's this idea of a prompt. A prompt is the instructions you give to the system that it needs for operation.
所以像这样的系统确实能理解它们能阅读的语言,你可以对它们说,你的名字是Astra。你是一个智能、有帮助的AI助手。现在Gemini模型已经内置了部分这样的信息,但有些是通过我们的提示来指定的。在此之前,我们并不真正理解是否能够很好地提示一个多模态系统。当时一个令人震撼的内部认识是,仅仅告诉系统它可以通过用户的摄像头看世界,就让它对自己的视角有了感知,比如这些信息的来源。
So systems like this really understand language they can read, and you can say things to them like, your name is Astra, you're an intelligent, helpful AI assistant. Some of that information is inherent in the Gemini models now, but some of it is indicated in our prompt. And it wasn't really understood before whether we could prompt a multimodal system very well. One of the mind blowing insider realizations at the time was that just telling the system it could see the world through the user's camera gave it a sense of its own perspective on things, like the provenance of this information.
它之前并不理解这一点。总是犯错。比如当你问,你看到了什么?它总是给我错误的答案。但当我们说,你是一个通过用户摄像头观看的AI系统时,它就能理解为什么这个摄像头是它实际上在观看的东西,并且会正确回答。
It didn't understand that before. It was always making mistakes. Like, when you'd say, what do you see? It was always giving me the wrong answer. But when we said, you're an AI system that is seeing through the user's camera, then it could understand why this camera was something it was effectively seeing, and it would answer correctly.
我的意思是,那之后还有很多工作要做,但我们意识到,尽管它与我们之前构建的系统类型不同,我们仍然可以有效地提示它,并且可以用文本来提示它那种情境化的、更具身的理解。
I mean, there was a lot of work to do there, but we realized that we could effectively prompt it, even though it was a different kind of system than we'd built before, and that you could use text to prompt its understanding of its situated, or more embodied, perspective.
这太有趣了。当挑战被抛出要创建一个原型AGI时,是否有人表示怀疑或质疑,认为这样的事情可能是不可能的?
That's so interesting. When the gauntlet was thrown down of create a proto AGI, were there people expressing doubt or skepticism that that something like this might have been possible?
是的。我的意思是,在AI领域,事后诸葛亮是很有趣的,因为它发展得太快了,而且人们对什么显而易见之事的认知变化也极快。我觉得现在在某些方面对很多人来说似乎很明显了,这让我感到震惊。我就想,你们知道当初面临了多少阻力吗?需要说服多少人?
Yeah. I mean, hindsight is curious in AI because it moves so fast, and people's perception of what is obvious changes so fast. I think it's, like, obvious to a lot of people now in some ways, which blows my mind. I'm like, do you know the adversity? How much convincing had to happen?
对吧?
Right?
嗯,告诉我们。有多少?
Well, tell us. How much?
所以我认为从很多不同的角度来看,人们都觉得做这件事很奇怪。从系统能否真正理解世界的角度来看,那个时代的视觉系统,就它们处理的像素数量而言,大概是96x96的图像块。对于那些不了解的人来说,我们屏幕的最小分辨率也差不多是1000x1000像素。所以,你知道,输入到这些系统的图像非常模糊。难怪它无法识别出是什么植物。
So I think, from many different perspectives, people thought this was an odd thing to do. From the perspective of whether the systems could actually understand the world at all: the vision systems in that era, in terms of the number of pixels they were taking in, worked on something like 96 by 96 patches of image. For those who don't know, the minimum resolution of our screens is something like a thousand pixels by a thousand pixels. So it was a very blurry input to these systems. No wonder it couldn't identify what kind of a plant it was.
它几乎看不清,对吧?这些系统要真正了解它们所看到的信息,而不仅仅是能够识别或分类它们看到的东西。所以,进行深入的对话似乎有点太超前了。我们甚至对系统在不同性能水平上需要多少数据都没有基本概念。
It could barely see, right? The idea that these systems would really know information about what they're seeing, rather than just being able to identify or classify it, so having a deep conversation about something, seemed like probably a little too far ahead. We didn't even have basic knowledge of the amount of data you need for systems to perform at various levels.
那么,我的意思是,好吧。如果这一切看起来如此荒谬,但你还是着手去做了,有没有什么时候你觉得它不可能实现?
So then, I mean, okay. If all of this seemed so absurd and yet you embarked on it anyway, were there times when you thought it wasn't gonna be possible?
没有。没有。没有。没有。没有。
No. No. No. No. No.
看起来总是有可能的。有时候我可能愿意放弃。哦,真的吗?是的。我想是在Gemini之前有一段缓慢期,那时事情进展不太顺利,那是一段艰难时期。
It seemed always like it would be possible. There were times that I was maybe willing to give up. Oh, really? Yeah. I think there was a slow period before Gemini where things weren't working very well, and it was hard times.
我认为当时对某些人来说,这似乎不是一个富有成效的研究方向。我从未动摇过认为这绝对是可能的。我觉得我采取了一种更加固执、顽固,甚至有点愚蠢的方式,就是只要我坚持足够长时间,它一定会成功。
I think it didn't seem like a fruitful line of investigation at the time, for some people. I sort of never wavered about the fact that this was definitely possible. I think I had a much more obstinate, stubborn, and kind of ultimately stupid way of going about it, which was just: if I work on this for long enough, it will definitely work.
所以我听说在测试阶段,你们有一个Astra项目房间。那里发生了什么?房间里有什么?
So I heard that as part of the testing phase, you have this Project Astra room. What what's going on in there? What's in the room?
有一个特别的房间。是的。
There is a special room. Yeah.
特别房间里面有什么?
What what's inside the special room?
我们在特别房间里准备了各种有趣的游戏和活动。那里有一个完整的酒吧,Astra可以帮你调酒。没错。还有一个艺术画廊,你可以在屏幕上展示不同的画作,在画廊里走动并就那些艺术品提问。
We have just, like, all sorts of fun and games in the special room. There's a a whole bar there, so Astra can help you make a drink. Right. There's a an art gallery, so you can flash up different paintings on screens and walk around the gallery and ask questions about that art.
好的。那么让我们更深入地了解一下Astra幕后的一些事情。延迟,我认为是一个非常关键的问题。你刚才提到了过去会有七秒的延迟。你们实际上是如何改进这一点的?
Okay. Well, let's dig into some of the stuff that's going on behind the scenes of Astra a little bit more. Latency, I think, is a really key thing. You mentioned a moment ago the seven second lag that you used to get. How have you actually improved that?
所以这是多方面的改进。我们提升了实际的流媒体视频传输,让信息通过应用传输得更快。从某种意义上说,这些系统虽然是一起训练的,但有一个视觉系统和一个音频系统,还有一个语言模型系统从这两者获取信息,也就是所谓的共置处理——这是个技术术语,但基本上我们一直在处理图像。例如,当视频输入到视觉系统时,它始终以最快速度运行。然后它与大型语言模型位于同一个计算机集群的同一位置,这样就不需要进行跨国家或跨大陆的调用。
So it's on multiple fronts. We've improved the actual streaming video, so it is sending information faster through the app. There's a sense in which these systems, although they're trained together, consist of a vision system and an audio system, and a language model system that is getting the information from those two. We're also colocating them, that's a technical term, but basically we're always processing images. As the video comes in to the vision system, for example, it's always running as fast as it can. And then it's sitting in the same place, in the same cluster of computers, as the large language model, so that it doesn't have to make a call across, you know, a country or a continent.
什么?所以你是说……抱歉。
What? So you're saying... sorry.
所以它们在运行
So they're running
为了实现对正在发生的事情近乎实时的理解,你必须将这些运行模型的计算机硬件物理上放置得靠近彼此,因为这会产生影响。
at... so to get a kind of real time understanding of what's going on, you have to physically locate the computer hardware that's running these models close to each other, because that makes a difference.
完全正确。是的。
Absolutely. Yeah.
那么这主要是关键吗?就是调整模型实际运行的位置?
Has that been the main thing then? Just moving where you're actually running the models?
不。将模型放在一起是一方面,确保我们缓存上下文也很重要,这样系统与你交互的历史上下文会随时间逐步更新。还有一个理念是使用原生音频进行处理,这意味着以前的系统有一个文本识别系统,即语音转文本识别系统。它们会接收音频,生成转录文本,然后调用语言模型来响应,最后你得到回复。而这个系统是直接接收音频输入的,所以不需要那个额外的系统,这既节省时间又减少了额外延迟。
No. So putting the models together is one thing. There's also making sure that we're caching the context, so that the context, the history of what the system has been interacting with you about, is incrementally updated over time. Then there's the idea of working with native audio, which means: previous systems had a text recognition system, a speech to text recognition system. They'd take the audio, produce a transcript, then call the language model, which would respond to that, and then you'd get a response. This system is directly getting the audio in, so it doesn't need that secondary system, which also takes time, or produces extra latency.
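The cascaded versus native-audio pipelines described here can be sketched as a toy latency budget. The stage timings below are invented purely for illustration, not measurements; only the structure, an extra sequential transcription stage versus none, mirrors the description:

```python
# Toy latency budget contrasting a cascaded pipeline (speech-to-text,
# then language model) with a native-audio one where the model consumes
# the audio directly. All timings are made-up illustrative numbers.

CASCADED_STAGES = {
    "speech_to_text": 0.8,   # transcribe the audio first
    "language_model": 1.2,   # respond to the transcript
}

NATIVE_STAGES = {
    "language_model_on_audio": 1.2,  # audio goes straight into the model
}

def total_latency(stages):
    """Sum per-stage latencies for a strictly sequential pipeline."""
    return sum(stages.values())

cascaded = total_latency(CASCADED_STAGES)
native = total_latency(NATIVE_STAGES)
print(native < cascaded)  # True: dropping the transcription stage saves time
```

The point is structural, not numerical: removing a sequential stage removes its latency from the total, whatever the real stage timings are.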
实际上,原生音频处理带来的一个简单效果是,它能理解生僻词或单词的发音。一个虽然现在不那么生僻但仍属罕见的词,或者说名字,是Demis Hassabis。过去那些不能直接原生理解音频的系统常常以为我说的是Damascus,但现在它知道是Demis Hassabis,并且能利用上下文来解析。就像,DeepMind的CEO是Demis Hassabis。
So, actually, a simple effect that's possible with native audio is that it can understand rare words, or the pronunciation of words. A rare word, or rather a name, although it's becoming not so rare anymore, is Demis Hassabis. The old systems that didn't understand audio natively often thought I was saying Damascus, but now it knows that it's Demis Hassabis, and it can use context to resolve that. Like, the CEO of DeepMind is Demis Hassabis.
最近有人发现的另一个例子,我们有一个小演示,是区分scone和scone这两个词,它们是同一种饼干状食物的两种发音。Project Astra实际上可以让你问,你知道scone和scone有什么区别?它会听到你说了不同的词,而不仅仅是转录成同一个词。最后一点是Kheem在所谓的端点检测方面做了很多出色的工作,这是一个非常技术性的术语,但大致上它能精确知道您何时停止说话。所以它非常擅长感知,比如,好了,用户真的说完了,现在我可以说话了。
Another example that somebody found recently, which we have a little demo of, is distinguishing between the word scone and the word scone, which are two pronunciations of the same biscuity thing. With Project Astra, you can actually say, what's the difference between scone and scone? And it will have heard that you said a different word, rather than just transcribing it into the one word. Then the final one is that Kheem did a lot of great work on what's called endpointing, which is a very technical term, but more or less it means it knows exactly when you have stopped speaking. So it's very good at sensing, like, okay, the user's really done now, so I can talk.
然后还有更复杂的一点是,它甚至在您还没说完的时候就开始规划回应。哦。它有点像是在进行推测性规划。所以它有点像在猜测,这就是我会说的内容,你知道,然后当它确定用户真的说完了,就立刻发送出去。所以它已经完成了。
Then there's something even more sophisticated, which is that it plans a response even if you haven't finished speaking yet. Oh. It's speculatively planning that. So it's sort of guessing, this is what I would say, you know, and then when it figures out that the user really has finished speaking, it just sends it right off. So it's already done it.
它甚至在您可能自己都还没完全意识到已经说完之前,就已经想好要说什么了。
It's already figured out what to say before it's before maybe even you really know that you're done speaking.
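The speculative planning described above can be caricatured in a few lines: keep re-drafting a reply from the partial utterance while the user is still talking, so that the moment a (toy) endpointer decides the turn is over, the reply is already there. `draft_reply` and `is_endpoint` are hypothetical stand-ins, not Astra's actual components:

```python
# Toy sketch of speculative response planning. Both helper functions
# are hypothetical stand-ins for real model components.

def draft_reply(partial_utterance: str) -> str:
    """Stand-in for a model call drafting a reply to what's heard so far."""
    return f"reply to: {partial_utterance}"

def is_endpoint(chunk: str) -> bool:
    """Toy endpointer: a terminal '.' or '?' marks the end of the turn."""
    return chunk.rstrip().endswith((".", "?"))

def converse(chunks):
    heard, draft = "", ""
    for chunk in chunks:
        heard += chunk
        draft = draft_reply(heard)  # speculate before the user is done
        if is_endpoint(chunk):      # turn over: the reply is already ready
            return draft
    return draft

print(converse(["what kind ", "of plant ", "is this?"]))
# prints: reply to: what kind of plant is this?
```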
这太有趣了。因为我猜实际上很多时候,人们句子的重要部分可能在中间,然后到结尾时就有点拖沓,你可以利用那段时间来准备你的答案。
That is so interesting. Because I guess, actually, a lot of the time the important bit of people's sentences can be in the middle, and then they sort of trail off towards the end, and you can use that time to get ready with your answer.
是的。差不多就是这样。是的。哦,是的。我们实际上讨论过这些东西,我们三年前就讨论过那些内容,然后当时觉得,那太过了,然后它就开始在这方面奏效了
Yeah. Pretty much that. Yeah. Oh, yeah. We talked about this stuff actually, we talked about that stuff three years ago, and then it seemed like, that's too much, and then it kind of started to work this
今年。在对话还没进行到那一步之前,就预先猜测答案会是什么。
year. Preemptively guessing what the answer's gonna be before the conversation has has got to that point.
是的。而且这很难,你知道。我们在句子中会停顿很长时间。对吧?所以我们现有的系统实际上必须使用一些所谓的语义理解,因为它对上下文也有一定的理解,并且它也在通过声音来猜测用户可能什么时候说完了。
Yeah. And it's hard, you know. We pause for a long time in our sentences. Right? So the system we have actually has to use some, quote, semantic understanding, since it also has a bit of understanding of context, and of the sound it's hearing, to guess when the user's probably done.
但它所做的推理呢?我的意思是,即使抛开推理不谈,光是判断它是否完成了一个句子。你认为Astra项目具备推理能力吗?
But also the reasoning that it's doing. I mean, even separate from reasoning, whether it's finished a sentence or not. Do you think that project Astra is capable of reasoning?
是的。它主要通过神经网络内部结构以一种不可观察或非常复杂的方式进行推理。然后还有它产生的对话本身。所以它有时会通过对话进行推理,你可以听到它在推敲答案。人们也在开发具有内部独白功能的系统,它们实际上是在自言自语而不与你交流。
Yeah. It's primarily reasoning through its internal structure inside the neural network, in an unobservable or very complex way. Then there is the dialogue itself that it's producing. So it sometimes reasons through the dialogue, and you can kind of hear it sounding out an answer. People are also developing systems that effectively have inner speech, where they're talking to themselves without talking to you.
目前Astra项目在这方面做得不多。
Project Astra at the moment doesn't do much of that.
但我猜推理模型的进步不一定与Astra项目的发展截然分开。就像,我认为整个重点在于它整合了一切,从而形成了你所说的那种终极原型AGI。
But then I guess the advances that happen in the reasoning models need not be distinct from what happens in Project Astra. Like, I guess the whole point of this is that it's pulling in everything, so that you have this ultimate proto AGI, as you called it.
是的。在某种程度上,我实际上希望这能激励人们在推理的某些方面进行更积极的工作。我们有个很好的例子,Astra产品经理Bibo Xu有天午餐时拿出Astra问道:我的盘子里有多少卡路里?嗯。她有一个非常复杂、摆放精美的盘子,有六种食物,中间有些杏仁,那边有块猪里脊,还有些抱子甘蓝之类的。
Yeah. And in some ways, I actually hope that it motivates some more vigorous work on some aspects of reasoning. So we have this great example: Bibo Xu, the product manager on Project Astra, pulled out Astra one day at lunch and was like, how many calories are on my plate? Mhmm. She had a very complex, very beautifully laid out plate with, like, six types of food: some almonds in the middle, a pork loin over there, some brussels sprouts or whatever.
然后它有点犹豫不决。但她说:保持持续计算,这些抱子甘蓝有多少卡路里?它就说:嗯,那是七个抱子甘蓝,所以是这么多卡路里。然后好吧。
And it kind of waffled a little bit. But then she said, keep a running total, you know, and how many calories are in these brussels sprouts? And it was like, well, that's seven brussels sprouts, therefore it's this many calories. And then, okay.
现在加入猪里脊肉。让我印象相当深刻的一点是,Bibo是在手动引导它的思考,正如你所说,它有时需要一点指导。但我认为我们离这样一个系统并不遥远:它会自己说,我看到那边有七颗杏仁,这么多抱子甘蓝,还有一块猪里脊肉,所有这些加起来如何如何。所以在某种意义上,它不擅长这类事情,仅仅是因为我们从未尝试构建能够对此类事物进行推理的系统。
Now add the pork loin. One of the things that was quite notable to me was that Bibo was hand holding its thinking. As you said, it needs a little guidance sometimes, but I don't think we're very far off from a system that would itself just say, well, I see there are seven almonds over there, this many brussels sprouts, there's a pork loin, and all those together come to such and such. So in some sense, it's not good at that stuff simply because we've never tried to build a system that could reason about it.
现在我想更详细地和你聊聊记忆功能。关于它回忆和记住内容这一点,请原谅我拟人化的说法——我知道在Google I/O大会上,它只能记住过去45秒内发生的事情。而现在你们延长了这个时间。
Now I wanna talk to you a bit more about memory. On that point about the things that it's recalling and keeping in its mind, as it were, if you'll forgive the anthropomorphism. I know that back at Google I/O, it could remember what happened in the last forty five seconds. Right? And now you've increased that time.
现在能达到十分钟了,对吧?
You can do ten minutes now. Right?
是的,大约十分钟。实际上在某些方面还能更长一些,但十分钟就是它标称的能力。
Yeah. It's about ten minutes. Actually, it's a little bit longer in some ways, but ten minutes is what it says on the tin.
是什么因素将上限设定为十分钟?
What makes ten minutes the limit?
它本质上会原始记录最近十分钟的视频数据。系统以每秒一帧的速率工作,因此会按时间顺序存储所有帧图像以及这些帧之间输入的音频数据,大约保存最近600帧的内容。限制因素主要在于芯片的内存容量——我认为这种高速活动内存在过去十年左右并没有显著提升。
Yeah. So it's got basically a raw record of the last ten minutes of video. It works at about one frame per second, so it's got basically a stack of all the frames over time, and all the audio that came in between those frames, for the last 600 frames or so. The limits are really about the memory on the chips, I think. The amount of this sort of fast active memory hasn't scaled very much in the last decade or so.
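That raw record, one frame per second with the audio in between, behaves like a fixed-size ring buffer. A minimal sketch, taking only the 600-frame figure from the conversation and treating everything else as illustrative:

```python
from collections import deque

# Minimal sketch of the rolling ten-minute record: a bounded deque holds
# the most recent 600 (frame, audio) pairs, silently evicting the oldest
# as new ones arrive. Only the 600-frame cap comes from the conversation.

MAX_FRAMES = 600  # ~10 minutes at 1 frame per second

buffer = deque(maxlen=MAX_FRAMES)
for t in range(700):  # 700 seconds of video: the first 100 frames fall off
    buffer.append((f"frame_{t}", f"audio_{t}"))

print(len(buffer))   # 600: capped at ten minutes of context
print(buffer[0][0])  # frame_100: the oldest surviving frame
```

The bound is what ties the memory window to hardware: a bigger cap means more fast active memory on the chip, which is exactly the limit described.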
那么目前它实际上就像是台录像机,确切记录着过去十分钟内发生的一切。
But so at the moment then it is effectively acting like a video recorder as it were, keeping an actual record of everything that's happened in the previous ten minutes.
是的,它相当活跃,能够立即使用这些信息。还有一个次要系统,当你关闭系统时,它会将对话进行总结并提取出相关事实。
Yeah. It's quite active. It's able to use that information right away. There's also a sort of secondary system: when you turn the system off, it will then take that conversation, summarize it, and pull out relevant facts.
最重要的部分?
The most important bits?
是的,它会自行判断哪些是重要的。
Yeah. And it uses its own discretion to figure out what that is.
就像是提取对话的精髓,对吧?但就目前而言,它能记住最近互动中的重要事情吗?
To extract the gist of it, as it were. Yeah. But at the moment, I mean, one thing that it can do is recall important things from recent interactions, can't it?
是的,它有一种双流记忆机制。一方面是关于你个人的记忆,它逐渐了解你,就像在做笔记一样,比如‘哦,他们喜欢冰淇淋’。
Yeah. It's kind of got a two stream memory. So it's got a memory that is about you as a person. It's got a kind of developing understanding of you. It's effectively taking notes, like, oh, they like ice cream.
比如巧克力冰淇淋之类的,你懂吧?这些会形成一个关于你的发现列表,而且每次会话后都会更新。假设你说,‘你知道吗?’
Like chocolate ice cream or whatever. You know? That'll be, like, a kind of list of things it's discovered about you. And that's actually updated after every session too. So suppose you say, you know what?
‘我其实决定不再喜欢冰淇淋了,我现在很喜欢蛋糕,所以忘掉我喜欢冰淇淋这件事吧。’然后它会记录,‘用户表示不再喜欢冰淇淋,改为喜欢蛋糕。’这些信息就像是关于你是谁、你喜欢什么的静态理解或偏好。此外,还有对话摘要,比如‘周二08:50,我们讨论了这盘国际象棋。’
I've actually decided that I don't like ice cream anymore. I really like cake, so forget that I liked ice cream. It will then note: the user says they no longer like ice cream, and they like cake. Those things are kind of a stationary or static understanding of who you are, or what you like, your preferences. Then there's also this kind of conversational summary, like: Tuesday at 08:50, we talked about this game of chess.
但它如何决定哪些信息该存入哪个部分呢?比如,它如何判断什么信息重要到值得作为关于你的记忆点来记住?
But then how does it decide which bit goes in which? Like, how does it decide what's important enough to to be a thing about you that it remembers?
所以它采用了启发式规则。这些系统实际上被赋予了启发式规则。启发式本质上是一种关于该记住什么的经验法则。它使用的一个启发式规则是——我们也这样告诉它——如果你要求它记住某件事,它就应该记住。这是相当明确的一条。
So it's got heuristics. These systems actually are given heuristics. A heuristic is basically a rule of thumb for what to remember. So one heuristic it uses, and we've told it to, is: if you ask it to remember something, it should definitely remember that. That's a pretty clear one.
所以,你知道,如果我说,记住我的门禁密码,它就会照做。因为它会理解这是一个相关的指令。否则,它会做出最佳猜测。你知道,它有点像在判断:用户是否表达过任何有趣的偏好,或者与之前表达过的偏好不同的偏好?然后它会据此进行更新。
So, you know, if I say, remember my door code, it will do that, because it will understand that's an instruction of relevance. Otherwise, it takes a best guess. It's sort of asking, has the user expressed any preferences that are interesting, or that are different from the ones the user has already expressed? And then it'll kind of update based on that.
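The two-stream memory and the remember-request heuristic described above might be sketched like this. The class, its method names, and the string matching are my guesses for illustration, not Astra's implementation:

```python
# Sketch of a two-stream memory: stable facts about the user in one
# stream, per-session conversation summaries in the other. The only
# heuristic encoded is the one named explicitly in the conversation:
# an explicit "remember ..." request is always stored.

class TwoStreamMemory:
    def __init__(self):
        self.user_facts = []         # e.g. "they like ice cream"
        self.session_summaries = []  # e.g. "Tuesday 08:50: chess"

    def note_utterance(self, utterance: str):
        # Heuristic: honor explicit remember-requests unconditionally.
        if utterance.lower().startswith("remember "):
            self.user_facts.append(utterance[len("remember "):])

    def end_session(self, summary: str):
        self.session_summaries.append(summary)

mem = TwoStreamMemory()
mem.note_utterance("remember my door code is 4321")  # hypothetical example
mem.note_utterance("what a nice day")  # no heuristic fires, nothing stored
mem.end_session("Tuesday at 08:50: talked about a game of chess")

print(mem.user_facts)              # ['my door code is 4321']
print(len(mem.session_summaries))  # 1
```

A real system would of course use the model's own judgment, not a string prefix, for the "best guess" cases; the prefix check just makes the hard rule concrete.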
那么,我们来谈谈这里的一些隐私担忧吧。你们如何缓解这些隐私方面的顾虑?
Well, let's talk about some of the privacy concerns here then. How do you mitigate against some of those privacy concerns?
没错。我认为一个主要标准是同意原则。用户可以访问他们先前被记录的数据,并且可以删除或查看存储的内容。每次你删除某些内容时,它都会重新构建关于你的全部知识。哦。
Right. So I think one of the major standards is that of consent. The users have access to their previously recorded data and they can delete it or see what is stored. Every time you delete something, it reconstitutes its whole knowledge of you. Oh.
它会重新经历一遍总结它所知道的关于你的事情的整个过程。
It goes through the whole process of summarizing things it knows about you anew.
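The delete-then-reconstitute behavior described here, where nothing is patched in place and the profile is rebuilt from whatever sessions remain, can be sketched as follows; `summarize` and the class are toy stand-ins for the real summarization machinery:

```python
# Sketch of delete-then-reconstitute: deleting a session discards it,
# then the whole user profile is rebuilt by re-summarizing every
# remaining session anew, rather than editing the profile in place.

def summarize(sessions):
    """Toy stand-in summarizer: join what's left into one profile string."""
    return " | ".join(sessions)

class ConsentAwareStore:
    def __init__(self):
        self.sessions = []
        self.profile = ""

    def add_session(self, transcript: str):
        self.sessions.append(transcript)
        self.profile = summarize(self.sessions)

    def delete_session(self, index: int):
        del self.sessions[index]
        # Reconstitute the entire profile from scratch.
        self.profile = summarize(self.sessions)

store = ConsentAwareStore()
store.add_session("likes ice cream")
store.add_session("prefers cake now")
store.delete_session(0)
print(store.profile)  # prefers cake now; no trace of the deleted session
```

Rebuilding from scratch is the property that matters for consent: a deleted session cannot leave residue in the derived profile.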
所以我想,答案就是用户最终对它所了解的关于自己的内容拥有一定的控制权。是的。但实际上,在这个播客前几期,我们采访了DeepMind的伦理学家Iason Gabriel,他非常出色。他向我们讲述了AI助手的伦理问题,以及应该如何塑造它们以考虑这些难题。他的工作对你们开发Astra产生了多大影响?
So the answer then, I guess, is that the user ends up having some control over what it knows about them. Yeah. But actually, on this podcast a few episodes ago, we got to talk to Iason Gabriel, who's this ethicist at DeepMind, and he's amazing. He was telling us about the ethics of AI assistants and how they should be shaped to take into account lots of these difficult questions. How much has his work fed into what you've come up with for Astra?
我们刚把他那份243页的报告输入Astra,Astra说,好的。我明白了。
We just fed his 243 page report into Astra, and Astra said, okay. I got it.
你明白了吗?
Did you?
不。我
No. I
倒希望如此,我倒希望。
wish, I wish.
是的。我想我们和Iason谈了很多,也和他所在的团队做了很多工作,他们一直在研究模型和整个智能体,探索它在不同情况下可能会做什么。我们还与一些外部红队合作,他们可能先入为主的观念较少,可能会对系统尝试更多不同类型的对抗性攻击。我们还有一层安全过滤器。我的意思是,这是为了防止用户受到伤害;例如,如果你对它说某些话或给它看色情内容,就会触发这些过滤器,它不会回应。
Yeah. I think we've spoken a lot with Iason, and we've done a lot of work with a team that he's part of. They've been investigating both the model and the agent as a whole, exploring what it might do in different circumstances. We're also working with some external red teamers, who maybe have fewer preconceptions and might try more different kinds of adversarial attacks on the system. We also have a layer of safety filters. I mean, this is for user harms; for example, if you say certain things to it, or show it pornography, it will trigger these filters and not respond to that.
它也会在自己的发言中触发,所以它不能说某些话,尽管这些触发非常罕见。但我不知道。是的。我认为问题的范围相当广泛。幸运的是,我们还有一些时间来解决问题。
It'll also trigger on its own speech, so it can't say certain things, although those trigger very infrequently anyway. But I don't know. Yeah. I think the range of issues is quite broad. Fortunately, we still have some time to figure stuff out.
好的。那么接下来几个月你们的主要优先事项是什么?你们将要重点开展哪些工作?
Okay. So what what then are your next priorities then over the next few months? What were the main things you're gonna be working on?
我对一种叫做主动式视频的工作非常感兴趣。也就是说,这个系统不仅能在你说话时回应,还能持续地为你提供帮助。例如,这是为盲人做视觉解读这一问题的一部分。你四处走动时看不见,它会提醒你,比如,注意那边的桌子。它能持续地引导你。
I'm very interested in something called proactive video work. That is to say, a system that can not only respond when you speak, but can also help you in an ongoing sense. So, for example, that's part of the problem of visual interpretation for blind users. You're walking around, you can't see, and it will say, you know, oh, watch out for the table over there. It can guide you in an ongoing sense.
我们还在做很多关于音频输出的工作,也就是所谓的全双工。这样它就能同时处理你听到的和说出的内容,虽然这可能会有点烦人,可能会打断你,但也让对话更自然。就像你说话时,我可能会说‘嗯,嗯’,这就是在同时听和说。
We're also doing a lot of work on more audio output, what's called full duplex. So it will hear and speak at the same time, which could potentially be annoying, it could interrupt you, but it also makes for a more natural conversation. As you're talking, I might say, uh-huh, uh-huh, you know, and that's listening and talking at the same time.
这也是语言的一部分。然后,正如你所说,还有推理方面的更多工作,更深层的记忆,某些类型的反思,以及调用工具来进行更深入的查询和研究。是的,有太多事情可以做得更好。
That's part of language. Then, as you said, more on reasoning, deeper kinds of memory, certain kinds of reflection, and calling tools to be able to do deeper inquiries and research. Yeah, there are just so many things to do better.
非常感谢你加入我们,Greg。
Well, thank you very much for joining us, Greg.
谢谢你,汉娜。
Thank you, Hannah.
奇怪的是,我们对AI的期望变化得如此之快。不知道你是否记得Oriol在上期节目中的话。他说,如果五年前有人告诉他这些可能实现的事情,他会认为我们已经走上了通往AGI的道路。然而,现在我们有了这个多模态代理的原型,它能看、能听、有记忆、上下文理解、推理能力,还能进行多语言实时对话。
It is strange how quickly our expectations change about AI. I don't know if you remember what Oriol said in our last episode. He said if someone had told him five years ago the things that would be possible, he would think that we were already on the path to AGI. And yet, here we have this prototype of a multimodal agent. It's one that can see, that can hear, that has memory and context and reasoning and multilingual real time conversation.
这个代理至少在理论上可以陪伴你的日常生活,增强你的知识,支持残障人士,并提升我们的技能。当然,它还不是AGI,但感觉我们相比两年前讨论的那些系统,确实迈出了一大步。非常感谢你加入我们这期Google DeepMind播客系列。我们将在此稍作休息。
This is an agent that could, at least in theory, accompany you in your day to day, enhancing your knowledge, supporting people with disabilities, and augmenting our skills. Now, of course, AGI it isn't. But it definitely feels like we've taken a significant leap from the kinds of systems we were talking about even two years ago. Thank you so much for joining us for this series of Google DeepMind: The Podcast. We are gonna take a break from here.
但如果你想回顾我们之前的任何节目,我们的往期目录中有大量令人愉悦的极客式AI对话内容供你欣赏。只需在YouTube或你获取播客的任何地方找到它们即可。
But if you want to catch up on any of our previous episodes, then there is a whole array of deliciously nerdy AI conversational delights in our back catalog for you to enjoy. Just find them on YouTube or wherever you get your podcasts.
关于 Bayt 播客
Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。