本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
为什么生成式AI已经吸收了全世界的知识,却仍无法自主做出科学发现?
Why has generative AI ingested all the world's knowledge but not been able to come up with scientific discoveries of its own?
它是否终于开始理解物理世界了?
And is it finally starting to understand the physical world?
我们将与Meta首席AI科学家、图灵奖得主杨立昆(Yann LeCun)一起探讨这个问题。
We'll discuss it with Meta Chief AI Scientist and Turing Award winner, Yann LeCun.
杨立昆。
LeCun.
财政负责、金融天才、理财魔术师——这些都是人们对那些转投Progressive车险并省下数百美元的车主的评价。
Fiscally responsible, financial geniuses, monetary magicians: these are things people say about drivers who switch their car insurance to Progressive and save hundreds.
因为Progressive为全额付款、拥有住房等提供多种折扣。
Because Progressive offers discounts for paying in full, owning a home, and more.
此外,您可以信赖他们优质的客户服务,在需要时提供帮助,让您的每一分钱都物有所值。
Plus, you can count on their great customer service to help when you need it, so your dollar goes a long way.
访问progressive.com,看看您是否能在车险上省钱。
Visit progressive.com to see if you could save on car insurance.
Progressive意外保险公司及其附属公司,实际节省金额会有所不同,并非在所有州或情况下都适用。
Progressive Casualty Insurance Company and affiliates, potential savings will vary, not available in all states or situations.
我是迈克尔·刘易斯。
Michael Lewis here.
我的畅销书《大空头》讲述了2008年美国房地产市场泡沫形成与破裂的故事。
My bestselling book, The Big Short, tells the story of the buildup and burst of the US housing market bubble back in 2008.
十年前,《大空头》被改编成奥斯卡获奖电影,现在首次以有声书形式呈现,由尤鲁斯·特鲁里倾情演绎。
A decade ago, The Big Short was made into an Academy Award winning movie, and now I'm bringing it to you for the first time as an audiobook narrated by Eurus Truly.
《大空头》的故事——做空市场的含义,以及谁真正为失控的金融体系买单——其现实意义至今丝毫未减。
The Big Short story, what it means to bet against the market, and who really pays for an unchecked financial system is as relevant today as it's ever been.
立即在pushkin.fm/audiobooks或任何有声书销售平台获取《大空头》。
Get The Big Short now at pushkin.fm/audiobooks or wherever audiobooks are sold.
欢迎收听《大科技》播客,这是一档以冷静视角探讨科技界及其他领域微妙话题的节目。
Welcome to Big Technology Podcast, a show for cool headed, nuanced conversation of the tech world and beyond.
我是亚历克斯·坎特罗维茨,非常荣幸邀请到Meta首席AI科学家、图灵奖得主、被誉为AI教父的杨立昆做客《大科技》播客。
I'm Alex Kantrowitz, and I'm thrilled to welcome Yann LeCun, the chief AI scientist at Meta, Turing Award winner, and a man known as the godfather of AI, to Big Technology Podcast.
Yann,很高兴再次见到你。
Yann, great to see you again.
欢迎来到节目。
Welcome to the show.
很荣幸能来到这里。
Pleasure to be here.
让我们从一个关于科学发现的问题开始,为什么AI直到现在都没能实现这一点。
Let's start with a question about scientific discovery and why AI has not been able to come up with it until this point.
这个问题来自Dwarkesh Patel。
This is coming from Dwarkesh Patel.
他几个月前提过这个问题。
He asked it a couple months ago.
你怎么看待这个事实:生成式AI基本上已经记住了人类知识的全部内容,却至今未能做出任何能导致新发现的新关联?
What do you make of the fact that AIs, generative AI, basically have the entire corpus of human knowledge memorized, and they haven't been able to make a single new connection that has led to discovery?
而即便是一个智力中等的人记住这么多内容,也会注意到:哦,这个因素导致这种症状,那个因素导致那种症状,或许这里存在一种医疗解决方案。
Whereas if even a moderately intelligent person had this much stuff memorized, they would notice, oh, this thing causes this symptom, this other thing causes this symptom, there might be a medical cure here.
那么我们不应该对AI抱有这种期待吗?
So shouldn't we be expecting that type of stuff from AI?
嗯,从AI的角度来说,是的。
Well, from AI, yes.
但从大语言模型的角度来说,不是。
From large language models, no.
要知道,AI架构有多种类型。
You know, there's several types of AI architectures.
对吧?
Right?
突然之间,当我们谈论AI时,我们想到的都是聊天机器人。
And all of a sudden, when we talk about AI, we imagine chatbots.
聊天机器人、大语言模型是基于海量纯文本知识训练的。
Chatbots, LLMs, are trained on an enormous amount of knowledge, which is purely text.
它们的训练本质上是复述、检索,最终生成符合其训练文本统计规律的答案。
And they're trained to basically regurgitate, to retrieve, to essentially produce answers that conform to the statistics of whatever text they've been trained on.
你能用它们做到的事情令人惊叹。
And it's amazing what you can do with them.
它们非常有用。
It's very useful.
这一点毋庸置疑。
There's no question about it.
我们也知道它们可能会编造不真实的事实,但在最纯粹的形式下,它们无法发明新事物。
We also know that they can hallucinate facts that aren't true, but really, in their purest form, they are incapable of inventing new things.
让我分享一下Hugging Face的Thomas Wolf过去一周在LinkedIn上提出的观点。
Let me throw out this perspective that Thomas Wolf from Hugging Face shared on LinkedIn over the past week.
我知道你参与了相关讨论。
I know you were involved in the discussion about it.
这非常有趣。
It's very interesting.
他说要在数据中心创造出一个爱因斯坦,我们需要的不仅是一个知道所有答案的系统,而是一个能提出别人未曾想到或敢于提出的问题的系统。
He says to create an Einstein in a data center, we don't just need a system that knows all the answers, but rather one that can ask questions nobody else has thought or dared to ask.
一个能写出‘如果所有人都错了会怎样?’的系统,尽管所有教科书、专家和常识都指向相反的方向。
One that writes, "What if everyone is wrong about this?" when all textbooks, experts, and common knowledge suggest otherwise.
有没有可能教会大语言模型做到这一点?
Is it possible to teach an LLM to do that?
不可能。
No.
不可能。
No.
在目前的形式下还不行。
Not in the current form.
我是说,任何能够做到这一点的AI形式都不会是大语言模型。
I mean, whatever form of AI is able to do that will not be LLMs.
它们可能会将LLM作为其中一个组件使用。
They might use LLMs as one component.
LLM在生成文本方面非常有用。
LLMs are useful to produce text.
因此,在未来的人工智能系统中,我们可能会利用它们将抽象思维转化为语言。
So in the future AI systems, we might use them to turn abstract thoughts into language.
在人类大脑中,这是由一个名为布洛卡区的微小脑区完成的,就在这里。
In the human brain, that's done by a tiny little brain area right here called Broca's area.
它大约有这么大。
It's about this big.
那就是我们的语言处理模块。
That's our LLM.
明白吗?
Okay?
但我们并非用语言思考。
But we don't think in language.
我们通过情境的心理表征进行思考。
We think in mental representations of a situation.
我们对所有思考对象都拥有心理模型。
We have mental models of everything we think about.
即使无法言语,我们也能思考,而这正是发生在这里。
We can think even if we can't speak, and that takes place here.
那里就像是真正智能所在之处。
That's like where real intelligence is.
而这正是我们尚未通过LLM(大型语言模型)复现的部分。
And that's the part that we certainly haven't reproduced with LLMs.
所以问题在于,我们最终能否开发出不仅能够回答已有问题,还能针对我们提出的问题给出全新解决方案的AI架构和系统?
So the question is, are we going to have eventually AI architectures, AI systems that are capable of not just answering questions that are already there, but solving giving new solutions to problems that we specify?
答案是肯定的,最终会实现,但并非依靠现有的大语言模型。
The answer is yes, eventually, but not with current LLMs.
那么接下来的问题是,它们是否能够提出自己的问题,比如判断哪些是值得回答的好问题?
And then the next question is, are they going to be able to ask their own questions, like figure out what are the good questions to answer?
而答案最终是肯定的。
And the answer is eventually yes.
所以在我们研发出具备这种能力的机器之前,还需要相当长的时间。
So that's going to take a while before we get machines that are capable of this.
就像人类,我们拥有所有这些特质。
Like, in humans, we have all the characteristics.
有些人拥有极佳的记忆力。
We have people who have extremely good memory.
他们能回忆起大量信息。
They can retrieve a lot of things.
他们积累了丰富的知识。
They have a lot of accumulated knowledge.
我们也有善于解决问题的人,对吧?
We have people who are problem solvers, right?
给他们一个问题,他们就能解决。
You give them a problem, they'll solve it.
我记得托马斯实际上就讨论过这类事情。
And I think Thomas was actually talking about this kind of stuff.
他说过,如果你在学校表现优异,那你就是个优秀的解题高手。
He said, like, if you're good at school, you're a good problem solver.
我们给你一个问题。
We give you a problem.
你能解决它。
You can solve it.
然后你在数学或物理或其他科目上取得好成绩。
And you score well in math or physics or whatever it is.
但在研究中,最困难的事情其实是提出正确的问题。
But then in research, the most difficult thing is to actually ask the good questions.
哪些是重要的问题?
What are the important questions?
不仅仅是解决问题。
It's not just solving the problem.
还包括提出正确的问题,以恰当的方式构建问题框架,从而获得新的见解。
It's also asking the right questions, kind of framing a problem, you know, in the right way, so that you have kind of a new insight.
然后接下来的步骤是,好吧,我需要将其转化为方程式或某种实用的模型。
And then after that comes, okay, I need to turn this into equations or into something practical, a model.
而这可能是与提出正确问题不同的技能。
And that may be a different skill from the one that asked the right questions.
解方程也可能是另一项不同的技能。
It might be a different skill also to solve equations.
编写方程的人不一定是解方程的人。
The people who write the equations are not necessarily the people who solve them.
总有人记得一百年前的某本教科书里解决过类似的方程。
Or the people who remember that there is some textbook from one hundred years ago where similar equations were solved.
这是三种不同的技能。
Those are three different skills.
所以LLM在检索方面确实很出色。
So LLMs are really good at retrieval.
它们不擅长解决新问题,你知道的,为新的问题寻找新的解决方案。
They're not good at solving new problems, you know, finding new solutions to new problems.
它们可以检索现有解决方案,但在提出正确问题方面显然完全不擅长。
They can retrieve existing solutions, and they're certainly not good at all at asking the right questions.
对于初次接触的听众来说,LLMs(大型语言模型)是诸如ChatGPT中内置的GPT模型背后的技术。但让我问问你,Yann。
And for those tuning in and learning about this for the first time: LLMs are the technology behind things like the GPT models baked into ChatGPT. But let me ask you this, Yann.
AI领域似乎已从标准LLMs转向了能够逐步推理的模型。我很好奇,能否通过给推理模型植入质疑指令的方式,来编程这种反直觉或异端思维?
So the AI field does seem to have moved from standard LLMs to LLMs that can reason and go step by step. And I'm curious, can you program this sort of counterintuitive or heretical thinking by imbuing a reasoning model with an instruction to question its directives?
嗯,我们首先需要明确'推理'的真正含义。
Well, so we have to figure out what reasoning really mean.
明白吗?
Okay?
显然,目前所有人都在尝试让LLMs具备某种程度的推理能力,或许是为了验证它们生成的答案是否正确。
And, you know, obviously, everyone is trying to get LLMs to reason to some extent, to perhaps be able to check whether the answers they produce are correct.
当前人们解决这个问题的方式,基本上是在不彻底改变现有范式的前提下进行改良。
And the way people are approaching the problem at the moment is that they basically are trying to do this by modifying the current paradigm without completely changing it.
懂了吗?
Okay?
那么能否在LLM之上附加一些功能模块,使其具备某种初级的推理能力?
So can you bolt a couple of components on top of an LLM so that you kind of have some primitive reasoning function?
这本质上就是许多推理系统正在做的事情。
And that's essentially what a lot of reasoning systems are doing.
让大语言模型看似具备推理能力的一个简单方法就是思维链,对吧?
One simple way of getting LLMs to kind of appear to reason is chain of thought, right?
你基本上就是让它们生成比实际需要更多的标记,希望在这个过程中它们会投入更多计算资源来回答你的问题。
So you basically tell them to generate more tokens than they really need to in the hope that in the process of generating those tokens, they're going to devote more computation to answering your question.
在某种程度上,这种方法出人意料地有效,但局限性很大。
And to some extent, that works surprisingly well, but it's very limited.
你实际上无法从中获得真正的推理。
You don't actually get real reasoning out of this.
推理,至少在古典AI和许多领域中,涉及在潜在解决方案空间中的搜索。
Reasoning, at least in classical AI and in many domains, involves a search through a space of potential solutions.
所以你有一个待解决的问题。
So you have a problem to solve.
你可以判断问题是否解决。
You can characterize whether the problem is solved or not.
所以你有一些方法来判断问题是否已解决。
So you have some way of telling whether the problem is solved.
然后你在解决方案空间中搜索,直到找到满足约束条件或被确认为解决方案的方案。
And then you search through a space of solutions for one that actually satisfies the constraints or is identified as being a solution.
这基本上就是你能想象到的最通用的推理形式。
And that's kind of the most general form of reasoning you can imagine.
在大型语言模型中,完全不存在这种搜索机制。
There is no mechanism at all in LLMs for this kind of search.
你需要做的就是在它之上附加这个功能。
You have to kind of bolt this on top of them.
对吧?
Right?
所以一种方法是让大语言模型生成大量、大量、大量的答案序列。
So one way to do this is you get an LLM to produce lots and lots and lots of sequences of answers.
对吧?
Right?
这些标记序列,你知道的,代表着答案。
Sequences of tokens, which, you know, represent answers.
然后你需要一个独立的系统来筛选出哪个是好的。
And then you have a separate system that picks which one is good.
明白吗?
Okay?
这有点像写程序时随机生成指令——可能还稍微遵循语言语法——然后挨个检查哪些程序能实际运行。
This is a bit like writing a program by more or less randomly generating instructions, while maybe respecting the grammar of the language, and then checking all of those programs for one that actually works.
这不是产生正确代码的好方法,效率非常低。
Not a good way, not a very efficient way of producing correct pieces of code.
这也不是好的推理方式。
It's not a good way of reasoning either.
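上述"生成大量答案再筛选"的流程,可以用一个极简的 Python 玩具示例来示意(纯属说明:用随机算术表达式代替 LLM 输出,所有函数名均为假设):
The generate-and-test loop described above can be sketched with a tiny Python toy (purely illustrative: random arithmetic expressions stand in for LLM outputs, and all function names are hypothetical):

```python
import random

def generate_candidate(rng):
    """Blindly sample a token sequence that respects the 'grammar'
    (digit op digit op digit) -- like sampling answers from an LLM."""
    digits, ops = "123456789", "+-*"
    return (rng.choice(digits) + rng.choice(ops) +
            rng.choice(digits) + rng.choice(ops) + rng.choice(digits))

def is_solution(expr, target):
    """A separate verifier that checks a candidate against the constraint."""
    return eval(expr) == target

def generate_and_test(target, budget=10_000, seed=0):
    """Produce lots and lots of candidates; return the first one that the
    verifier accepts, plus how many samples it took."""
    rng = random.Random(seed)
    for i in range(1, budget + 1):
        cand = generate_candidate(rng)
        if is_solution(cand, target):
            return cand, i
    return None, budget

expr, tries = generate_and_test(24)
print(expr, tries)
```

Even on this toy search space, the loop typically burns through many samples before stumbling on a valid answer, which is exactly the inefficiency being described.
这正是文中所说的低效之处:即便搜索空间如此之小,也往往要采样很多次才碰到一个通过验证的答案。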
所以这里存在一个大问题:人类或动物进行推理时,并不是在标记空间里操作的。
So a big issue there is that when humans or animals reason, we don't do it in token space.
换句话说,我们推理时不需要先生成一个表达解决方案的文本,然后再生成另一个,接着再生成一个。
In other words, when we reason, we don't have to generate a text that expresses our solution and then generate another one and then generate another one.
然后从这些方案中,我们选出好的那个。
And then among those, we pick the one that is good.
我们是在内部进行推理的,对吧?
We reason internally, right?
我们有一个情境的心理模型,并在头脑中对其进行操作。
We have a mental model of the situation, and we manipulate it in our head.
当我们规划一系列动作时,比如要造一张桌子什么的,我们会找到一个不错的解决方案。
And we find kind of a good solution when we plan a sequence of actions to, I don't know, build a table or something.
我们规划这一系列动作。
We plan the sequence of action.
我们在脑海中有一个相关的心理模型。
We have a mental model of that in our head.
这与语言毫无关系。
And this has nothing to do with language.
如果我告诉你,想象现在有一个立方体漂浮在我们面前。
So if I tell you, imagine a cube floating in front of us right now.
现在将这个立方体沿垂直轴旋转90度。
Now rotate that cube 90 degrees along a vertical axis.
你可以想象这个过程正在发生。
You can imagine this thing taking place.
而且你可以轻易地辨认出这是个立方体。
And you can readily observe that it's a cube.
如果我旋转它90度,它看起来会和最初的立方体一模一样,因为你脑海中有一个立方体的心理模型。
If I rotate it 90 degrees, it's going to look just like the cube that I started with because you have this mental model of a cube.
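这个"心理旋转"的例子可以用几行 Python 直接验证:绕竖直轴旋转90度后,立方体的顶点集合与原来完全重合(仅作示意):
This mental-rotation example can be checked directly in a few lines of Python: after a 90-degree turn about the vertical axis, the cube's vertex set coincides exactly with the original (just an illustration):

```python
import itertools

def rotate90_vertical(p):
    """Rotate a point 90 degrees about the vertical (z) axis:
    (x, y, z) -> (-y, x, z)."""
    x, y, z = p
    return (-y, x, z)

# The eight corners of a cube centered at the origin.
cube = set(itertools.product((-1, 1), repeat=3))
rotated = {rotate90_vertical(p) for p in cube}

print(rotated == cube)  # the rotated cube occupies exactly the same corners
```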
这种推理是在某种抽象的连续空间中进行的。
And that reasoning is in some abstract continuous space.
它不是通过文字进行的。
It's not in text.
它与语言或类似的东西无关。
It's not related to language or anything like that.
人类一直在做这样的事情。
And humans do this all the time.
动物也经常这样做。
Animals do this all the time.
而这正是我们目前还无法用机器复现的。
And this is what we cannot yet reproduce with machines.
是啊。
Yeah.
这让我想起你在谈论思维链,以及它为何未能产生太多新颖见解。
It reminds me of you talking through chain of thought and how it doesn't produce many novel insights.
当DeepSeek问世时,流传最广的截图之一就是有人向DeepSeek询问关于人类处境的新颖见解。当你阅读它的回答时,会发现这又是AI耍的一个聪明把戏——它似乎确实列举了各种关于人类的有趣观察,比如我们如何将仇恨和暴力倾向转化为合作而非竞争,从而促进社会发展。
And when DeepSeek came out, one of the big screenshots that was going around was someone asking DeepSeek for a novel insight on the human condition. And as you read it, it's another one of these very clever tricks the AI pulls, because it does seem like it's running through all these very interesting observations about humans: how we take our violent side and channel it towards cooperation instead of competition, and that helps us build more.
但随着你阅读整个思维链,你会觉得这就像是读了《人类简史》和其他几本书的拼凑产物。
And then as you read the chain of thought, you're like, this is kind of just: you read Sapiens and maybe some other books, and that's your chain of thought.
差不多是这样。
Pretty much.
没错。
Yeah.
我是说,没错,很多内容都是重复输出的。
I mean, yeah, a lot of it is regurgitation.
我现在要把后面讨论的一个部分提前来说,实际上就是训练标准大型语言模型即将碰壁的问题。之前如果你投入一定数量的数据和计算资源来训练这些模型,它们的性能提升是可以预测的。
I'm now gonna move up a part of the conversation I had planned for later, which is: is training standard large language models effectively coming close to hitting a wall? Before, there were somewhat predictable returns: if you put a certain amount of data and a certain amount of compute towards training these models, you could make them predictably better.
在我们讨论的过程中,我感觉你认为这种情况最终将不再成立。
As we're talking, it seems to me like you believe that that is eventually not gonna be true.
嗯,我不确定是否该称之为'碰壁',但回报递减是肯定的。要知道,我们已经快耗尽用于训练这些LLM的自然文本数据了——它们目前已经用掉了大约10^13到10^14个token的训练量。
Well, I don't know if I would call it a wall, but it's certainly diminishing returns, in the sense that, you know, we've kind of run out of natural text data to train those LLMs, which are already trained with, you know, on the order of 10 to the 13 or 10 to the 14 tokens.
那可真是不少。
That's a lot.
确实很多。
That's a lot.
而这就像是
And that's like the
整个互联网。
whole Internet.
那是公开可用的互联网内容。
That's the publicly available Internet.
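可以用一笔粗略估算来感受 10 的 14 次方个 token 的规模(以下阅读速度等数字均为本文之外的假设值):
A rough back-of-envelope calculation gives a feel for the scale of 10^14 tokens (the reading-speed figures below are assumptions, not from the episode):

```python
# All rates below are rough assumptions for illustration only.
TOKENS = 1e14           # upper end of the training-set size mentioned
WORDS_PER_TOKEN = 0.75  # common rule of thumb for English text
WPM = 250               # fast adult reading speed, words per minute
HOURS_PER_DAY = 8
YEARS = 80

words_total = TOKENS * WORDS_PER_TOKEN
words_per_lifetime = WPM * 60 * HOURS_PER_DAY * 365 * YEARS
lifetimes = words_total / words_per_lifetime

print(f"~{lifetimes:,.0f} reading lifetimes")
```

On these assumptions, reading the corpus once would take on the order of twenty thousand human reading lifetimes.
按这些假设,通读一遍这个语料库大约需要两万个人类的阅读人生。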
然后,有些公司会授权使用非公开的内容。
And then, you know, some companies license content that is not publicly available.
还有讨论说要生成人工数据,然后雇佣成千上万的人来创造更多数据。
And then there are talks about, like, you know, generating artificial data and then hiring thousands of people to, you know, generate more data.
知识,那些博士和教授们。
Knowledge, the PhDs and professors.
是啊。
Yeah.
但实际上,问题可能更简单,因为大多数系统甚至不理解基本逻辑。
But in fact, it could be even simpler than this because most of the systems actually don't understand basic logic, for example.
对吧?
Right?
所以在某种程度上,通过合成数据和雇佣更多人填补系统知识背景的漏洞,这方面进展会比较缓慢。
So to some extent, there's going to be slow progress along those lines, with synthetic data, with hiring more people to plug the holes in the knowledge background of those systems.
但这存在收益递减的问题,对吧?
But it's diminishing return, right?
生成这些数据的成本正在急剧膨胀,而回报却不那么显著。
The costs are ballooning of generating that data, and the returns are not that great.
所以我们需要一个新的范式。
So we need a new paradigm.
我们需要一种新型系统架构,其核心能够进行搜索——寻找优质解决方案、验证方案的有效性、规划达成特定目标的行为序列,这正是实现真正自主智能系统所需的能力。
We need a new kind of architecture of systems that at their core are capable of search: searching for a good solution, checking whether that solution is good, planning a sequence of actions to arrive at a particular goal, which is what you would need for an agentic system to really work.
每个人都在谈论代理系统。
Everybody is talking about agentic systems.
除了基本上复述系统已经训练过的内容外,没人知道如何构建它们。
Nobody has any idea how to build them, other than basically regurgitating plans that the system has already been trained on.
明白吗?
Okay?
这就像计算机科学中的一切一样。
So it's like everything in computer science.
你可以在人工智能的有限范围内设计解决方案。
You can engineer a solution, which is limited in the context of AI.
你可以建立一个基于学习或海量数据检索的系统。
You can make a system that is based on learning or retrieval with enormous amounts of data.
但真正的复杂性在于如何构建一个能解决新问题而无需针对这些问题进行训练的系统。
But really, the complex thing is how you build a system that can solve new problems without being trained to solve those problems.
我们具备这种能力。
We are capable of doing this.
动物也具备这种能力,能够应对新情况。
Animals are capable of doing this, facing a new situation.
我们可以未经训练就首次遇到时即时解决问题(零样本学习),或者极快地学会解决它。
We can either solve it zero-shot, without training ourselves to handle that situation, the first time we encounter it, or we can learn to solve it extremely quickly.
例如,我们通过几十小时的练习就能学会驾驶。
So for example, we can learn to drive in a couple dozen hours of practice.
以至于经过二三十小时后,驾驶就变得近乎本能,成为一种潜意识行为。
And to the point that after twenty, thirty hours, it becomes kind of second nature, kind of subconscious.
我们不会
We don't
甚至都不用想。
even think about it.
你甚至都不用想。
You don't even think about it.
你可以
You can
说到系统一和系统二。
Speaking of system one, system two.
对吧?
Right?
没错。
That's right.
所以,你知道的,这让我想起几年前我们和丹尼尔·卡尼曼的那场讨论。
So, you know, this recalls the discussion we had with Danny Kahneman a few years ago.
所以,当你第一次开车时,你的系统二完全处于激活状态。
So, the first time you drive, your system two is all present.
你必须用它来预想各种灾难场景之类的事情,对吧?
You have to use it to imagine all kinds of catastrophe scenarios and stuff like that, right?
你的全部注意力都集中在驾驶上。
Your full attention is devoted to driving.
但经过几个小时后,你就可以边开车边和人交谈了。
But then after a number of hours, you can talk to someone at the same time.
比如,你根本不需要思考它。
Like, you don't need to think about it.
这已经变成某种潜意识且或多或少自动化的行为。
It's become sort of subconscious and more or less automatic.
它已经变成系统一了。
It's become system one.
几乎每一项我们首次学习并完成的任务,都需要动用我们心智的全部力量。
And pretty much every task that we, you know, accomplish for the first time, we have to use the full power of our minds.
然后最终,如果我们重复足够多次,它们就会变得有点潜意识化。
And then, eventually, if we repeat them sufficiently many times, they get kind of subconscious.
我清晰地记得有一次在一个研讨会上,有位参与者是国际象棋大师,他同时与大约50人对弈,从一个对手转到另一个。
I have this vivid memory of once being in a workshop where one of the participants was a chess grandmaster, and he played a simultaneous game against, like, 50 of us, right, you know, going from one person to another.
你知道,我在10步内就被将死了。
You know, I got wiped out in 10 turns.
那就是我棋艺太差了。
That's I'm terrible at chess.
对吧?
Right?
但他会来到我的棋桌前。
So he would come, you know, to my table.
我有时间思考,因为他同时在与其他大约50个对手对弈。
You know, I had time to think about this because he, you know, he was playing the other 50 tables or something.
所以我就在棋盘前走了一步棋。
So I make my move in front of it.
他反应是:什么?
He goes like, what?
然后立刻下了一步棋
And then immediately plays.
所以他根本不需要思考
So he doesn't have to think about it.
我这个对手太弱了,他甚至不需要调用他的系统二
I was not a challenging enough opponent that he had to actually call on his system two.
他的系统一就足够打败我了
His system one was sufficient to beat me.
这说明当你熟悉某项任务并经过训练后,它会变得近乎潜意识
And what that tells you is that when you become familiar with the task and you train yourself, it kind of becomes subconscious.
但人类和许多动物的核心能力在于,面对新情况时能够思考,规划出一系列行动步骤来实现目标
But the essential ability of humans and many animals is that when you face a new situation, you can think about it, figure out a sequence of actions, a course of action, to accomplish a goal.
除了对世界运行的基本常识外,你甚至不需要了解太多具体情况
And you don't need to know much about the situation other than your common knowledge of how the world works, basically.
这就是我们在AI系统中所缺失的。
That's what we're missing with AI systems, okay?
好吧,现在我必须打乱顺序了,因为你提到了一些非常有趣的话题,我们需要深入讨论。
Okay, now I really have to blow up the order here, because you've said some very interesting things that we have to talk about.
你谈到大型语言模型(LLM)基本上已经触及收益递减的临界点,正是这些模型把我们带到了今天的位置,而现在我们需要新的范式。但在我看来,这个新范式似乎尚未出现——我知道你正在为此进行研究,我们稍后会探讨下一个新范式可能是什么——但这里存在一个现实的时间线问题,你不觉得吗?考虑到已经投入的巨额资金...
You talked about how basically LLMs, large language models, the things that have gotten us here, have hit the point of diminishing returns, and we need a new paradigm. But it also seems to me that that new paradigm isn't here yet. I know you're working on the research for it, and we're going to talk about what the next new paradigm might be. But there's a real timeline issue, don't you think? Because I'm just thinking about the money that's been raised and put into this.
去年OpenAI获得了66亿美元融资,就在上周(或几周前),Anthropic又获得了35亿美元投资——而他们去年已经融资了40亿美元。
Last year, $6.6 billion to OpenAI. Last week, or a couple weeks ago, another $3.5 billion to Anthropic, after they raised $4 billion last year.
埃隆·马斯克正在为打造Grok投入另一笔巨额资金。
Elon Musk is putting another small fortune into building Grok.
这些都是LLM优先的企业,他们并没有探索...或许OpenAI除外,但他们获得的66亿美元投资完全是因为ChatGPT的成功。
These are all LLM-first companies. They're not searching out the next paradigm. I mean, maybe OpenAI is, but that $6.6 billion that they got was because of ChatGPT.
那么这个领域将何去何从?
So where's this field gonna go?
如果这些资金都投入到一个已经触及收益递减、需要新范式才能突破的领域,这听起来确实是个严峻的问题。
Because if that money is being invested into something that is at the point of diminishing returns, requiring a new paradigm to progress, that sounds like a real problem.
嗯,我是说,对这个新范式有一些想法。
Well, I mean, I have some ideas about what this paradigm is.
困难在于,我是说,我们正在努力让它发挥作用。
The difficulty is, I mean, what we're working on is trying to make it work.
而且,你知道,这并不简单,可能需要数年时间。
And, you know, it's not simple. It may take years.
所以问题是,我们正在思考、正在研究的这种新范式所带来的能力,是否能够及时实现以证明所有这些投资的合理性?
And so the question is, are the capabilities we're talking about perhaps through this new paradigm that we're thinking of, that we're working on, is it going to come quickly enough to justify all of this investment?
如果新范式未能及时出现,这些投资还能被证明是合理的吗?
And if it doesn't come quickly enough, is the investment still justified?
好的。
Okay.
首先可以明确的是,仅通过扩大大型语言模型的规模,我们无法实现人类水平的人工智能。
So the first thing you can say is we are not going to get to human level AI by just scaling up LLMs.
这根本不可能实现。
This is just not going to happen.
明白吗?
Okay?
这是你的观点。
That's your perspective.
绝无可能。
There's no way.
好吧。
Okay.
绝对不可能。
Absolutely no way.
无论你从我那些更激进的同事那里听到什么,未来两年内都不可能实现。
And whatever you hear from some of my more adventurous colleagues, it's not going to happen within the next two years.
说真的,绝对没门——请原谅我的粗话。
There's absolutely no way in hell, you know, pardon my French.
你知道,那种认为我们会在数据中心里培养出一个天才国度的想法,纯属胡扯。
You know, the idea that we're gonna have a country of geniuses in a data center, that's complete BS.
对吧?
Right?
绝对不可能。
There's absolutely no way.
我们可能拥有的是那些经过海量数据训练的系统,任何理性人士提出的问题都能通过这些系统找到答案。
What we're gonna have maybe is systems that are trained on sufficiently large amounts of data that any question that any reasonable person may ask will find an answer through those systems.
这会让你感觉像是有一位博士坐在你身边,但实际上你身边并没有博士。
And it would feel like you have a PhD sitting next to you, but it's not a PhD you have next to you.
这是一个拥有庞大记忆和检索能力的系统,而非能针对新问题发明解决方案的系统——后者才是博士的真正价值。
It's a system with a gigantic memory and retrieval ability, not a system that can invent solutions to new problems, which is really what a PhD is.
明白吗?
Okay?
实际上
This is actually
这就是全部关键所在。
It's the whole deal.
这与托马斯·沃尔夫提出的观点相关,即发明新事物需要一种大语言模型无法提供的技能和能力。
It's connected to this post that Thomas Wolf made, that inventing new things requires a type of skill and abilities that you're not going to get from LLMs.
所以这个重大问题在于,当前进行的投资并非为了明天。
So there's this big question, which is: the investment that is being made now is not made for tomorrow.
而是为了未来几年。
It's done for the next few years.
而大部分投资,至少从Meta的角度来看,是用于推理基础设施的投资。
And most of the investment, at least from the Meta side, is investment in infrastructure for inference.
好的。
Okay.
那么让我们设想一下,到今年年底——这实际上是Meta的计划——通过智能眼镜、独立应用等各种方式,让Meta AI拥有10亿用户。
So let's imagine that by the end of the year, which is really the plan at Meta, we have a billion users of Meta AI, through smart glasses, you know, a standalone app, and whatever.
你需要为这些人提供服务,而这意味着巨大的计算量。
You gotta serve those people, and that's a lot of computation.
这就是为什么需要大量基础设施投资来实现规模化,并需要数月甚至数年的时间来逐步构建。
So that's why you need, you know, a lot of investment in infrastructure to be able to scale this up and, you know, build it up over months or years.
所以你看,这就是大部分资金流向的地方,至少对于微软、谷歌和可能还有亚马逊这类公司来说是这样。
And so, you know, that's where most of the money is going, at least on the side of companies like Microsoft and Google and potentially Amazon.
那么这本质上就是运营层面的问题了。
So this is just operations, essentially.
现在的问题是:即使没有范式转变,还会有十亿人定期使用这些产品吗?
Now, is there going to be a market of a billion people using those things regularly, even if there is no change of paradigm?
答案很可能是肯定的。
And the answer is probably yes.
因此即使新范式的革命在未来三年内没有到来,这套基础设施仍将被使用。
So even if the revolution of the new paradigm doesn't come within three years, this infrastructure is going to be used.
这一点几乎毋庸置疑。
There's very little question about that.
明白吗?
Okay?
所以这是个很好的投资,而且建立数据中心等配套设施需要很长时间,必须现在就开始准备并规划持续发展,最终这项投资会被证明是合理的。
So it's a good investment, and it takes so long to set up, you know, data centers and all that stuff, that you need to get started now and plan for progress to be continuous, so that eventually the investment is justified.
但你不得不这么做,对吧?因为如果你有资金却不投入,风险实在太大了。
But you can't afford not to do it, right, because there would be too much of a risk to take if you have the cash.
但让我们回到你刚才说的。
But let's go back to what you said.
如今的技术仍有很大缺陷,而且关于它是否会被使用一直存在疑问。
The stuff today is still deeply flawed, and there have been questions about whether it's going to be used.
现在Meta正在押注消费者市场,对吧?
Now Meta is making this consumer bet, right?
消费者想要使用AI。
The consumers want to use the AI.
这很合理。
That makes sense.
OpenAI的ChatGPT拥有4亿用户。
OpenAI has 400 million users of ChatGPT.
Meta有三四十亿用户。
Meta has three, four billion.
我是说,基本上,如果你有一部手机的话,
I mean, basically, if you have a phone...
你们有三十多亿用户,六亿Meta AI用户。
You have three point something billion users, 600 million users of Meta AI.
对。
Right.
好的。
Okay.
所以比ChatGPT还多。
So more than ChatGPT.
是的。
Yeah.
但他们使用频率不如ChatGPT高,所以用户活跃度没那么强。
But it's not used as much as ChatGPT, so the users are not as intense.
不够活跃。
As active.
基本上,Meta能够触及十亿消费者用户的想法
Basically, the idea that Meta can get to a billion consumer users...
是啊。
Yeah.
这看起来是合理的。
That seems reasonable.
但问题在于,很多投资都是基于这将对企业而不仅仅是消费应用有用的理念,而存在一个问题,就像我们一直在讨论的,它目前还不够好。
But the thing is, a lot of this investment has been made with the idea that this will be useful to enterprises, not just a consumer app. And there's a problem, because like we've been talking about, it's not good enough yet.
你看看深度研究。
You look at deep research.
这是Benedict Evans刚刚提出的观点。
This is something Benedict Evans just brought up.
深度研究相当不错,但可能只能帮你完成95%的工作,剩下的5%可能会出现幻觉错误。
Deep research is pretty good but it might only get you 95% of the way there and maybe 5% of it hallucinates.
所以如果你有一份100页的研究报告,其中5%是错误的,而你不知道是哪5%,这就是个问题。同样地,如今所有企业都在试图弄清楚如何让AI(尤其是生成式AI和其他类型的AI)对他们有用,但只有10%到20%的概念验证能真正投入生产,因为要么成本太高,要么容易出错。
So if you have a 100-page research report and 5% of it is wrong, and you don't know which 5%, that's a problem. And similarly, in enterprises today, every enterprise is trying to figure out how to make AI useful to them, generative AI and other types of AI, but only maybe 10% or 20% of proofs of concept make it out the door into production, because it's either too expensive or it's fallible.
那么,如果我们即将达到顶峰,你预计在预期技术会进一步改善的情况下,所有已投入的资源将会发生什么变化?
So if we are getting to the top here, what do you anticipate is gonna happen with everything that's been pushed in the anticipation that it is gonna get even better from here?
嗯,这又回到了时间线的问题上。
Well, so, again, it's a question of time line.
对吧?
Right?
这些系统何时才能变得足够可靠和智能,从而让部署变得更加容易?
When are those systems going to become sufficiently reliable and intelligent that deployment is made easier?
但是,你知道,我的意思是,你描述的这种状况——在那些令人印象深刻的演示之外,实际部署可靠系统的地方,往往就是计算机技术和特别是AI应用容易出问题的地方。
But, you know, I mean, this situation you're describing, that beyond the impressive demos, actually deploying systems that are reliable is where things tend to falter, in the use of computers and technologies and particularly AI.
这并不新鲜。
This is not new.
这基本上就是为什么我们十年前就有超级震撼的自动驾驶演示,但至今仍未能实现五级自动驾驶汽车的原因。
It's basically, you know, why we had super impressive autonomous driving demos ten years ago, but we still don't have level five self-driving cars.
对吧?
Right?
可以说,对汽车而言,真正的难点在于最后一英里。
It's the last mile that's really difficult, so to speak, for cars.
看出你的双关了。
See what you did there.
最后那几个不是故意的。
The last few, that was not deliberate.
最后几个百分点的可靠性才能使系统实用,以及如何将其与现有系统整合等等,如何让使用者效率更高,如果你愿意这么说,或者更可靠之类的。
The last few percent of reliability, which makes a system practical, and how you integrate it with existing systems, and how it makes its users more efficient, if you want, or more reliable, or whatever.
这才是难点所在。
That's where it's difficult.
这就是为什么如果我们回顾几年前IBM沃森的情况,明白吗?
And this is why, if you go back several years and look at what happened with IBM Watson, okay?
沃森本应成为IBM大力推广的产品,通过让它学习医学知识并部署到每家医院来创造巨额收入。
So Watson was going to be the thing that IBM was going to push and generate tons of revenue by having Watson learn about medicine and then be deployed in every hospital.
这基本上是一次彻底的失败,最后被拆解出售,对吧?这让IBM损失惨重,包括CEO也付出了代价。
And it was basically a complete failure and was sold for parts, right, and cost a lot of money to IBM, including the CEO.
实际情况是,要在系统可靠、真正帮助人们、又不与劳动力天然的保守性相冲突的情形下实际部署这些系统,这才是事情变得复杂的地方。
And what happens is that actually deploying those systems in situations where they are reliable, actually help people, and don't run up against the natural conservatism of the labor force, this is where things become complicated.
我们现在看到的也是同样的情况。
We're seeing the same.
当前我们在部署AI系统时遇到的困难并非新鲜事。
The process we're seeing now with the difficulty of deploying AI system is not new.
这种情况在任何时代都绝对发生过。
It's happened absolutely at all times.
这也是为什么你们的一些听众可能年纪太小不记得,但在1980年代,特别是80年代初,曾掀起过一波围绕专家系统的人工智能热潮。
This is also why some of your listeners perhaps are too young to remember this, but there was a big wave of interest in AI in the 1980s, early 1980s, around expert systems.
1980年代最热门的工作本该是知识工程师,你的工作就是坐在专家身边,将专家的知识转化为规则和事实,然后输入推理引擎,使其能够推导出新事实、回答问题等等。
The hottest job in the 1980s was going to be a knowledge engineer, and your job was going to be to sit next to an expert and then turn the knowledge of the expert into rules and facts that would then be fed to an inference engine that would be able to derive new facts and answer questions and blah, blah, blah.
当时掀起了巨大的热潮。
Big wave of interest.
日本政府启动了一个名为'第五代计算机'的大型项目。
The Japanese government started a big program called Fifth Generation Computer.
硬件本应被设计来处理这些任务,诸如此类。
The hardware was going to be designed to actually take care of that and blah blah blah.
你知道,基本上大部分都失败了。
You know, mostly a failure.
这股热潮大约在90年代中期就逐渐消退了。
That wave of interest kind of died out in the mid-'90s.
少数公司取得了成功,但基本上仅限于那些能够将人类知识简化为一系列规则且经济上可行的特定应用领域。
And a few companies were successful, but basically for a narrow set of applications for which you could actually reduce human knowledge to a bunch of rules and for which it was economically feasible to do so.
但对整个社会和产业的广泛影响并未实现。
But the wide ranging impact on all of society and industry was just not there.
这就是AI一直存在的风险。
And so that's a danger of AI all the time.
我的意思是,信号很明确——即便抛开其他因素不谈,这些花哨的功能至少在信息检索方面确实扮演着重要角色。
I mean, the signals are clear that these systems, with all the bells and whistles, still play an important role, if nothing else for information retrieval.
你知道,大多数公司都希望拥有一些内部专家,他们熟悉所有内部文件,这样任何员工都可以提出任何问题。
You know, most companies want to have some sort of internal experts that know all the internal documents so that any employee can ask any question.
我们在Meta就有一个这样的系统。
We have one at Meta.
它叫MetaMate。
It's called MetaMate.
真的很酷。
It's really cool.
非常实用。
It's very useful.
是啊。
Yeah.
我并不是说现代AI或现代生成式AI没有用处。
And I'm not suggesting that modern AI, or modern generative AI, is not useful.
我只是单纯在问,已经投入了大量资金期望这些东西能真正达到神级能力。
I'm purely asking: there's been a lot of money invested expecting this stuff to effectively achieve God-level capabilities.
我们都在讨论这里可能存在收益递减的问题。
And we both are talking about how, like, there's, you know, potentially diminishing returns here.
然后如果出现你提到的那种时间线不匹配的情况会怎样。
And then what happens if there's that timeline mismatch, like you mentioned.
是的。
Yeah.
这是我关于这个问题的最后一个问题了,因为我们还有很多其他内容要讨论,但我觉得时间线不匹配可能对你个人有特殊意义。
And this is the last question I'll ask about it, because I feel like we have so much else to cover, but I feel like timeline mismatches might be personal to you.
你和我九年前第一次交谈,现在想来真是疯狂。九年前,早期你对AI应该如何构建就有想法,却连参加会议的席位都争取不到。后来随着算力的提升,这些想法开始奏效,整个AI领域便基于你与Bengio和Hinton共同研究的理念腾飞而起。
You and I first spoke nine years ago, which is crazy now. Back in the early days, you had an idea for how AI should be structured and you couldn't even get a seat at the conferences. Then eventually, when the right amount of compute came around, those ideas started working, and the entire AI field took off based on the ideas you worked on with Bengio and Hinton.
还有一群其他人。
And a bunch of others.
以及许多其他人,但为了效率起见,我们会说去查查资料吧。
And many others, but for the sake of efficiency, we'll just say: go look it up.
就是在讨论那些时间线错配的情况,当AI领域出现过度炒作时刻,比如你刚提到的专家系统,它们并未如人们预期那样发展,AI领域就会进入所谓的'AI寒冬'。
Just talking about those mismatched timelines: when there have been overhyped moments in the AI field, maybe with the expert systems you were just talking about, and they don't pan out the way people expect, the AI field goes into what's called an AI winter.
嗯,会有反弹的。
Well there's a backlash.
是的。
Yep.
没错。
Correct.
那么,如果我们正面临这种时间线错配的时刻,考虑到当前的投资规模、主要训练方法可能出现的收益递减,再加上股市似乎正在经历低迷,你是否担心现在可能会出现又一次的寒冬期?
And so, if we are potentially approaching this moment of mismatched timelines, do you fear that there could be another winter now, given the amount of investment, given that there are potentially diminishing returns with the main way of training these things, and maybe we'll add in the fact that the stock market looks like it's going through a bit of a downturn right now?
这可能是我们讨论中第三重要的变量,但必须纳入考量。
Now that's a variable probably the third most important variable of what we're talking about, but it has to factor.
所以,是的,我认为时机问题确实存在,但如果我们更深入一点,就像我之前说的,如果你认为仅靠训练更多数据和扩大LLM规模就能达到人类水平的AI,那你就错了。
So, yeah, I think there's certainly a question of timing there, but if we try to dig a little bit deeper, as I said before, if you think that we're going to get to human-level AI by just training on more data and scaling up LLMs, you're making a mistake.
所以如果你是投资者,投资了一家告诉你'我们仅通过训练更多数据和一些小技巧就能达到人类水平和博士水平AI的公司。
So if you're an investor and you invested in a company that told you, "We're going to get to human-level and PhD-level AI by just training on more data and with a few tricks."
我不知道你会不会血本无归,但那可能不是个好主意。
I don't know if you're going to lose your shirt, but that was probably not a good idea.
然而,关于如何推进并构建能够实现所有智能动物和人类能力(而当前我们的系统尚不具备这些能力)的系统,已有一些构想。
However, there are ideas about how to go forward and have systems that are capable of doing what every intelligent animal and human are capable of doing and that currently our systems are not capable of doing.
我指的是理解物理世界、拥有持久记忆以及具备推理和规划的能力。
And I'm talking about understanding the physical world, having persistent memory, and being able to reason and plan.
这四个特性是必须具备的。
Those are the four characteristics that need to be there.
这就要求系统能够获取常识,能够从视频等自然传感器中学习,而不仅仅是依赖人类产生的文本数据。
And that require systems that can acquire common sense, that can learn from natural sensors like video as opposed to just text, just human produced data.
这是一个巨大的挑战。
And that's a big challenge.
事实上,我已经就此谈论多年,指出这正是挑战所在。
I mean, I've been talking about this for many years now and saying this is where the challenge is.
这是我们必须解决的问题。
This is what we have to figure out.
我和我的团队,以及与我合作的人和一些听取了我建议的人,正沿着这条路线取得进展:例如,开发能通过视频学习理解世界如何运作的系统,以及能利用物理世界的心智模型来规划一系列行动、从而达成特定目标的系统。
And my group and I, along with people working with me and others who have listened to me, are making progress along this line: systems that can be trained to understand how the world works from video, for example, and systems that can use mental models of how the physical world works to plan sequences of actions to arrive at a particular goal.
所以我们已经有这类系统的初步成果。
So we have kind of early results of these kind of systems.
DeepMind和其他一些大学的研究人员也在从事类似的工作。
And there are people at DeepMind working on similar things, and there are people in various universities working on this.
问题在于,何时这些研究才能从展示新架构能力的论文,发展为可大规模应用、能自主解决新问题的实用架构。
So the question is, when is this going to go from interesting research papers demonstrating a new capability with a new architecture, to architectures at scale that are practical for a lot of applications and can find solutions to new problems without being trained to do it, etcetera?
这不会在未来三年内实现,但可能在三年到五年之间达成。
And it's not going to happen within the next three years, but it may happen between three to five years, something like that.
这与我们看到的投资增长趋势大致吻合。
And that kind of corresponds to the sort of ramp up that we see in investment.
这是第一点。
So that's the first thing.
第二点重要的是,不会存在某个公司或团队能发明一个神奇的万能解决方案。
Now the second thing that's important is that there's not going to be one secret magic bullet that one company or one group of people is going to invent that is going to just solve the problem.
这将需要多种思路、大量努力,以及某些可能不被所有人认同的基本原则,其中一些方向最终可能会走入死胡同。
It's going to be a lot of different ideas, a lot of effort, some principles around which to base this that some people may not subscribe to and will go in a direction that will turn out to be a dead end.
所以不会出现一个分水岭,在那之前没有通用人工智能,之后突然就有了。
So there's not going to be, like, a day before which there is no AGI and after which we we have AGI.
这不会是一个标志性事件。
This is not going to be an event.
这将是一个持续的概念演进过程,随着时间的推移,规模会越来越大,性能会越来越好。
It's going to be continuous conceptual ideas that, as time goes by, are going to be made bigger, scaled up, and made to work better.
而且它不会来自某个单一实体。
And it's not gonna come from a single entity.
它将来自全球范围内的整个研究社区。
It's gonna come from the entire research community across the world.
那些分享研究成果的团队会比封闭研究的团队发展得更快。
And the people who share their research are gonna move faster than the ones that don't.
所以如果你认为某个五人初创公司掌握了通用人工智能的秘密,并打算投资50亿美元给他们,那你就犯了大错。
And so if you think that there is some start up somewhere with five people who has discovered the secret of AGI and you should invest 5,000,000,000 in them, you're making a huge mistake.
你知道,Yann,首先我一直很享受我们的对话,因为总能得到一些真知灼见。我记得上次谈话后,我经常回想那次对话,想着"Yann是这样说的"。
You know, Yann, first of all, I always enjoy our conversations because we start to get some real answers. I remember, even after our last conversation, I was always looking back to that conversation saying, this is what Yann says.
这就是其他人都在说的。
This is what everybody else is saying.
我很确定这是基础点,这一直是正确的,我知道我们这次也会这样做。
I'm pretty sure that this is the grounding point, and that's been correct, and I know we're gonna do that with this one as well.
现在你为我铺垫了两个有趣的话题线索,我们将在接下来的对话中展开讨论。
And now you've set me up for two interesting threads that we're gonna pull out as we go on with our conversation.
第一是对物理学和现实世界的理解,第二是开源。
First is the understanding of physics and the real world, and the second is open source.
所以我们稍后回来再继续这个话题。
So we'll do that when we come back right after this.
你想吃得更好,但既没时间也没精力去实现。
You wanna eat better, but you have zero time and zero energy to make it happen.
Factor不需要你备餐或照着食谱做。
Factor doesn't ask you to meal prep or follow recipes.
它直接解决了整个问题。
It just removes the entire problem.
两分钟,真材实料,轻松搞定。
Two minutes, real food, done.
还记得你想做健康餐却点了披萨那次吗?
Remember that time where you wanted to cook healthy but ordered pizza?
你并不是健康饮食的失败者。
You're not failing at healthy eating.
你只是没法每天多挤出三小时而已。
You're failing at having three extra hours every night.
Factor的餐点由厨师现做、营养师设计,并直送到你家门口。
Factor is already made by chefs, designed by dietitians, and delivered to your door.
你只需加热两分钟就能享用。
You heat it for two minutes and eat.
内含优质蛋白、多彩蔬菜、全天然食材和健康脂肪——这些本该是你有时间就会自己做的。
Inside, there are lean proteins, colorful vegetables, whole food ingredients, healthy fats, the stuff you'd make if you had the time.
立即访问factormeals.com/bigtech50off,并使用优惠码bigtech50off,首单立享5折外加全年免费早餐。
Head to factormeals.com/bigtech50off and use code bigtech50off to get 50% off your first Factor box plus free breakfast for one year.
该优惠仅适用于使用该代码并购买符合条件的自动续订订阅的新Factor客户。
The offer is only valid for new factor customers with the code and qualifying auto renewing subscription purchase.
通过Factor让健康饮食变得更简单。
Make healthier eating easy with Factor.
现在我们再次欢迎Yann LeCun。
And we're back here with Yann LeCun.
他是Meta的首席AI科学家,也是图灵奖得主,我们非常荣幸能第三次邀请到他在我们的节目中。
He is the chief AI scientist at Meta and a Turing Award winner, and we're thrilled to have him on our show, luckily, for the third time.
我想和你聊聊物理学,Yann,因为在科技播客历史上有个著名时刻——当然这个'著名'是对于我们听众而言的。
I wanna talk to you about physics, Yann, because there's sort of this famous moment in Big Technology Podcast history, and I say famous for our listeners.
我不确定这个梗是否出圈了,但你当时让我问ChatGPT:如果我双手水平拿着一张纸,然后松开左手,会发生什么?
I don't know if it really extended beyond, but you had me write to ChatGPT: if I hold a paper horizontally with both hands and let go of the paper with my left hand, what will happen?
我输入问题后,它煞有介事地写出一堆物理原理,说纸张会向左飘。我念答案时深信不疑,你却指出AI在胡编而我居然信了。
And I write it, and it convincingly lays out the physics and says the paper will float towards your left hand. And I read it out loud, convinced, and you're like, that thing just hallucinated and you believed it.
事实确实如此。
That is what happened.
所以听着,已经过去两年了。
So listen it's been two years.
我今天又测试了ChatGPT。
I put the test to ChatGPT today.
它说当你用左手松开纸张时,重力会使纸张左侧下坠,而仍被右手握住的右侧保持原位,这会产生一个支点效应,纸张会围绕右手握持点旋转。
It says: when you let go of the paper with your left hand, gravity will cause the left side of the paper to drop, while the right side, still held by your right hand, remains in place. This creates a pivot effect where the paper rotates around the point where your right hand is holding it.
所以现在它答对了。
So now it gets it right.
它吸取了教训。
It learned the lesson.
你知道吗?
You know?
很有可能这是OpenAI雇人专门解决的问题,他们输入了这个问题和答案,系统通过微调掌握了正确答案。
It's quite possible that, you know, someone hired by OpenAI to solve the problem fed it that question and sort of fed it the answer, and the system was fine-tuned with the answer.
我是说,显然你能想象这类问题有无限多种。
I mean, obviously, you can imagine an infinite number of such questions.
这就是所谓的LLM后训练变得昂贵之处——你需要覆盖多少这类问题,才能使系统基本涵盖人们可能提出的90%或95%的问题。
And this is where the so-called post-training of LLMs becomes expensive: how much coverage of all those types of questions do you have to provide for the system to basically cover 90% or 95% or whatever percentage of all the questions that people may ask it?
但你知道,这里存在长尾效应,你无法训练系统回答所有可能的问题,因为本质上问题是无限的。
But, you know, there's a long tail, and there is no way you can train the system to answer all possible questions, because there is essentially an infinite number of them.
而且系统无法回答的问题远比它能回答的问题多得多。
And there are way more questions the system cannot answer than questions it can answer.
你无法在训练集中覆盖所有可能的问题。
You cannot cover the set of all possible questions in the training set.
对。
Right.
因为我想起我们上次谈话时你说过,像纸张这种物理行为如果没有被广泛记录在文本中,模型就不知道如何处理——除非文本中有相关描述,否则模型不会具备那种对现实世界的内在理解。
So, because I think in our conversation last time, you said that because these actions, like what happens with the paper if you let go of it with your hand, have not been covered widely in text, the model won't really know how to handle it. Unless it's been covered in text, the model won't have that inherent understanding of the real world.
我一度接受这个观点,但后来想,不如试试生成一些AI视频吧。
And I've kind of gone with that for a while, then I said, you know what, let's let's try to generate some AI videos.
我在AI视频中发现一个有趣现象:它们对物理世界运作方式存在某种理解。这让我想起九年前我们第一次见面时你说的——最难解决的问题之一就是问AI‘如果你把笔垂直放在桌上然后松手会发生什么?’
And one of the interesting things that I've seen with the AI videos is that there is some understanding of how the physical world works there. In our first meeting nine years ago, you said one of the hardest things to do is to ask an AI: what happens if you hold a pen vertically on a table and let go?
它会掉下来吗?
Will it fall?
这其中存在着难以置信的大量可能性变化,AI要理解这些极其困难,因为它本质上并不理解物理规律。
And there's like an unbelievable amount of permutations that can occur and it's very very difficult for the AI to figure that out because it just doesn't inherently understand physics.
但现在你看看像Sora这样的系统,你只要说‘给我看一个人坐在椅子上踢腿的视频’,就能得到这样的视频。
But now you go to something like Sora and you say, show me a video of a man sitting on a chair kicking his legs, and you can get that video.
视频里的人会坐在椅子上踢腿,腿部关节会正常弯曲,而不会出现腿从关节处脱落之类的情况。
And the person sits on the chair, and they kick their legs, and the legs, you know, don't fall out of their sockets or anything; they bend at the joints.
而且它们没有三条腿。
And they don't have three legs.
那么这是否表明这些大型模型的能力有所提升?
So wouldn't that suggest an improvement in the capabilities of these large models?
不。
No.
为什么?
Why?
因为那些视频生成系统制作的视频里,你倒一杯酒时,酒会悬在空中、飞走或直接消失等等。
Because you still have those videos produced by those video generation systems where, you know, you spill a glass of wine and the wine, like, floats in the air or flies off or disappears or whatever.
当然,针对每个具体场景,你总能收集更多相关数据来训练模型处理它。
So of course, for every specific situation, you can always collect more data for that situation and then train your model to handle it.
但这并非真正理解底层现实。
But that's not really understanding the underlying reality.
这只是用越来越庞大的数据量来弥补理解力的不足。
This is just compensating the lack of understanding by increasingly large amounts of data.
孩子们仅需极少数据就能理解重力这类简单概念。
Children understand simple concepts like gravity with a surprisingly small amount of data.
实际上可以做个有趣的计算,我之前也提到过。
So, in fact, there is an interesting calculation you can do, which I've talked about previously.
以典型的大语言模型为例,训练数据达30万亿token量级,约3×10¹³个token。
But if you take a typical LLM, trained on something like 30 trillion tokens, right, 3 times 10 to the 13 tokens.
每个token大约三个字节,所以大约是0.9×10¹⁴字节。
A token is about three bytes, so that's 0.9 times 10 to the 14 bytes.
我们四舍五入一下,就说是10¹⁴字节吧。
Let's say 10 to the 14 bytes, to round it up.
这些文本内容,我们任何人可能都需要大约四十万年的时间才能读完。
That text would take any of us probably on the order of four hundred thousand years to read.
对吧?
Right?
没问题。
No problem.
每天读十二个小时的话。
At twelve hours a day.
明白吗?
Okay?
而一个四岁孩子醒着的时间总共才16,000小时。
Now, a four-year-old has been awake a total of 16,000 hours.
你可以乘以3600得到秒数,然后就能计算出通过视神经传递到视觉皮层的数据量。
You can multiply by 3,600 to give number of seconds, and then you can put a number on how much data has gotten to your visual cortex through the optic nerve.
每条视神经(我们共有两条)每秒大约传输1兆字节的数据。
Each optic nerve, we have two of them, carries about one megabyte per second roughly.
所以是每秒2兆字节乘以3600再乘以16000。
So it's two megabytes per second times 3,600 times 16,000.
这大约就是10的14次方字节。
And that's just about 10 to the 14 bytes.
因此四年间,一个孩子通过视觉或触觉接收的数据量,已经相当于最大的语言模型系统。
So in four years, a child has seen, through vision or touch for that matter, as much data as the biggest LLMs.
这清楚地表明,仅靠文本训练是无法达到人类水平的人工智能的。
And it tells you clearly that we're not going to get to human-level AI by just training on text.
文本作为信息来源的丰富程度远远不够。
It's just not a rich enough source of information.
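The back-of-the-envelope comparison can be checked directly. The token and optic-nerve figures come from the conversation; the reading-speed assumptions (250 words per minute, about 6 bytes per word) are mine, so the reading-time figure is only order-of-magnitude:

```python
# Check the data-volume comparison: LLM training text vs. a child's vision.
tokens = 3e13                     # ~30 trillion training tokens
bytes_per_token = 3
llm_bytes = tokens * bytes_per_token             # ~0.9e14 bytes of text

hours_awake = 16_000              # a four-year-old's total waking hours
optic_bytes_per_s = 2e6           # ~1 MB/s per optic nerve, two nerves
child_bytes = optic_bytes_per_s * 3600 * hours_awake   # ~1.15e14 bytes

print(f"LLM text:     {llm_bytes:.2e} bytes")
print(f"Child vision: {child_bytes:.2e} bytes")

# Rough reading time at 12 h/day (250 wpm, ~6 bytes/word: my assumptions)
words = llm_bytes / 6
years = words / (250 * 60 * 12 * 365)
print(f"Reading time: roughly {years:,.0f} years")  # hundreds of thousands
```

Both totals land around 10 to the 14 bytes, which is the point of the comparison.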
顺便说,16000小时的视频量其实并不算多。
And by the way, sixteen thousand hours is not that much video.
相当于30分钟的YouTube上传内容。
It's thirty minutes of YouTube uploads.
明白吗?
Okay?
嗯哼。
Mhmm.
我们可以相当轻松地获取这些数据。
We can get that pretty easily.
在九个月的时间里,婴儿已经接收了大约10的13次方字节的数据,这其实也不算多。
Now, in nine months, a baby has seen, you know, let's say, 10 to the 13 bytes or something, which again is not much.
在这段时间里,婴儿基本上掌握了我们所知的所有直觉物理知识——动量守恒、重力、物体不会凭空消失、即使被藏起来也依然存在等概念。
And in that time, the baby has learned basically all of the intuitive physics that we know about: conservation of momentum, gravity, the fact that objects don't spontaneously disappear, the fact that they still exist even if you hide them.
我的意思是,我们在生命最初几个月里就学会了关于世界的各种基础认知。
I mean, there's all kinds of stuff, very basic stuff that we learn about the world in the first few months of life.
这正是我们需要用机器复现的学习方式——理解世界中哪些是可能和不可能的,预判行为后果,从而规划行动序列以实现特定目标。
And this is what we need to reproduce with machine, this type of learning of figuring out what is possible and impossible in the world, what will result from an action you take so that you can plan a sequence of actions to arrive at a particular goal.
这就是世界模型的概念。
That's the idea of world model.
现在联系到你关于视频生成系统的问题,训练越来越好的视频生成系统是解决这个问题的正确方法吗?
And now connected with your question about video generation systems, is the right way to approach this problem to train better and better video generation systems?
我对这个问题的回答是绝对不行。
And my answer to this is absolutely no.
理解世界的问题并不需要通过像素级视频生成的解决方案。
The problem of understanding the world does not go through the solution to generating video at the pixel level.
明白吗?
Okay?
我不需要知道,如果我拿起这杯水并打翻它,水会沿着怎样的路径流过桌面、会形成什么形状、会发出什么声音等等细节。
I don't need to know, if I take this cup of water and I spill it, the exact path the water will follow on the table, what shape it's going to take, and all that stuff, what noise it's going to make.
但在某个抽象层面上,我可以预测水会洒出来。
But at a certain level of abstraction, I can make a prediction that the water will spill.
明白吗?
Okay?
而且可能会弄湿我的手机和其他东西。
And it'll probably make my phone wet and everything.
所以我无法预测所有细节,但我可以在某种抽象层面上进行预测。
So I can't predict all the details, but I can predict at some level of abstraction.
我认为这是一个非常关键的概念:如果一个系统要学习理解世界并掌握其运作方式,它需要能够学习世界的抽象表征,从而做出这些预测。
And I think that's really a critical concept: if you want a system to be able to learn to comprehend the world and understand how the world works, it needs to be able to learn an abstract representation of the world that allows you to make those predictions.
这意味着这些架构将不具备生成能力。
And what that means is that those architectures will not be generative.
对。
Right.
稍后我想讨论你的解决方案,但首先——我们之间的对话怎么能没有演示呢?
And I wanna get to your solution here in a moment, but I just wanted to also like, what would a conversation between us be without a demo?
我想展示一下,做视频时我会把这个放上屏幕——这是段让我相当自豪的视频。
So I wanna just show you I'm gonna put this on the screen when we do the video but there's this is a video I was pretty proud of.
视频里有个家伙坐在椅子上踢腿,腿还连在身上,我当时就想:好吧,这技术真的在进步。然后我又说,能不能生成一辆车冲进干草堆的视频?结果画面里是两捆干草,然后一个干草堆从一辆静止汽车的引擎盖里凭空冒了出来。我就对自己说:好吧。
I got this guy sitting on a chair kicking his legs out, and the legs stay attached to his body, and I was like, all right, this stuff is making real progress. And then I said, can I get a car going into a haystack? And so there are two bales of hay, and then a haystack magically emerges from the hood of a stationary car. And I just said to myself, okay.
Yann又赢了。
Yann wins again.
不过,这辆车确实不错。
It's a nice car, though.
是的.
Yeah.
我的意思是,关键在于这些系统已经用大量人类相关的数据进行了微调,因为你知道,这正是人们要求生成的内容。
I mean, the thing is, those systems have been fine-tuned with a huge amount of data about humans because, you know, that's what people are asking for.
人们要求生成的大多数视频都是这类内容。
That's what most of the videos they ask it to make are about.
因此有大量人类进行各种活动的数据来训练这些系统。
So there is a lot of data of humans doing various things to train those systems.
这就是为什么它对人类有效,但对于训练该系统的人未曾预料到的情况则无效。
So that's why it works for humans, but not for a situation that the people training the system had not anticipated.
所以你说模型要理解现实世界就不能是生成式的。
So you said that the model can't be generative to be able to understand the real world.
没错。
That's right.
你正在研究一个叫V-JEPA的东西。
You are working on something called V-JEPA.
是JEPA。
JEPA.
JEPA。
JEPA.
对吧?
Right?
V代表视频。
The V is for video.
你们还有用于图像的I-JEPA。
You also have I-JEPA for images.
对。
Right.
我们有适用于各种场景的JEPA。
We have JEPAs for all kinds of stuff.
文本也是。
Text also.
还有文本。
And text.
那么请解释这将如何解决让机器能够抽象表示现实世界中发生的事情的问题。
So explain how that will solve the problem of being able to allow a machine to abstractly represent what is going on in the real world.
好的。
Okay.
近年来AI的成功,尤其是自然语言理解和聊天机器人,在某种程度上也包括计算机视觉,其关键在于自监督学习。
So what has made the success of AI and particularly natural language understanding and chatbot in the last few years, but also to some extent, computer vision is self supervised learning.
那么什么是自监督学习?
So what is self supervised learning?
就是获取一个输入——无论是图像、视频、文本还是其他——以某种方式破坏它,然后训练一个大型神经网络来重建它,基本上是恢复其未损坏的版本、未失真的版本,或者通过采取行动可能产生的转换版本。
You take an input, be it an image, a video, a piece of text, whatever, corrupt it in some way, and train a big neural net to reconstruct it: basically, recover the uncorrupted version of it, or the undistorted version, or a transformed version that would result from taking an action.
明白吗?
Okay?
也就是说,比如在文本处理中,取一段文字,删掉部分单词,然后训练一个大型神经网络来预测缺失的单词。
And, you know, that that would mean, for example, in the context of text, take a piece of text, remove some of the words, and then train some big neural net to predict the words that are missing.
对于图像处理,则去掉图像的某些部分,然后训练大型神经网络来恢复完整图像。
Take an image, remove some pieces of it, and then train a big neural net to recover the full image.
视频处理则是去掉片段,训练神经网络预测缺失内容。
Take a video, remove a piece of it, train a neural net to predict what's missing.
明白吗?
Okay?
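A minimal sketch of this corrupt-and-reconstruct recipe, with a trivial mean-predictor standing in for the big neural net:

```python
import random

# Hide a chunk of the input and score a "model" on recovering it.
# The model here just predicts the mean of the visible values, purely
# illustrative of where the reconstruction loss is measured.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(16)]   # clean input (image/text/video)
masked = set(range(4, 8))                      # corrupt: hide a chunk

visible = [v for i, v in enumerate(x) if i not in masked]
prediction = sum(visible) / len(visible)       # one guess for every hidden slot
loss = sum((prediction - x[i]) ** 2 for i in masked) / len(masked)
print(f"reconstruction loss: {loss:.3f}")      # training would minimize this
```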
所以大语言模型是这种方法的特例,你拿一段文本,训练系统去重现这段文本。
So LLMs are a special case of this, where you take a text and train the system to just reproduce the text.
你不需要破坏文本,因为系统的设计方式让它只能根据左侧的标记来预测文本中的某个特定单词或标记。
And you don't need to corrupt the text because the system is designed in such a way that to predict one particular word or token in the text, it can only look at the tokens that are to the left of it.
懂了吗?
Okay?
因此,实际上,系统的架构中硬编码了它无法通过现在和未来来预测现在这一事实。
So, in effect, the system has hardwired into its architecture the fact that it cannot look at the present and the future to predict the present.
它只能回顾过去。
It can only look at the past.
明白吗?
Okay?
但基本上,就是训练这个系统在其输出中重现其输入。
But basically, train that system to just reproduce its input on its output.
懂了吗?
Okay?
所以这种架构被称为因果架构。
So this kind of architecture is called a causal architecture.
这就是大型语言模型(LLM)的本质。
And this is what an LLM is, a large language model.
全世界所有的聊天机器人都是基于这个原理构建的。
That's what all the chatbots in the world are based on.
取一段文本并训练系统,使其在输出中重现这段文本。
Take a piece of text and train the system to just reproduce that piece of text on its output.
而要预测某个特定单词时,它只能查看该单词左侧的内容。
And to predict a particular word, it can only look at the words to the left of it.
因此现在你得到的系统是:给定一段文本,能够预测该文本后续的单词。
And so now what you have is a system that, given a piece of text, can predict the word that follows that text.
你可以将这个预测出的单词移入输入,然后预测第二个单词。
And you can take that predicted word, shift it into the input, and then predict the second word.
嗯。
Mhmm.
将其移入输入,预测第三个单词。
Shift that into the input, predict the third word.
这被称为自回归预测。
That's called autoregressive prediction.
这并不是一个新概念。
It's not a new concept.
非常古老。
Very old.
所以,你看,自监督学习并不是为了完成某项特定任务而训练系统,而是为了捕捉数据的内在结构。
So, you know, self-supervised learning does not train a system to accomplish a particular task, other than capturing the internal structure of the data.
它完全不需要人工标注。
It doesn't require any labeling by a human.
明白吗?
Okay?
那么把这个方法应用到图像上。
So apply this to images.
取一张图片,遮掉其中一部分,比如若干个图像块,然后训练一个大型神经网络来重建缺失部分。
Take an image, mask a chunk of it, like a bunch of patches from it, if you want, and then train a big neural net to reconstruct what is missing.
现在就可以用系统学习到的图像内部表征作为后续下游任务的输入,比如图像识别、分割等等,虽然效果有限。
And now use the internal representation of the image learned by the system as input to a subsequent downstream task, for, I don't know, image recognition, segmentation, whatever it is. It works to some extent, but not great.
所以我们在FAIR有一个这样的大型项目。
So there's a big project like this at FAIR.
它被称为MAE(掩码自编码器)。
It's called MAE, masked autoencoder.
这是去噪自编码器的一种特例,而后者正是我提出自监督学习理念的通用框架基础。
It's a special case of denoising autoencoder, which itself is the sort of general framework from which I derive this idea of self supervised learning.
所以效果不是很好。
So it doesn't work so well.
还有各种方法,比如应用到视频领域——我研究这个已经快二十年了——可以截取视频片段,训练系统预测视频后续内容。
And if you apply this to video, and I've been working on this for almost twenty years now, you take a video, show just a piece of it, and then train the system to predict what's going to happen next in the video.
所以这和文本处理是同样的思路,只是应用于视频。
So it's the same idea as for text, but for video.
而且这种方法效果也不太好。
And that doesn't work very well either.
那么为什么它对文本有效而对视频无效呢?
And why does it work for text and not for video, for example?
答案是预测文本中下一个单词要容易得多。
And the answer is it's easy to predict a word that comes after a text.
你无法准确预测特定文本后接哪个词,但可以生成类似词典中所有可能单词或所有可能标记的概率分布。
You cannot exactly predict which word follows a particular text, but you can produce something like a probability distribution over all the possible words in a dictionary, all the possible tokens.
可能的标记数量大约只有10万个左右。
There's only about 100,000 possible tokens.
所以你只需生成一个包含10万个不同数字的大向量,这些数字均为正数且总和为一。
So you just produce a big vector with 100,000 different numbers that are positive and sum to one.
好的。
Okay.
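That big vector of positive numbers summing to one is a softmax over the vocabulary; a tiny eight-entry vocabulary stands in for the roughly 100,000 tokens:

```python
import math

# Turn arbitrary scores (logits) into a probability distribution.
def softmax(logits):
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.5, 0.0, -1.0, -0.5, 1.5, 0.2])
print(all(v > 0 for v in p), round(sum(p), 6))   # all positive, sums to one
```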
那么,对于视频中所有可能的帧或图像中所有缺失部分,你打算如何表示其概率分布呢?
Now what are you going to do to represent a probability distribution over all possible frames in a video or all possible missing parts of an image?
我们尚不知如何妥善处理这个问题。
We don't know how to do this properly.
事实上,在高维连续空间中表示分布在数学上是难以处理的。
In fact, it's mathematically intractable to represent distributions in high dimensional continuous spaces.
明白了吗?
Okay?
我们不知道如何以一种实用的方式来做这件事,如果你想的话。
We don't know how to do this in a kind of useful way, if you want.
而且我尝试了很长时间,你懂的,为视频做这件事。
And I've tried to, you know, do this for video for a long time.
因此,这就是为什么那些使用生成模型进行自监督学习的想法至今未能成功的原因。
And so that is the reason why those ideas of self supervised learning using generative models have failed so far.
这也是为什么,你知道,试图训练一个视频生成系统作为理解世界运行方式的方法。
And this is why, you know, trying to train a video generation system as a way to get a system to understand how the world works.
这就是为什么它无法成功。
That's why it can't succeed.
那么替代方案是什么?
So what's the alternative?
替代方案是一种非生成式架构,我们称之为JEPA,即联合嵌入预测架构。
The alternative is something that is not a generative architecture, which we call JEPA: Joint Embedding Predictive Architecture.
我们知道这比尝试重建效果要好得多。
And we know this works much better than attempting to reconstruct.
因此,我们多年来在图像表征学习方面取得的实验成果表明:与其对图像进行损坏并尝试重建图像,我们采用原始完整图像及其损坏版本。
So we've had experimental results on learning good representations of images going back many years where instead of taking an image, corrupting it, and attempting to reconstruct this image, we take the original full image and the corrupted version.
我们将两者都通过神经网络处理。
We run them both through neural nets.
这些神经网络会生成这两张图像(原始图像和损坏图像)的表征。
Those neural nets produce representations of those two images, the initial one and the corrupted one.
然后我们训练另一个神经网络(预测器),从损坏图像的表征预测完整图像的表征。
And we train another neural net, a predictor, to predict the representation of the full image from the representation of the corrupted one.
嗯。
Mhmm.
好的。
Okay.
如果你成功训练了一个这种类型的系统,它并不是为了重建任何东西而训练的。
And if you successfully train a system of this type, it is not trained to reconstruct anything.
它只是被训练来学习一种表征,以便你可以在表征层内进行预测。
It's just trained to learn a representation so that you can make predictions within the representation layer.
你必须确保表征层尽可能多地包含输入信息,这实际上正是难点所在。
And you have to make sure that the representation contains as much information as possible about the input, which is actually why it's difficult.
这就是训练这类系统的困难之处。
That's the difficult part of training those systems.
因此这被称为JEPA(联合嵌入预测架构)。
So that's called a JEPA, Joint Embedding Predictive Architecture.
要训练系统学习图像的良好表征,这些联合嵌入架构比通过重建训练的生成式架构效果更好。
And to train a system to learn good representations of images, those joint embedding architectures work much better than the ones that are generative that are trained by reconstruction.
现在我们也有了适用于视频的版本。
And now we have a version that works for video, too.
我们选取一段视频。
So we take a video.
通过遮蔽大片段来破坏它。
We corrupt it by masking a big chunk of it.
将完整视频和被破坏的视频输入完全相同的编码器进行处理。
We run the full video and the corrupted one through encoders that are identical.
然后同时,我们训练一个预测器,从部分视频预测完整视频的表征。
And then, simultaneously, we train a predictor to predict the representation of the full video from the partial one.
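The setup just described can be sketched as follows; untrained random linear maps stand in for the encoder and predictor, so this only illustrates where the loss is computed (in representation space, not pixel space):

```python
import numpy as np

# Toy JEPA forward pass: encode the full and corrupted inputs with the same
# encoder, then predict the full input's representation from the corrupted
# one's. Nothing here is trained; the maps are random stand-ins.
rng = np.random.default_rng(0)
d_in, d_rep = 64, 16
enc = rng.normal(size=(d_rep, d_in)) / np.sqrt(d_in)      # shared encoder
pred = rng.normal(size=(d_rep, d_rep)) / np.sqrt(d_rep)   # predictor

x_full = rng.normal(size=d_in)
x_corrupt = x_full.copy()
x_corrupt[32:] = 0.0                       # mask a big chunk of the input

s_full = np.tanh(enc @ x_full)             # target representation
s_corrupt = np.tanh(enc @ x_corrupt)
s_hat = pred @ s_corrupt                   # prediction in representation space
loss = float(np.mean((s_hat - s_full) ** 2))  # training would minimize this
print(f"representation-space loss: {loss:.3f}")
```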
当系统学习到的视频表征被输入到一个训练用于识别视频中发生的动作、或判断视频是否可能等任务的系统时,其表现实际上相当出色。
And the representation that the system learns of videos, when you feed it to a system that you train to tell you, for example, what action is taking place in the video or whether the video is possible or impossible or things like that, it actually works quite well.
这很酷。
That's cool.
所以它提供了那种抽象思维
So it gives that abstract thinking
是的。
Yeah.
在某种程度上。
In a way.
对。
Right.
而且我们有实验结果表明,这种联合嵌入训练——我们有好几种方法可以实现这一点。
And we have experimental results that show this joint embedding training works. We have several methods for doing this.
其中一种叫DINO,另一种叫VICReg,还有一种是I-JEPA,那是一种蒸馏方法。
There's one that's called DINO, another one that's called VICReg, and another one called I-JEPA, which is sort of a distillation method.
所以我们有几种不同的方法来解决这个问题,但其中一种将引导出一个基本方案,为我们提供训练这些JEPA架构的通用方法。
And so we have several different ways to approach this, but one of those is going to lead to a recipe that basically gives us a general way of training those JEPA architectures.
好的。
Okay.
所以它不是生成式的,因为系统并未被训练去重建输入的缺失部分。
So it's not generative, because the system is not trained to regenerate part of the input.
它被训练用于生成输入的一种抽象表示。
It's trained to generate a representation, an abstract representation of the input.
这样你就能忽略输入中所有真正不可预测的细节。
And what that allows you to do is to ignore all the details about the input that are really not predictable.
比如,你把笔竖着放在桌上,松手时无法预测它会倒向哪个方向。
Like, you know, the pen that you put on the table vertically, and when you let it go, you cannot predict in which direction it's gonna fall.
但在某个抽象层面上,你可以说笔会倒下。
But at some abstract level, you can say that the pen is going to fall.
对,但不表示具体方向。
Right, without representing the direction.
这就是JEPA的核心思想。
So that's the idea of JEPA.
我们开始在一些系统上取得不错的成果。
And we're starting to have, you know, good results with these kinds of systems.
比如V-JEPA系统,就是在大量自然视频上训练的。
So the V-JEPA system, for example, is trained on lots of natural videos.
然后你可以给它展示不可能的视频,比如物体消失或变形的视频。
And then you can show it a video that's impossible, like a video where, for example, an object disappears or changes shape.
明白吗?
Okay?
你可以用游戏引擎之类的工具生成这种视频。
You can generate this with a game engine or something.
或者像球体滚动,滚到屏幕后停止这样的场景。
Or a situation where you have a ball rolling, and it rolls and it stops behind a screen.
然后屏幕降下来,球却不见了。
And then the screen comes down, and the ball is not there anymore.
对。
Right.
明白吗?
Okay?
就是类似这样的情况。
So things like this.
然后你测量系统的预测误差。
And you measure the prediction error of the system.
所以这个系统是被训练来预测的。
So the the system is trained to predict.
对吧?
Right?
不一定在时间上,而是基本上预测视频的连贯性。
And not necessarily in time, but, basically, to predict, you know, the sort of coherence of the video.
因此,你在给系统播放视频时测量预测误差。
And so you measure the prediction error as you show the video to the system.
当出现不可能的情况时,预测误差就会飙升。
And when something impossible occurs, the prediction error goes through the roof.
这样你就能检测系统是否整合了某些关于物理上什么可能、什么不可能的概念,而它仅仅用物理上可能的自然视频训练过。
And so you can detect whether the system has integrated some idea of what is physically possible or not, despite being trained only on physically possible natural videos.
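用预测误差的飙升来标记不可能事件的检测思路,可以用一个玩具示例勾勒。下面的恒等预测器和人为构造的表征序列都是假设的占位,仅用于演示:
The detection scheme just described, where a spike in prediction error flags an impossible event, can be sketched with a toy example. The identity predictor and the synthetic representation sequence below are invented stand-ins, purely illustrative:

```python
import numpy as np

def surprise_scores(embeddings, predictor):
    # Per-step prediction error of a next-representation predictor.
    # embeddings: (T, d) array of abstract per-frame representations.
    errors = []
    for t in range(len(embeddings) - 1):
        pred = predictor(embeddings[t])
        errors.append(float(np.linalg.norm(pred - embeddings[t + 1])))
    return errors

# Toy world: representations drift smoothly, then an "impossible" jump
# (the ball is suddenly gone when the screen comes down).
rng = np.random.default_rng(0)
emb = np.cumsum(rng.normal(0.0, 0.01, size=(10, 4)), axis=0)
emb[7] += 5.0

identity = lambda z: z            # assume the world barely changes per frame
scores = surprise_scores(emb, identity)
print(scores[6] > 1.0)            # True: the error spikes at the jump
```

A real V-JEPA style evaluation would compute these errors with the learned encoder and predictor rather than this hand-built sequence, but the readout is the same: large prediction error marks the physically implausible moment.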
这确实非常有趣
So that that's really interesting.
这算是系统获得某种程度常识的第一个迹象
That's sort of the first hint that a system has acquired some level of common sense.
是的。
Yes.
我们还有这类系统的所谓动作条件版本。
We also have versions of those systems that are so-called action-conditioned.
基本上我们有一些视频片段或图像,展示时间t的世界状态,然后执行某个动作,比如机械臂移动之类的
So, basically, we have things where we have a chunk of video or an image of, you know, the state of the world at time t, and then an action is being taken, like, you know, a robot arm is being moved or whatever.
当然,我们随后可以观察这一行动产生的结果。
And then, of course, we can observe the result of this action.
所以当我们用这种方式训练JEPPA时,模型基本上可以说:这是时间t时的世界状态。
So now, when we train a JEPA with this, the model basically can say: here is the state of the world at time t.
这是你可能采取的一个行动。
Here is an action you might take.
我可以预测在这个抽象表示空间中,时间t加一时世界的状态。
I can predict the state of the world at time t plus one in this abstract representation space.
这是对世界运行方式的学习。
There's this learning of how the world works.
关于世界如何运转。
Of how the world works.
而最酷的地方在于,现在你可以让系统去想象一系列动作会产生什么结果。
And the cool thing about this is that now you can have the system imagine what the outcome of a sequence of actions would be.
如果你给它一个目标,比如我希望世界最终呈现这种状态,你能找出一系列动作来帮我实现这个目标吗?
And if you give it a goal saying, like, I want the world to look like this at the end, can you figure out a sequence of actions to get me to that point?
它实际上可以通过搜索一系列能产生该结果的动作来解决问题。
It can actually figure out, by searching, a sequence of actions that will actually produce that result.
这就是规划。
That's planning.
这就是推理。
That's reasoning.
这是真正的推理和实际的规划。
That's actual reasoning and actual planning.
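这里所说的规划,机械地讲,就是在动作序列上搜索,使预测的末状态最接近目标。下面的世界模型(动作直接加到抽象状态上)是虚构的占位,并非学到的JEPA预测器:
Planning, mechanically, means searching over action sequences for the one whose predicted end state is closest to the goal. The world model below (an action is simply added to an abstract state) is an invented stand-in for a learned JEPA predictor:

```python
import itertools
import numpy as np

def predict(state, action):
    # Hypothetical action-conditioned world model: next abstract state.
    return state + action

def plan(state, goal, actions, horizon):
    # Exhaustive search over action sequences (fine for tiny spaces);
    # keep the sequence whose predicted end state is closest to the goal.
    best_seq, best_cost = None, float("inf")
    for seq in itertools.product(actions, repeat=horizon):
        s = state
        for a in seq:
            s = predict(s, a)
        cost = float(np.linalg.norm(s - goal))
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost

moves = [np.array(v) for v in ([1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0])]
seq, cost = plan(np.zeros(2), np.array([2.0, 1.0]), moves, horizon=3)
print(cost)  # 0.0: e.g. right, right, up reaches the goal exactly
```

Real systems replace the exhaustive enumeration with gradient-based or sampling-based optimization over the action sequence, but the structure (roll the world model forward, score against the goal, pick the best actions) is the same.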
好的。
Okay.
Jan,我得放你走了,我们已经超时了,但你能用60秒谈谈对DeepSeek的看法吗?开源模型现在是否已经超越了专有模型?我们真的只能给你60秒。
Jan, I have to get you out of here, we are over time, but can you give me, like, sixty seconds: your reaction to DeepSeek, and has open source overtaken the proprietary models at this point? And we gotta limit it to sixty seconds.
不然你的团队会杀了我的。
Otherwise, I'm gonna get killed by your team here.
所以
So
超越这个词有点重了。
Overtaken is a is a strong word.
我认为开源世界的进展更快。
I think progress is faster in the open source world.
这是肯定的。
That's for sure.
但当然,你知道,专有厂商也在从开源世界的进步中获利,对吧?
But, of course, you know, the proprietary shops are profiting from the progress of the open source world, right?
他们和其他人一样都能获取这些信息。
They get access to that information like everybody else.
所以很明显的是,开源世界涌现出更多有趣的创意,任何一家公司无论规模多大都无法独自想到。
So what's clear is that there are many more interesting ideas coming out of the open source world than any single shop, as big as it may be, can come up with.
没有人能垄断好创意。
Nobody has a monopoly on good ideas.
因此开源世界的魔力在于它能汇聚全球的人才。
And so the magic of the open source world is that it recruits talent from all over the world.
因此我们在DeepSeek项目中看到,如果组建一个小型团队并给予较长研发周期和较少限制条件来开发下一代大语言模型,他们确实能提出前所未有的创新想法,对吧?
And so what we've seen with DeepSeek is that if you set up a small team with a relatively long leash and few constraints on coming up with just the next generation of LLMs, they can actually come up with new ideas that nobody else has come up with, right?
他们能在某种程度上重新定义做事方式
They can sort of reinvent a little bit how you do things.
如果他们将这些成果与世界共享,整个人类文明都会进步
And then if they share that with the rest of the world, then the entire world progresses.
明白吗?
Okay?
因此这清楚地表明开源发展更快,开源世界能孕育更多创新,而专有领域可能难以追赶。
And so it clearly shows that open source progresses faster, and a lot more innovation can take place in the open source world, which the proprietary world may have a hard time catching up with.
它的运营成本更低。
It's cheaper to run.
我们看到的是,你知道,与我们交流的合作伙伴表示,他们的客户在原型设计阶段可能会使用专有API。
What we see with, you know, the partners we talk to, is they say, well, our clients, when they prototype something, may use a proprietary API.
但当真正部署产品时,他们实际上会使用Llama或其他开源引擎,因为这样更经济且更安全。
But when it comes time to actually deploy the product, they actually use Llama or other open source engines because it's cheaper and it's more secure.
要知道,它更可控。
You know, it's more controllable.
你可以本地部署运行。
You can run it on premise.
明白吗,它有各种优势。
You know, there's all kinds of advantages.
所以我们看到一些人的观念也发生了很大转变,他们最初担心开源成果会...比如说帮助中国之类的,如果你有地缘政治方面的顾虑。
So we've seen also a big evolution in the thinking of some people who are initially worried that open source efforts were going to, I don't know, for example, help the Chinese or something, if you have some geopolitical reason to think it's a bad idea.
但DeepSeek表明的事实是,中国人并不需要我们。
But what DeepSeek has shown is that the Chinese don't need us.
我是说,他们自己能想出非常棒的点子。
I mean, they can come up with really good ideas.
对吧?
Right?
我们都知道中国有非常非常优秀的科学家。
I mean, we all know that there are really, really good scientists in China.
而鲜为人知的是,整个科学界被引用次数最多的论文是一篇十年前(2015年)关于深度学习的论文,它出自北京。
And one thing that is not widely known is that the single most cited paper in all of science is a paper on deep learning from ten years ago, from 2015, and it came out of Beijing.
哦。
Oh.
好的。
Okay.
这篇论文名为《ResNet》。
The paper is called ResNet.
这是一种特殊类型的神经网络架构,其基本设计理念是:深度学习系统的每个阶段默认都计算恒等函数。
So it's a particular type of architecture of neural net where basically by default, every stage in a deep learning system computes the identity function.
你只需将输入原样复制到输出端。
You just copy its input on its output.
而神经网络要做的是计算与这个恒等关系的偏差。
And what the neural net does is compute the deviation from this identity.
这使得训练具有数十层(甚至可能上百层)的超深度神经网络成为可能。
So that allows training extremely deep neural nets with dozens of layers, perhaps 100 layers.
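他描述的默认恒等设计可以直接勾勒:每一级输出输入本身加上一个学到的偏差。下面的 residual_block 是玩具占位,并非真实的 ResNet 层:
The identity-by-default design he describes can be sketched directly: each stage outputs its input plus a learned deviation. The residual_block below is a toy stand-in, not the actual ResNet layer:

```python
import numpy as np

def residual_block(x, weight):
    # One stage computes the identity plus a learned deviation:
    # out = x + f(x). With weight == 0, f(x) == 0 and the stage is
    # exactly the identity, so very deep stacks start out easy to train.
    deviation = np.tanh(x @ weight)   # hypothetical per-stage function f
    return x + deviation

x = np.ones(3)
out = x
for _ in range(100):                  # 100 stages, all initialized at identity
    out = residual_block(out, np.zeros((3, 3)))
print(bool(np.allclose(out, x)))      # True: the signal passes through unchanged
```

This is why depth stops hurting optimization: gradients and activations flow through the identity path even before any stage has learned anything.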
这篇论文的第一作者是一位名叫何恺明的先生。
The first author of that paper is a gentleman called Kaiming He.
当时,他在微软亚洲研究院北京分部工作。
At the time, he was working at Microsoft Research Beijing.
嗯。
Mhmm.
在那篇论文发表后不久,他加入了加州FAIR(Facebook人工智能研究院),我聘请了他,他在FAIR工作了大约八年,最近离职,现在是麻省理工学院的教授。
Soon after the publication of that paper, he joined FAIR in California, so I hired him, and he worked at FAIR for eight years or so. He recently left and is now a professor at MIT.
所以世界各地确实都有非常优秀的科学家。
So there are really, really good scientists everywhere in the world.
没有人能垄断好创意。
Nobody has a monopoly on good ideas.
硅谷当然也没有好创意的垄断权。
Certainly Silicon Valley does not have a monopoly on good ideas.
另一个例子是,事实上第一个Llama模型出自巴黎。
Or another example of that: the first Llama actually came out of Paris.
它出自巴黎的FAIR实验室,一个由12人组成的小团队。
It came out of the FAIR lab in Paris, a small team of 12 people.
因此,你必须利用全球各地多元化的思想、背景和创造力,如果你想推动科学与技术的快速发展,而这正是开源所促成的。
So you have to take advantage of the diversity of ideas, backgrounds, and creative juices of the entire world if you want science and technology to progress fast, and that's enabled by open source.
Jan,和你交谈总是很愉快。
Jan, it is always great to speak with you.
这是我们第四次或第五次交谈了,我想,要追溯到九年前了。
This is, I think, our fourth or fifth time speaking, going back nine years.
你总是帮助我看穿所有炒作和喧嚣,真正理解正在发生的事情。
You always helped me see through all the hype and the buzz and actually figure out what's happening.
我相信这对我们的听众和观众来说也会是如此。
I'm sure that's gonna be the case for our listeners and viewers as well.
所以,Yan,非常感谢你的到来。
So, Yan, thank you so much for coming on.
希望我们能很快再次合作。
Hopefully, we do it again soon.
谢谢你,Alex。
Thank you, Alex.
好的,各位。
Alright, everybody.
感谢大家的观看。
Thank you for watching.
我们将在周五回来解析本周的新闻。
We'll be back on Friday to break down the week's news.
在那之前,我们下次在《大科技播客》再见。
Until then, we'll see you next time on Big Technology Podcast.