Boy, do you guys have a lot of thoughts about this interview. I've been thinking about it myself, and I think I have a much better understanding now of Sutton's perspective than I did during the interview itself. So I wanted to reflect on how I understand his whole view now. And Richard, apologies if there are still any errors or misunderstandings. It's been very productive to learn from your thoughts.
Okay. So here's my understanding of the steelman of Richard's position. Obviously, he wrote the essay in question, The Bitter Lesson. And what is this essay about? Well, it's not saying that you just want to throw as much compute as you possibly can at a problem.
The Bitter Lesson says that you want to come up with techniques which most effectively and scalably leverage compute. Most of the compute that's spent on an LLM is used in running it during deployment. And yet it's not learning anything during this entire period. It's only learning during the special phase that we call training. And so this is obviously not an effective use of compute.
And what's even worse is that this training period by itself is highly inefficient, because these models are usually trained on the equivalent of tens of thousands of years of human experience. And what's more, during this training phase, all of their learning is coming straight from human data. Now, this is an obvious point in the case of pre-training data, but it's even kind of true for the RLVR that we do with these LLMs. These RL environments are human-furnished playgrounds to teach LLMs the specific skills that we have prescribed for them. The agent is in no substantial way learning from organic and self-directed engagement with the world.
Having to learn only from human data, which is an inelastic and hard-to-scale resource, is not a scalable way to use compute. Furthermore, what these LLMs learn from training is not a true world model, which would tell you how the environment changes in response to different actions that you take. Rather, they're building a model of what a human would say next. And this leads them to rely on human-derived concepts. A way to think about this would be: suppose you trained an LLM on all the data up to the year 1900.
That LLM probably wouldn't be able to come up with relativity from scratch. And maybe here's a more fundamental reason to think this whole paradigm will eventually be superseded: LLMs aren't capable of learning on the job. So we'll need some new architecture to enable this kind of continual learning. And once we do have this architecture, we won't need a special training phase.
The agents will just be able to learn on the fly like all humans, and in fact, like all animals are able to do. And this new paradigm will render our current approach with LLMs, and their special training phase that's super sample-inefficient, totally obsolete. So that's my understanding of Rich's position. My main difference with Rich is just that I don't think the concepts he's using to distinguish LLMs from true intelligence or animal intelligence are actually that mutually exclusive or dichotomous. For example, I think imitation learning is continuous with and complementary to RL.
And relatedly, models of humans can give you a prior which facilitates learning true world models. I also wouldn't be surprised if some future version of test-time fine-tuning could replicate continual learning, given that we've already managed to accomplish this somewhat with in-context learning. So let's start with my claim that imitation learning is continuous with and complementary to RL. I tried to ask Richard a couple of times whether pre-trained LLMs can serve as a good prior on which we can accumulate the experiential learning, aka do the RL, which would lead to AGI. Ilya Sutskever gave a talk a couple of months ago that I thought was super interesting, and he compared pre-training data to fossil fuels.
And I think this analogy actually has remarkable reach. Just because fossil fuels are not a renewable resource does not mean that our civilization ended up on a dead-end track by using them. In fact, they were absolutely crucial. You simply couldn't have transitioned from the water wheels of 1800 straight to solar panels and fusion power plants. We had to use this cheap, convenient, and plentiful intermediary to get to the next step.
AlphaGo, which was conditioned on human games, and AlphaZero, which was bootstrapped from scratch, were both superhuman Go players. Now, of course, AlphaZero was better. So you can ask the question: will we, or will the first AGIs, eventually come up with a general learning technique that requires no initialization of knowledge and that just bootstraps itself from the very start? And will it outperform the very best AIs that have been trained up to that date? I think the answer to both of these questions is probably yes.
But does this mean that imitation learning must not play any role whatsoever in developing the first AGI, or even the first ASI? No. AlphaGo was still superhuman despite being initially shepherded by human player data. The human data isn't necessarily actively detrimental; it's just that at enough scale, it isn't significantly helpful. AlphaZero also used much more compute than AlphaGo.
The accumulation of knowledge over tens of thousands of years has clearly been essential to humanity's success. In any field of knowledge, thousands and probably actually millions of previous people were involved in building up our understanding and passing it on to the next generation. We obviously didn't invent the language we speak nor the legal system we use. Also, even most of the technologies in our phones were not directly invented by the people who are alive today. This process is more analogous to imitation learning than it is to RL from scratch.
Now, of course, are we literally predicting the next token like an LLM would in order to do this cultural learning? No, of course not. So even the imitation learning that humans are doing is not like the supervised learning that we do for pre-training LLMs, but neither are we running around trying to collect some well-defined scalar reward. No ML training regime perfectly describes human learning or animal learning. We're doing things which are analogous both to RL and to supervised learning.
What planes are to birds, supervised learning might end up being to human cultural learning. I also don't think these learning techniques are actually categorically different. Imitation learning is just short-horizon RL where the episode is one token long. The LLM is making a conjecture about the next token based on its understanding of the world and how the different pieces of information in the sequence relate to each other.
And it receives reward in proportion to how well it predicted the next token. Now, of course, I already hear people saying, no, no, that's not the ground truth. It's just learning what a human was likely to say. And I agree, but there's a different question, which I think is actually more relevant to understanding the scalability of these models. And that question is, can we leverage this imitation learning to help models learn better from ground truth?
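To make this one-token-episode framing concrete, here's a minimal sketch (the toy next-token distribution is made up purely for illustration) showing that the supervised next-token loss and a one-step RL objective with a log-probability reward are literally the same quantity:

```python
import math

# Toy next-token distribution from a hypothetical policy; the numbers are
# illustrative only, not from any real model.
policy_probs = {"the": 0.2, "cat": 0.7, "sat": 0.1}

def imitation_loss(target_token):
    """Supervised view: negative log-likelihood of the human's actual token."""
    return -math.log(policy_probs[target_token])

def one_token_episode_objective(target_token):
    """RL view: a one-token 'episode' whose reward is the log-probability
    the policy assigned to the observed token. Minimizing the negative
    reward is the same optimization as minimizing the supervised loss."""
    reward = math.log(policy_probs[target_token])
    return -reward

# Both framings yield an identical training signal.
assert imitation_loss("cat") == one_token_episode_objective("cat")
```

The only real difference between the two views is bookkeeping: in the imitation case the "reward" arrives densely, once per token, rather than once per long rollout.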
And I think the answer is obviously yes. After RL'ing these pre-trained base models, we've gotten them to win gold in International Math Olympiad competitions and to code up entire working applications from scratch. Now, these are ground-truth examinations: can you solve this unseen Math Olympiad question? Can you build this application to match a specific feature request?
But you couldn't have RL'ed a model to accomplish these tasks from scratch, or at least we don't know how to do that yet. You needed a reasonable prior over human data in order to kickstart this RL process. Whether you want to call this prior a proper world model or just a model of humans, I don't think is that important. It honestly seems like a semantic debate, because what you really care about is whether this model of humans helps you start learning from ground truth, aka become a true world model. It's a bit like saying to somebody pasteurizing milk: hey, you should stop boiling that milk, because eventually you want to serve it cold.
Of course, but this is an intermediate step to facilitate the final output. By the way, LLMs are clearly developing a deep representation of the world, because there are training processes incentivizing them to develop one. I use LLMs to teach me about everything from biology to AI to history, and they are able to do so with remarkable flexibility and coherence. Now, are LLMs specifically trained to model how their actions will affect the world? No, they are not.
But if we're not allowed to call their representations a world model, then we're defining the term "world model" by the process that we think is necessary to build one, rather than by the obvious capabilities that this concept implies. Okay, continual learning. I'm sorry to bring up my hobby horse again. I'm like a comedian who has only come up with one good bit, but I'm gonna milk it for all it's worth. An LLM that's being RL'd on outcome-based rewards learns on the order of one bit per episode.
And an episode might be tens of thousands of tokens long. Now, obviously, animals and humans are clearly extracting more information from interacting with their environment than just the reward signal at the end of an episode. Conceptually, how should we think about what is happening with animals? I think we're learning to model the world through observations. This outer-loop RL is incentivizing some other learning system to pick up maximum signal from the environment.
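A quick back-of-the-envelope comparison makes that gap vivid; the episode length and vocabulary size below are assumed round numbers, not measurements of any particular model:

```python
import math

# Assumed round numbers for illustration only.
episode_tokens = 30_000   # a long agentic rollout
vocab_size = 50_000       # typical LLM vocabulary scale

# A binary outcome reward (success/failure) carries at most 1 bit.
bits_from_outcome = math.log2(2)

# If the agent instead learned from every observed token, each prediction
# could carry up to log2(vocab_size) bits of supervisory signal.
bits_from_observations = episode_tokens * math.log2(vocab_size)

print(bits_from_outcome)                 # 1.0
print(f"{bits_from_observations:,.0f}")  # hundreds of thousands of bits
```

The absolute numbers are crude upper bounds, but the ratio is the point: the observation stream carries orders of magnitude more learnable signal per episode than the terminal reward does.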
In Richard's OaK architecture, he calls this the transition model. And if we were trying to pigeonhole this feature spec into modern LLMs, what you'd do is fine-tune on all your observed tokens. From what I hear from my researcher friends, in practice the most naive way of doing this actually doesn't work very well. Now, being able to learn from the environment in a high-throughput way is obviously necessary for true AGI. And it clearly doesn't exist in LLMs trained with RLVR.
But there might be some other relatively straightforward ways to shoehorn continual learning atop LLMs. For example, one could imagine making supervised fine-tuning a tool call for the model, so that the outer-loop RL incentivizes the model to teach itself effectively using supervised learning in order to solve problems that don't fit in the context window. Now, I'm genuinely agnostic about how well techniques like this will work. I'm not an AI researcher, but I wouldn't be surprised if they basically replicate continual learning.
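Here's a toy sketch of what that outer loop might look like. Everything in it (the `Model` class, the `fine_tune` tool, the task format) is a hypothetical stand-in meant only to show the control flow, not a real training setup:

```python
# Toy sketch: "supervised fine-tuning as a tool call". The outer loop
# rewards solved tasks, so the model is incentivized to fold observations
# into its weights whenever they won't fit in its (tiny) context window.

class Model:
    def __init__(self):
        self.weights_memory = []  # stand-in for knowledge baked into weights

    def fine_tune(self, facts):
        """The 'tool call': the model updates its own weights on new data."""
        self.weights_memory.extend(facts)

    def can_solve(self, task, context):
        # Succeeds if the needed fact is in context or already in weights.
        return task in context or task in self.weights_memory

def outer_loop(model, tasks, context_limit=1):
    """Outer-loop 'RL': count solved tasks; on failure, the model invokes
    fine_tune so that repeats of the task succeed later."""
    reward, context = 0, []
    for task in tasks:
        if model.can_solve(task, context):
            reward += 1
        else:
            model.fine_tune([task])  # teach itself the missing fact
        context = (context + [task])[-context_limit:]  # tiny sliding window
    return reward
```

With a one-task context window, the model fails each novel task once, fine-tunes on it, and then solves every repeat from weights. The real open question, which this toy obviously doesn't answer, is whether gradient updates on observed tokens transfer as robustly as in-context information does.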
And the reason is that models are already demonstrating something resembling human continual learning within their context windows. The fact that in-context learning emerged spontaneously from the training incentive to process long sequences makes me think that if information could just flow across horizons longer than the context limit, then models could meta-learn the same flexibility that they already show in context. Okay, some concluding thoughts. Evolution does meta-RL to make an RL agent, and that agent can selectively do imitation learning. With LLMs, we're going the opposite way.
We have first made this base model that does pure imitation learning, and then we're hoping that we can do enough RL on it to make a coherent agent with goals and self-awareness. Maybe this won't work, but I don't think these super first-principles arguments about, for example, how these LLMs don't have a true world model are actually proving much. And I also don't think they're strictly accurate for the models we have today, which are actually undergoing a lot of RL on ground truth. Even if Sutton's Platonic ideal doesn't end up being the path to the first AGI, his first-principles critique is identifying some genuine basic gaps that these models have.
And we don't even notice them, because they're so pervasive in the current paradigm. But because he has this decades-long perspective, they're obvious to him: the lack of continual learning, the abysmal sample efficiency of these models, their dependence on exhaustible human data. If the LLMs do get to AGI first, which is what I expect to happen, the successor systems that they build will almost certainly be based on Richard's vision.