本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
所以最终,今年才是算力超级周期真正开始的时刻。过去三年里,人们每个月都在说我们遇到了瓶颈。但看看这些模型的生成方式,每个环节都有巨大的改进空间——这就像用胶带勉强粘合的原始流水线,全靠大家的拼命加班和辛勤汗水在支撑,每个部分都还有太多成长空间。我觉得值得站在屋顶上大声疾呼。
So finally, this year is where the compute super cycle is, like, beginning properly. People have said that we've been hitting a plateau every month for the last three years. But I look at how these models are produced, and every part of it could be improved so much. It's a primitive pipeline held together by duct tape and best efforts and elbow grease and late nights, and there's just so much room to grow on every part of it. I think it's worth crying from the rooftops.
所有可量化的指标似乎都在飞速提升。要押注指数级增长。
Anything that we can measure seems to be improving really rapidly. Bet on the exponential.
大家好,我是FirstMark的Matt Turck。欢迎收听本期特别节目《MAD播客》,本周我们将庆祝Claude Sonnet 4.5的发布,并邀请到Anthropic顶尖AI研究员Sholto Douglas进行对谈。本次对话将揭秘Sonnet 4.5——这个全球最佳编程模型的诞生内幕,以及当AI智能体连续工作30小时会发生什么。除了产品发布,我们还深入探讨了前沿AI发展、大型AI实验室的运作方式,以及人类在通往AGI道路上的进展。
Hi, I'm Matt Turck from FirstMark. Welcome to a special episode of the MAD Podcast for the release of Claude Sonnet 4.5 this week, with the incredible Sholto Douglas, a leading AI researcher at Anthropic. In this conversation, we go behind the scenes of how Sonnet 4.5 became the best coding model in the world, and what happens when you let AI agents work for thirty hours straight. Beyond the launch, we talked a bunch about frontier AI, how big AI labs operate, and how we are well on our way to AGI.
应我的要求,Sholto用平实的语言而非专业术语,拆解了强化学习、计算机使用(computer use)和AI基准测试等关键概念,使这次对话非常易懂。请享受与Sholto的精彩对话。Sholto,欢迎你。
At my request, Sholto made this conversation very approachable by breaking down a lot of key concepts, such as reinforcement learning, computer use, and AI benchmarks in plain English without the jargon. Please enjoy this great chat with Sholto. Sholto, welcome.
最近怎么样?很高兴来到这里。
How are you doing? Great to be here.
恭喜发布Sonnet 4.5,这是本周的重磅新闻。准备这次访谈时我回顾了一下,Anthropic的发布节奏让我震惊——特别是Sonnet 3.7那次重大发布。嗯...如果当时有人问我,我可能会说'那是去年的事'。
Congratulations on the release of Sonnet 4.5, which is the big news of this week. I was just looking back as I was prepping for this, and I was struck by the pace of releases at Anthropic. In particular, Sonnet 3.7, which was, like, this huge deal. Mhmm. At the time, in my mind, if you had asked me, I would have said, oh no, that was last year.
但实际上那才发生在二月份。是的。该怎么理解这种发布节奏?这是不是进步加速的标志?
But in fact, it was just in February. Yep. What's the right way to think about that pace of releases? Is it a proxy for progress accelerating?
是的,我认为这代表了几件事。一是现在出现了双范式机制:以前我们只做预训练扩展,而现在我们基本处于预训练扩展与强化学习扩展的混合状态。因此我认为这提供了更多更新模型的机会,因为这意味着你可以沿着多个前沿取得进展。这样一来,最终发布更新的频率自然就更高了。
Yeah. I think it's a proxy for a couple of things. One is that there's now this two-paradigm regime, where previously you did pretraining scaling, and now we're basically in a mix of pretraining scaling and reinforcement learning scaling. And so I think that gives you more opportunities to update models, because it means you can make advancements along multiple frontiers. And then that means that, yeah, you sort of just end up shipping more frequently.
我认为这也反映出,现在距离ChatGPT发布已有两年多时间。后ChatGPT时代的投资周期终于开始显现效果,比如计算资源的可用性正在提升等等。这意味着我们应该预期进步速度会加快——毕竟芯片投产存在时间滞后。即使去年你急需芯片,也不可能得到,因为台积电的产能早就被预订一空。所以今年才真正开始进入计算资源的超级周期。
I think it's also a reflection of the fact that this is now two and a half years after ChatGPT. And so the post-ChatGPT investment cycle is finally hitting, where compute availability is increasing and all of this. And so it means that you should actually expect the pace of progress to be increasing, because there are lead times in commissioning chips, basically. So even as much as you wanted chips last year, it would have been impossible to get them, because TSMC was, you know, booked out and so forth. So finally, this year is where the compute super cycle is, like, beginning properly in effect.
对。
Yeah.
好的。为了让听众更好地了解情况,请问Sonnet、Opus...还有Haiku是吗?能否为我们讲解这些模型之间的区别?
Okay, great. Maybe for situational awareness for people listening to this: there's Sonnet, there's Opus, there's Haiku, right? Maybe walk us through the differences between those models.
好的。我们发布的模型分为三个等级:Opus是最智能的模型,Sonnet属于中端型号,Haiku则是速度最快、成本最低的模型。这次发布有个有趣的现象——Sonnet实际上比Opus更聪明。这种情况之前也发生过。
Yeah. So we release models along three tiers: Opus, which is the smartest model, Sonnet, which is the mid tier model, and Haiku, which is the fastest, cheapest model. One of the interesting things about this most recent release is that Sonnet is actually smarter than Opus. This has happened before.
事实上去年就出现过。这反映了快速进步——因为训练中端模型比大模型成本更低。所以你会在小型模型上取得很多进展,最终需要选择何时进行扩展以获得规模效益。但有时进步太快,导致你的中端模型已经足够出色。
In fact, it happened last year. It's a reflection of fast progress, because it's cheaper to train mid tier models than large models. And so what happens is that you end up making a lot of progress on smaller models. Eventually, you need to choose when to scale up and get the benefits of scale in a model. Often, you make progress fast enough that your mid tier model is great anyway.
甚至可能超越之前开发的大规模模型。我认为这也部分体现了强化学习范式的特点——你可以通过强化学习来训练和扩展一个模型。这样就能让中端模型达到六个月或三个月前大型模型的水平。
And it's actually better than the large scale-up model that you did previously. And I think this is also a little bit of a reflection of the reinforcement learning paradigm, where you can take a model and extend it with reinforcement learning, basically. So that allows you to take a mid tier model and make it as good as a larger tier model from six months or three months ago.
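The idea of "extending a model with reinforcement learning" can be sketched in miniature. This is an illustrative toy, not anything from Anthropic's actual stack: a categorical policy starts from fixed "pretrained" logits, and REINFORCE updates shift probability mass toward rewarded behavior. All names, rewards, and numbers are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_finetune(logits, reward_fn, steps=2000, lr=0.1):
    """Policy-gradient fine-tuning of a categorical policy (REINFORCE)."""
    logits = logits.copy()
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choice(len(logits), p=probs)
        r = reward_fn(action)
        # grad of log pi(action) w.r.t. logits = one_hot(action) - probs
        grad = -probs
        grad[action] += 1.0
        logits += lr * r * grad
    return logits

# "Pretrained" policy slightly prefers action 0; the reward favors action 2.
pretrained = np.array([1.0, 0.0, 0.0])
reward = lambda a: 1.0 if a == 2 else 0.0

tuned = reinforce_finetune(pretrained, reward)
print(softmax(pretrained)[2], softmax(tuned)[2])  # rewarded action becomes dominant
```

The pretrained logits are the starting point, and RL compute is spent purely on shifting behavior, which is the sense in which a mid tier model can be "extended" rather than retrained from scratch.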
好的。在我们深入探讨这些细节之前,我对你的故事很感兴趣。当然。你加入Anthropic的经历,以及你现在在Anthropic的工作,能否描述一下你的角色。
Alright. So before we go into all of this in greater detail, I was curious about your story. Sure. Your journey to Anthropic, and then what you currently do at Anthropic, how you'd describe your role.
是的。那么你想让我从什么时候开始讲起呢?你知道的,是从
Yeah. So how far back do you want me to start? You know, from
从头开始讲。是的。
From the beginning. Yes.
从头开始。好的。有几点要说。一是在澳大利亚长大,你能选择的传统职业道路非常有限。你可以成为律师、医生,或者进入金融行业。
From the beginning. Yeah. So a couple of things. One is that growing up in Australia, there's a very traditional set of paths you can take. You can you can become a lawyer, you can become a doctor, or you can go into finance.
澳大利亚在很多方面都是一个很棒的国家。尤其是生活质量非常高,这意味着人们往往会选择这些默认的道路,过上美好的生活。我在某些方面非常幸运。我妈妈实际上对自己的抱负感到沮丧。因此,这意味着我一生中都有这样一位完美的导师。
Australia is a wonderful country in so many ways. In particular, the quality of life is so high that people just, like, choose these default paths and have a fantastic life. And I was very lucky in some ways. My mom was actually, like, she was frustrated in her ambitions. And so this meant that I had the perfect mentor throughout my entire life.
她学医,后来在南非从事急诊医学,但始终无法如愿进入公共卫生领域。她想要做的是系统性变革和公共卫生工作。但在当时,这对女性来说非常困难。所以,她把全部注意力都放在了我身上。这很棒。你知道,我小时候在中国交换学习时,她给了我一份厚厚的资料,关于中国的政治经济以及当前创业生态系统中的不同参与者等等。
She studied medicine, went on to do emergency medicine in South Africa, but wasn't ever quite able to break into public health in the way that she wanted to. Like, she wanted to do systemic change and public health. And at the time, that was just very difficult for a woman. So instead, I had her full attention. So this was great. You know, growing up, when I did an exchange in China, I got this dossier, this thick, on China's political economy and different actors in the current startup ecosystem and this kind of stuff.
所以我得到了这种持续推动的教育,方式非常支持且美好。我还幸运地接触到了击剑运动。通过击剑,我体验到了通过不断努力成为世界顶尖的感觉。我最好的成绩是世界排名第43位。这在一定程度上——我认为很大程度上——得益于拥有一位世界顶级的教练和完美的指导。
So I had this wonderful, constantly driving education in a really supportive and wonderful way. I also was lucky enough to get into fencing. And through fencing, I had the experience of becoming one of the best in the world at something via repeated effort. I became top 50 in the world, forty-third at my best. And it was partially a consequence of, well, I think in large part due to having a coach, and perfect mentorship, that was one of the best in the world.
他移居澳大利亚是因为妻子是罗马尼亚人。他刚执教完意大利队获得奥运会金牌,穆夫斯·斯特雷利奇的妻子是罗马尼亚人。她在意大利遭受歧视。所以我一方面得到了完美的学术指导,另一方面又有顶尖的运动训练指导,同时还有一个试验场——就像从小在YouTube上观看这些人,然后成为某个领域的世界顶尖。
He moved to Australia because his wife was Romanian. He had just coached Italy to the gold medal in the Olympics, Mufus Strelich, and his wife was Romanian. She was facing discrimination in Italy. And so I had, on the one hand, perfect academic mentorship, and on the other hand, perfect athletic mentorship, and a proving ground: growing up watching these people on YouTube and then becoming one of the best in the world at something.
这算是很早就接触强化学习了——'做这个,别做那个'。某种意义上说,是的。
An early introduction to reinforcement learning, like, do this, don't do that. In some ways, yes.
或者通过其他方式,比如你可以观看YouTube上这些人的视频,分析他们的行为如何成就了现在的自己,然后复制这个过程。是的。而且你可以成为那个世界的一部分。唯一需要的就是付出巨大的努力。
Or in other ways: an introduction to, you can watch these people on YouTube, analyze what they're doing to become who they are, and replicate that. Yep. And you could be part of that world. All it takes is intense amounts of effort.
没错。我发现一个令人着迷的现象:在所有领域,YouTube的根本性影响在于,无论你看哪个领域,现在的孩子似乎都比上一代强得多。对吧。完全正确。
Yeah. It's a thing that I find fascinating: across any field, the fundamental impact of YouTube, and the fact that regardless of the field you look at, every kid seems to be just much better than the prior generation. Right. Right. Exactly.
我不知道是否有相关研究,但至少这是你的亲身经历。
I don't know if it's been studied, but at least that's your your experience.
我认为AI领域也会出现同样的情况。对吧?就像每个人现在都能获得完美的导师。我后来在AI领域再次体验到了这一点。击剑并不是我想长期从事的事业。
And I think I mean, I think we should see the same thing with AI. Right? Like, in the same respect, everyone will now get a perfect tutor. I then actually had that experience again with AI. Fencing wasn't something I wanted to do ultra long term.
我想尝试冲击奥运会,然后转向科技领域工作。很幸运我读到了Gwern关于规模化的文章,他在文中详细阐述了规模化假说。读完我就想,天啊,未来十年AGI的发展显然将成为世界上最有意义的工作领域。这是我们推动世界进步的最大杠杆。
I wanted to take a shot at the Olympics and then try and progress into working in technology, basically. I was very lucky to read a Gwern essay on scaling, where he basically details the scaling hypothesis. After reading that, I was like, oh my god, clearly AGI progress over the next decade is going to be one of the most meaningful things to work on in the world. It's the largest lever we have to meaningfully advance the world.
于是我开始在夜晚和周末自己做研究。当时的情况是,
And so I started doing my own research on nights and weekends. And as, like,
这是你本科阶段的一部分吗?
Was this part of your undergrad, then?
对,差不多是本科最后一年以及之后的一年。
So this is, like, last year of undergrad, like, and the year after.
本科期间你学的是
And in undergrad, you did
我主修计算机科学机器人方向。从小我就对埃隆·马斯克这类人物充满憧憬,梦想着造火箭和特斯拉。但当时并不清楚具体想解决什么问题。
I did computer science and robotics. Computer science. And I sort of vaguely, I grew up, you know, looking up to Elon Musk and this kind of stuff. You know, I wanted to build rockets and Teslas. But I didn't have a concrete idea of what actual problem I wanted to solve.
读到那篇论文成了关键转折点——意识到AGI可能在本十年内实现,这似乎是世界上最有意义的研究方向。我开始思考如何证明自己应该投身其中。当时我正在研究机器人操控技术,于是尝试在卧室里规模化机器人操控训练,建立通用机器人基础模型,虽然规模很小。
Reading that essay was the critical hinge of: okay, AGI is possible this decade. It seems like the most meaningful thing in the world to work on, and I need to figure out how I can demonstrate that I should be working on this. Mhmm. I was at the time working on robotic manipulation stuff. And so I started working on scaling up robotic manipulation, trying to train general foundation models for robotics from my bedroom, at a small scale.
虽然现在有很多机器人通用基础模型公司,但那时还算早期。我自己搭建模拟器,收集大量遥操作数据训练模型,还从谷歌借用了TPU。后来谷歌有人注意到我的工作,邀请我加入他们。这很幸运,因为我当时没能进入心仪的博士项目。
There are a lot of general foundation model companies for robotics now. It was a little bit early then, but, you know, I rigged up my own simulator, collected a lot of teleoperation data, trained models, got a loan of TPUs from Google. Eventually, some people at Google noticed the work I was doing and said, hey, this is great work, would you like to come work with us? It was actually very fortuitous because, for example, I didn't get into the PhD programs that I wanted to.
本科毕业后,我申请了几所这里的博士项目但没被录取。不过我非常幸运,当时做的工作与谷歌的理念高度契合,所以他们主动联系了我,这真是...
I applied to a couple of PhD programs here after undergrad and didn't get in. But I was very lucky that the work I was doing really resonated with Google, and so they reached out. Which is
这个概念很有趣——你曾遭遇学术道路上的阻碍,却依然能取得如今的成就。对于AI研究领域外的人来说,似乎学术上最优秀的人就能胜出,对吧?但这是否意味着学术卓越与成为杰出的Anthropic研究员是两回事,需要略微不同的特质?
a fascinating concept: that at some point you could have had an academic roadblock but still succeed to the extent that you are currently succeeding. You know, for people maybe outside of the AI research world, it sort of feels like whoever is the smartest academically wins. Right. But does that suggest that being great academically and being a great Anthropic researcher are two different things, and you need slightly different qualities?
我认为二者高度相关,但学术界用来筛选的信号往往存在偏差——实际具备研究才能的人远比拥有那些能让你进入学术生涯下一阶段的'正确信号'的人多。比如在美国,本科生就可能做出能发表NeurIPS或ICLR论文的研究,而在澳大利亚几乎不可能。记得Pieter Abbeel来我们澳洲实验室时,问在场谁要去NeurIPS,结果没人举手,连博士生都没有。
I think they're very highly correlated, but the signals that are usually used to gate academia, like, there are dramatically more people that satisfy the criteria of being really effective than there are that have the correct signals that would enable them to progress to the next stage of an academic career. For example, if you're here in the US, you end up doing research as an undergrad that can get you a NeurIPS or ICLR paper. Whereas in Australia, it just isn't the case. Right? I remember Pieter Abbeel actually once visited our lab in Australia and asked people to put their hands up if they were going to NeurIPS, and no one put their hands up, not even the PhD students.
这说明缺乏关键的导师指导环节,导致无法培养对重要问题的判断力,自然就缺少能展现学术潜力的信号。实际上我们现在看重的信号已非传统博士学位之类(当然这很有用)——
So it means that you don't have, again, that mentorship aspect that is so important. And so you don't get a chance to develop problem taste on the things that matter. And therefore you don't have the correct signals that indicate you would have high potential in academia. I actually think that right now, a lot of the signals we look for aren't, you know, a traditional PhD or anything like this. I mean, that is obviously very useful.
最直接的信号往往是看到有人以独立方式完成了大量出色工作的博客文章,这是最有含金量的信号之一。比如Anthropic性能团队负责人之一Simon Boehm,他发表了迄今最优秀的GPU CUDA矩阵乘法(matmul)优化指南,这就是标杆级的作品。
But the fastest route, or the most immediate one, is whenever we see a really good blog post where people have done an incredible amount of work in an independent fashion. It's one of the highest signal things there is. One of the examples I love to use here is this guy called Simon Boehm, who's one of the leads on the performance team at Anthropic. He's published, to date, the best guide on how to optimize a CUDA matmul on a GPU. Mhmm. It is simply the world's best CUDA matmul guide.
嗯。还没有人为注意力机制(attention)写过同样的指南。对吧?如果有人写了,我们第二天就会发出面试邀请。对吧?
Mhmm. No one has done this for attention. Right? If someone did this for attention, then, I mean, we would reach out with a job interview offer the next day. Right?
确实。前几天有人为TPU写了类似指南,我们立即决定发出面试邀约。这说明关键在于主动性和判断力——通过独立创造世界级成果来证明自己。
Yep. And in fact, someone did it for TPUs, and for some kind of attention, the other day, and we were like, this guy. Immediately, let's send out a request for an interview. So I think what it really comes down to is agency and taste, and there are ways to demonstrate this, usually by producing a world class artifact in some independent fashion.
是的。再稍微聊聊这个话题和YouTube上的讨论,你看到AI研究领域的人才储备了吗?无论是学术界认可的,还是更偏向独立研究的?这个群体在增长吗?是的。好的。
Yep. And to tie back to this conversation and the YouTube discussion a little: do you see the pool of talent in AI research, whether academically sanctioned or more indie, is that growing? Yes. Okay.
没错。我认为这个群体正在急剧扩大。同时我觉得我们在人才培养方面做得相当不错。比如Anthropic就吸纳了许多初级人员,通过系统化的方式将他们培养成了非常出色的研究员和工程师。所以这个领域绝对是在扩张的。
Yeah. So I think it's growing quite dramatically. I also think we have done a pretty good job of growing people. I mean, I think Anthropic has taken many junior people and grown them into really fantastic researchers and engineers in quite a deliberate way. So I think it's definitely growing.
那么谷歌注意到了你,然后发生了什么?所以
So so Google noticed you, and then what happened? So
谷歌注意到我后,我入职的时间大概比ChatGPT发布早一个月。这个时机加入谷歌其实非常理想,因为整个公司突然被迫即刻反应,通过Gemini项目参与竞争。这意味着原有的指挥体系完全不适合这场特殊战役——毕竟Gemini并不是一个现成的组织。
Google noticed me. I started at Google, I think, like a month before ChatGPT or something like this. So it was actually a fantastic time to start at Google, because the entire company was suddenly forced to react instantaneously and compete, through the Gemini program. It meant that the typical command structures and everything were not well suited for that particular battle. You know, it wasn't a preexisting org.
当时在决策自主权方面存在巨大空白,我们需要快速确定行动方向、组织人员攻关重点事项。这让我在Gemini初期既获得了与核心团队紧密合作培养技术品味的机会,又很快承担起多个领域的领导职责。比如我们当时根本没有适配现代大语言模型的推理架构。
You know, Gemini was sort of forged out of the merging of Brain and DeepMind. It meant that there was just a huge gap in terms of agency, really: figuring out what we needed to do, doing it as fast as possible, organizing people to work on important things. And so I ended up, one, getting the chance to develop a lot of taste by working closely with people in those early months of Gemini. But two, I also quickly got the opportunity to step up and lead various parts of this. One example of this is that we just didn't have an inference stack that was at all sensible for the modern world of LLMs.
我们不得不从零开始设计。现在你们在SGLang等框架中看到的很多设计,都是我们当时从第一性原理推导出来的。这个新推理架构仅在前六个月就节省了数亿美元成本,也让我赢得了解决复杂技术与社会政治双重难题的信任。因为推理架构问题既是重大技术挑战,也是社会政治挑战——原有系统的所有权分散在五六个团队之间。
And so we had to design one from scratch. A lot of the things you now see in, you know, the SGLangs of the world are things that we had to derive from first principles at that point in time. And we wrote the inference stack. This ended up saving several hundred million dollars, I think, even over the first six months, and it meant that I was then trusted to solve both hard technical and sociopolitical problems. Because one of the interesting things about the inference stack problem was that it was both a really large technical challenge and a large sociopolitical one, because ownership of the preexisting stack was distributed across five or six different teams.
这意味着推动变革异常艰难,是个多维度的挑战。这段经历使我后续得以负责机器学习栈其他领域的类似问题。比如在开展推理攻关项目时,我主管研究基础设施,构建了支持大规模强化学习和推理的共享代码库。
And it meant that actually enacting change was quite hard. So it was a challenge along multiple dimensions. Mhmm. That then led to me having the trust to solve problems of this form across other parts of the ML stack. For example, later on, when the thinking effort was started, you know, the reasoning push, I was in charge of research infrastructure for that, getting us an RL code base that could actually allow us to do large scale RL and reasoning and this kind of stuff.
很好。然后是转到Anthropic的事。接着
Great. And then the move to Anthropic. And then
转到Anthropic。所以这次转职是在二月份。是的。我认为这背后有几个动机。首要原因是,公司里每个人都深切关心未来走向,这让我非常振奋。
the move to Anthropic. So the move to Anthropic was in February. Yeah. And I think it was motivated by a couple of reasons. The number one reason is that I'm just really excited by how deeply every single person in the company cares about how the future goes.
嗯。让我印象深刻的是,Anthropic的每位成员都能清晰阐述他们的工作如何助力更美好的未来——无论是开发能改善人们生活的AI,还是创造更安全、可控、与人类文明利益一致的AI,或是更深入理解AI内部机制以预测发展轨迹,甚至包括政策倡导。在政策层面,公司展现出了极强的推动力。
Mhmm. And I think that's one thing that really struck me: everyone at Anthropic has an articulated theory of why what they're working on contributes towards a better future. Whether that's AI that is better in ways that can help people improve their lives, or AI that's more safe and controllable and aligned with our civilization's interests, or even just more deeply understanding what's actually going on inside these AIs, to better forecast the progress curves and where we think we're actually headed. Or policy. Anthropic has been such a strong advocate on policy in many ways.
这个视角很有趣。从外部观察这些大型AI研究实验室时,难免会有疑问:它们究竟有多大差异?似乎每个团队都才华横溢,拥有同等资源。
It's fascinating as a thought. Again, seen from the outside of the, you know, big AI research labs, a little bit of a question is: how different are those? It seems that everybody is incredibly smart. Everybody has access to the same resources.
大体上,人们研究的问题方向相近。然后你会看到某个模型问世表现优异,一周后另一家实验室又推出更优的模型。但根据你的经验,确实能感受到文化和目标上的实质差异。
Directionally, more or less, people seem to be focusing on the same problems. And then you see one model come out, and it's better. And then, you know, a week later, there's another model that comes out from another lab, and it's better than the prior one. Right. But from your experience, you see real differences in terms of culture and goals.
是的。比如DeepMind绝对是解决科学问题的最佳平台。我认为DeepMind对AI推动科学发现的贡献将超越任何机构。
Yeah. I think there are. For example, DeepMind, if you wanted to solve science, is the best place in the world. Mhmm. Like, I think that DeepMind will directly contribute to more scientific discoveries from AI than anything else. Right?
毫无疑问。它在各方面都为此做好了准备:既有AlphaFold、材料研究等直接科研项目,也有培养AI科学家的宏大计划。而Anthropic则始终聚焦于两个核心方向。
Like, absolutely. I think it's just so well set up to do this across every aspect. Right? You've got both the direct scientific efforts, like AlphaFold and the materials science work and this kind of thing, and also generally large efforts to make AI scientists and all that. Whereas I think Anthropic has been laser focused on two things.
一个是长期AI对齐问题,另一个是短期经济影响。因此Anthropic一直聚焦于编程和计算机应用等领域,我们认为这些将在未来六个月内对经济产生直接影响。值得注意的是,与DeepMind和OpenAI相比,Anthropic明显没有专注于数学推理领域。
One is long term AI alignment, and two is near term economic impact. So Anthropic has been laser focused on coding and computer use and things that we think will make a direct impact on the economy within, like, the next six months. You know, one thing that Anthropic noticeably hasn't focused on, compared to DeepMind and OpenAI, is mathematical reasoning.
嗯。
Mhmm.
对吧?DeepMind和OpenAI致力于数学推理是因为它对科学进步的意义,而且我觉得很多人对数学怀有深切热爱,渴望见证该领域的发展。我们不得不忍痛放弃这方面的研究重点,部分原因是希望集中精力实现模型的短期经济影响。其他维度的研究也大多围绕这个目标展开。
Right? DeepMind and OpenAI have been pursuing mathematical reasoning because of the implications for science and scientific progress, and because, I think, so many people just love math so deeply and would love to see the field progress. We've had to reluctantly sacrifice a focus on that. It's partly for many reasons, but we want to focus on near term economic impact with the models. And much of our research along other dimensions is, yeah, focused on that.
我们稍后详细探讨这点。不过在此之前,你多次提到'品味'这个词
Let's double click on this in in a minute. But before we do that, you mentioned a couple of times the word taste
是的。
Yeah.
这是2025年的重要词汇之一。在AI研究中,'品味'意味着什么?
Which is one of those important words in 2025. What does taste mean when it comes to AI research?
我曾与一位生物学朋友就此进行过有趣讨论,比较生物学研究与机器学习中的品味。我认为最关键的是从机制层面准确理解研究目标,并保持重要的简洁性约束。在ML领域,品味往往是决定哪些内容该纳入不完美信息条件下大型训练的关键要素。
Yeah. I had a really interesting discussion about this with a biology friend. We were comparing taste across biology research and ML. I think one of the most important things is mechanistically understanding exactly what you're trying to do, and having an important simplicity regularizer. When you think about taste in ML, it's often the crucial ingredient that allows you to decide what goes into your large training run when you have imperfect information.
因为我们可以深入研究架构变更的具体表现及其影响。但超过一定规模后,你就只能猜测这种改变是否会与其他因素产生复合效应或冲突。毕竟你无法像全量运行那样反复测试,机会只有一次。
Because we can study very deeply what the impact of an architectural change is. Right? But past a certain level of scale, you have to guess whether the impact of that change will compound with other ones, or whether it will conflict. Because you can't test your full scale run n times, right? You only have one shot at that.
所以很多判断力体现在能否准确推断——这个方向最终是否能带来规模效益递增。同时也涉及这个研究方向是否值得投入?因为机器学习基线模型往往调优得非常好,即便理论上更优的方法也很难超越,毕竟要让一个机器学习方法奏效需要太多小技巧。这些技巧可能因各种原因失效,不像建桥工程那样能清楚知道某个剪力出现的原因。
And so a lot of taste comes from being able to make good inferences about whether we think this will ultimately deliver increasing returns to scale. It also comes down to: do I think this direction of research is worth pursuing? Because often our baselines in ML are so well tuned that it's very hard to beat them even with what is theoretically a better method, because there are so many small tricks required to make a machine learning method work. And they can fail for any number of reasons. It's not like building a bridge, where you actually have a pretty good idea of why a particular shear was introduced.
可能存在各种意外情况。因此判断是应该坚持这个方向还是放弃尝试其他方法,这又是品味问题。我认为最终都回归到简约性原则——人们总爱炫技,我们都不例外。这大概就是'苦涩的教训'要表达的:几代研究者设计了各种精巧方法,将他们认为人工智能应具备的推理先验编码进去...
It can be all these quirks. And so knowing whether it's right to push along that direction or to give it up and try something else is another question of taste. And I think it always comes back to the simplicity regularizer. People love to be clever. We all do. And the bitter lesson, I think, is maybe the best expression of this, where generations of people have developed clever methods of encoding priors about how they think an artificial intelligence should reason. Yep.
...将这些先验编码进模型。但所有这些最终都被规模效应抹平——通过搜索和学习这两个基础要素的规模化实现。
And encoding it into the model. And all of this gets wiped out by scale, basically through search and learning, and scale applied to those two things.
是的,这个'苦涩的教训'指的是理查德·萨顿那篇著名文章
Yes. And the bitter lesson being the Richard Sutton essay
没错。
That's right.
AI领域人尽皆知,不过可能不是所有人都...对,不是每个人都听说过。正如你描述的:泛化能力和计算力终将胜出。
Which everybody in AI knows about, but not everyone may have heard of it. Which is exactly what you described, this idea that generalization and compute will win over time.
是的,正是如此。能够利用计算能力的框架。没错。特别是搜索和学习将消除所有那些微调。
Yes. Exactly. Frameworks that can take advantage of compute. Yeah. And in particular, search and learning will wash away all those tweaks.
我想我可以提供几个例子来让这个更具体些?卷积神经网络之所以更有效,原因之一是它们编码了一种先验知识。卷积神经网络在很多方面,你可以想象成一个小方块在图像上滑动,也就是说,相邻的像素是相互关联的。这是一个非常合理的先验假设,对吧?
I think I can offer a couple of examples of this to make it more concrete. One of the reasons that convolutional neural networks were more effective is that they encode a prior. A convolutional neural network, in many ways, you can think of as a little square being drawn across an image, so to speak: nearby pixels are related to each other. This is a very sensible prior. Right?
因为如果你把一张图片扔给AI模型,却不告诉它任何关于世界的知识,它必须学会相邻的像素形成曲线,然后构成其他事物。所以这里有一个抽象的层次结构。但最终,并非所有图像都符合这一点。因此在达到一定规模之前,卷积神经网络对绝大多数图像的处理效果会比更通用的视觉变换器更好。但超过某个临界点后,实际上你需要能够灵活整合整个图像的信息。
Because if you throw a picture at an AI model and you don't tell it anything about the world, it has to learn that nearby pixels form curves that then form other things. And so there's this hierarchy of abstractions. But ultimately, that is not true of all images. And so convolutional neural networks will be better than a more general vision transformer for the vast majority of images, up to a certain amount of scale. But past a point, actually, you need to be able to flexibly integrate information across the entire image.
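The locality prior described here can be made concrete with a few lines of NumPy. This is an illustrative sketch, not production code: one small kernel slides across the image, so "nearby pixels are related" is baked into the architecture, and the shared 3x3 kernel needs far fewer parameters than a dense map with no such prior. The kernel and image values are made up.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: the same small weight square reused everywhere."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each output looks only at a local neighborhood of the input
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36.0).reshape(6, 6)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # a simple vertical-edge detector

features = conv2d(image, edge_kernel)
print(features.shape)  # (4, 4)

# The prior shows up in parameter counts: one shared 3x3 kernel is 9 weights,
# while a fully dense map (no locality prior) from 6x6 to 4x4 needs 36 * 16.
conv_params, dense_params = 3 * 3, 36 * 16
print(conv_params, dense_params)
```

Past a certain scale, the model that is free to relate any pixel to any other (the dense or attention-style map) can learn the locality pattern where it holds and ignore it where it doesn't, which is the bitter-lesson dynamic the conversation describes.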
语言方面类似的例子可能是:我们对语法了解很多。所以你可能想把句子分解成构成要素,比如动词、名词以及它们之间的关系等等,并将这种显式结构提供给AI算法。但当你想要模型写诗或写代码时会发生什么?突然间这些假设必须被抛弃,因此你无法在诗歌、代码和写作之间实现泛化。
And a similar example in language might be, well, we know a lot about grammar. And so you might actually want to decompose a sentence into the constituent, you know, the verbs, the nouns, how they relate to each other and so forth, and provide that explicit structure to your AI algorithm. But then what happens when you want the model to write poetry or to write code? All of a sudden, these assumptions have to be thrown away, and so you can't generalize across poetry and code and and writing.
关于品味讨论中艺术与科学的部分。那么你的意思是,至少在预测训练过程可能如何发展方面,更多是直觉而非具体数字?所以你可以用具体数字来预测到一定程度...
And to the taste discussion, the art versus science part of this. So are you saying that, at least in terms of anticipating how the training run may go, it's more intuition than actual numbers? So you can do actual numbers up to
没错。打个比方来说明或思考这个问题:你正在多个规模层级上测试一个系统。实际上,与生物学的类比是——你可能会在细胞、小鼠和模式生物中测试新疗法,但这并不能保证它在人类身上有效,对吧?
a point. And the way to illustrate or think about this is that you are testing a system at multiple levels of scale. And actually, the analog to biology: if you think about it, you might test a new therapeutic in a cell, in mice, and in model organisms. But that's not a guarantee that it will work in humans. Right?
所以你在不同规模和不同模式生物上进行测试。如果它在基础单细胞细菌中有效,在小鼠中有效,可能在猴子中也有效,这很大程度上表明它将在人类身上有效,但并非绝对保证。因此这时你需要理解这个东西运作的底层机制。
So you test across multiple different scales and multiple different model organisms. And it seems to work in basic single cell bacteria, it seems to work in mice, maybe it works in monkeys. That gives you a lot of indication it's going to work in humans, but there's no guarantee. So at that point, you need to understand the underlying mechanisms of how this thing works.
比如,它结合的是哪些受体等等。在机器学习中,情况完全一样,对吧?你有不同的模型规模,然后你发现,好吧,这些模型规模确实带来了效益,我认为它应该能行得通。
Like, what receptors is it binding to, and so forth. In ML, it's exactly the same. Right? You have your different model scales, and you figure out, well, okay, it's delivering benefits at these model scales, and I think it should work.
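The "test at small scales, then infer" workflow looks roughly like fitting a scaling law. A hedged sketch with invented numbers: measure loss at several small compute budgets, fit a power law in log-log space, and extrapolate to the full-scale run you only get to do once.

```python
import numpy as np

# Pretend measurements at small compute budgets (FLOPs); the exponent and
# coefficient here are made up purely for illustration.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = 3.0 * compute ** -0.05

# A power law L(C) = a * C^(-b) is linear in log-log space:
# log L = log a - b * log C, so a straight-line fit recovers a and b.
slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(log_a), -slope

# Extrapolate to the one-shot full-scale budget.
full_scale = 1e24
predicted = a * full_scale ** -b
print(round(float(predicted), 4))  # about 0.1893
```

The extrapolation is only as good as the assumption that the power law keeps holding three orders of magnitude out, which is exactly where the mechanistic understanding and taste discussed above have to take over from the numbers.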
从机制上讲,我理解这对模型学习动态的影响。这样你就能有信心它会奏效。但如果只是说‘哦,这是个取巧的办法’,我们并不真正理解其原理,那就真的很复杂了。
Mechanistically, I understand what this is doing to the learning dynamics of the model. And then you can have confidence that it's going to work. But if it's like, oh, it's a hack, we don't really understand how it works, it's really complicated,
它在代码中引入了这么多东西,然后
It introduces all this stuff in the code, then
在Anthropic这样的公司,或者一般来说,一个想法失败的频率有多高?是的。
How often does an idea fail in a company like Anthropic, or in general? Yeah.
或者一般来说。我想这里有个很好的例子,我曾经问过Noam Shazeer这个问题。他说,大概10%的想法能行得通。那可是Noam啊。
Or in general. I mean, I think a good example here is that I once asked this question of Noam Shazeer. And he was like, yeah, maybe, like, 10% of my ideas work. And that's Noam.
嗯,对吧?你知道的,他是绝对的天才,这个领域里最顶尖的人物之一。是的。
Mhmm. Right? You know, an absolute genius, one of the best in the field. Yeah. Yep.
所以如果只有10%的想法能成功,那我认为这为想法成功的比例设定了上限。大多数想法都行不通。
So if only 10% of his ideas work, then I think that, you know, establishes a bound on the percentage of ideas that work. Most don't.
这同样是Anthropic、DeepMind等机构成功的一部分原因——鼓励人们不断尝试。我是说,那些实验成本非常高昂,对吧?
And it's part of the success, again, of a place like Anthropic or DeepMind, to just encourage people to experiment again and again. And, I mean, those are very expensive runs. Right?
就像,
Like, the
更直白地说,巨额资本涌入这些公司的核心原因在于算力成本高昂。而我好奇的是文化层面的张力——既要因巨额资金压力而必须交付成果,又应当保持开放思维大胆探索。
the bigger just to state the obvious, the big reason behind the massive amounts of capital going into those companies is that compute is expensive. And I'm curious about, you know, culturally, the tension between, you need to deliver because there's so much money at stake, versus, no, you should have a free, open mind and just go for it.
这正是我们在Anthropic和DeepMind努力营造的:安全实验的文化氛围,让研究人员能长期自由探索。因为验证全新研究方向往往需要数月的独立研究。难点其实不在于实验的算力成本,而更多在于时间和注意力的投入——毕竟在当前架构和范式下仍有大量可挖掘空间,或者说存在许多唾手可得的成果。
Oh, it's one of the things that at both Anthropic and DeepMind we really tried to build, which is a culture of safe experimentation, where people are trusted to explore ideas for a long time out in the wild. Because you often need months of independent research to really prove out a novel research direction, one that is substantially different. It's hard. Particularly, I think it's hard less because of the compute cost of the experiments and more because of the cost of time and focus, because there are so many remaining wins, I guess, even in the current architectures and paradigms and everything. Or, well, there's so much low hanging fruit. Right?
比如最高效的时间利用方式可能是仔细分析数据,思考模型的学习机制并微调——有时最简单的调整就能带来巨大提升。因此给予团队喘息空间很重要:虽然存在短期可实现的改进,但我们更希望开发具有普适性的基础技术。这本质上是在规模化实践与探索性尝试之间寻求平衡。
Like, a really high ROI use of your time would probably be to just go and look at the data, think hard about what the model is learning or doing, and make some tweaks. Even the simplest things in the world will still deliver massive gains. And so giving people the time and space to breathe and say, well, we know that there are short term things you could be doing, but actually we want to try and develop a more general or fundamental technique that allows you to do this scalably in the future, is important. There's this tension between doing things at scale and doing things that don't scale. Right? Mhmm.
是否有Anthropic或其他机构的研究者正在深度探索完全不同的路径,比如非Transformer架构、非强化学习方向?
Are there people at Anthropic or other places that are deeply researching completely different avenues? So, non transformers, non RL?
是的。这体现了Anthropic与DeepMind的差异:我们押注非常明确,认为通用人工智能(AGI)在未来几年内触手可及,且当前范式不会发生根本性颠覆。
Yeah. I think this is another way in which Anthropic and DeepMind differ a little bit. Anthropic is a very focused bet. We think that, you know, AGI is within reach in the next couple of years, and that it's the current paradigms, or something not crazily dissimilar to them.
或许有些新东西,但并非如我们所想的那样——外面并没有什么疯狂的研究项目。对吧?实际上过去五六年,Anthropic的核心理念一直是基于现有技术体系扩展算力,认为通用人工智能(AGI)在这些框架内是可实现的。DeepMind拥有更广泛的科学文化,因为它具备这样做的资源。对吧?
Maybe there's something new, but it's not like we think there's some crazy out there research program. Right? Really, for the last five or six years, Anthropic's ethos has been scaling compute with broadly the current set of techniques; AGI is tractable within those bounds. DeepMind has a much broader scientific culture because it has the resources to do so. Right?
Anthropic必须专注押注。DeepMind则有时间和空间去赌那些完全超出当前范式的东西。我认为这取决于你想解答什么问题——究竟是高度专注的押注,还是广泛探索各种新颖架构更好。这某种程度上体现了研究理念的差异。
Anthropic has to be a focused bet. DeepMind has the time and space to say, we're happy to bet on something that is really far outside the current paradigm. And it depends on which kind of question you want to ask, whether you think the really focused bet or the wide exploration of different and novel architectures is better. That's, yeah, one of the sort of research ethos differences.
非常有趣。
Very interesting.
不是说Gemini本身是个非常专注的赌注。但如果你把Gemini看作一千人的团队,DeepMind还有上万人在进行各种长期基础研究。
Not to say that Gemini like, Gemini itself is a very focused bet. Yeah. But if you look at Gemini as, like, you know, a thousand people, there are still 10,000 plus people doing all kinds of really long term foundational research at DeepMind.
明白。回到你之前提到的点,为什么Anthropic如此专注于编程?
Yep. Got it. So closing the loop on on something that you mentioned earlier. Why is Anthropic so focused on coding?
我们专注编程有两个原因。首先,这能最快帮助我们辅助AI研究——就像自动化AI研究这个概念。
Yeah. We're really focused on coding for two reasons. The first one is that we think it is, how should I say, the thing that will allow us to assist ourselves in AI research fastest. So there's this notion of, like, automating AI research, right, and that work.
我们认为进步速度的关键信号在于AI辅助AI研究的能力。预先布局这点很重要。其次,编程是短期内最具经济影响力的可行领域。Anthropic要持续研究重要课题,必须获得经济回报。嗯。
We think that one of the most important signals of basically the speed of takeoff, the speed of progress, is how much AI is able to assist AI research. And so prefetching this is really important, we think. Secondly, we think it's the nearest term tractable problem domain in terms of economic impact. For Anthropic to be a viable research program that can research the things that we think are important requires economic return. Mhmm.
编程是一个巨大的市场,充满了真正热衷尝鲜的早期使用者,他们喜欢尝试和切换工具,对玩弄新工具充满热情。这个领域的需求极其庞大,全球对软件的需求远超优质软件的供给。我们在编译器、网页指令等每一轮技术迭代中都见证了这一点。软件的需求如此蓬勃,你知道的,确实如此。
And coding is a huge market full of people who are really, really keen early adopters, who love trying and switching things, who are really excited to play with new tools. There's massive demand. There's dramatically more demand for software in the world than there is, you know, good software. We've seen that in every previous iteration of, you know, compilers and general web abstractions and so forth. There's such a booming demand for software. And so, like, you know, I mean, yeah.
本质上,模型在编程领域比其他领域更早展现出优势,因为编程在某些方面是个独特且易于处理的问题——就我们拥有的数据和技术而言。你可以容器化并行运行代码,执行单元测试来验证结果。
Basically, the models are better at coding earlier than anything else because coding is a uniquely tractable problem in some respects for the techniques that we have, in terms of the data that exists, in many ways. You can containerize and run things in parallel. You can run unit tests, and so you can verify.
当它有效时你自然知道
Something you know when it works and
谁知道什么时候
who knows when
它会失效
it doesn't work.
没错。要知道自动驾驶是格外困难的领域,对吧?你必须确保车辆能正常运作
Yep. You know, self driving is uniquely hard. Right? You need the car to work
是啊。
Yeah.
第一次。这有点像或者说
First time. It was kind of or like
是的。
Yes.
而编程时,模型可能会失败上百次。只要成功一次,那就没问题了。所以这里存在一种可操作性,一种可重复性,这在其他以某种方式触及现实世界的领域是不存在的。比如,你不会想要一个AI律师来为你的案子辩护。对吧?
Whereas coding, the models can fail a 100 times. As long as it succeeds once, then that's fine. So there's this tractability, this replayability, that doesn't exist in other fields that, like, touch the real world in some ways. Like, you don't want a lawyer arguing your case that is an AI. Right?
是啊。因为如果它把案子搞错了怎么办?抱歉。太糟糕了。太糟糕了。
Yeah. Because if what if it gets the case wrong? Sorry. Too bad. Too bad.
随着技术的发展,编程具有独特的可操作性。这一点你可以看到。对吧?比如,已经有人——我自己在使用AI工具编写代码时,生产力显著提高了。嗯。
As techniques develop, coding is is uniquely tractable. It is and you can and you can see that. Right? Like, already people are dramatically I, myself, am dramatically, like, higher productivity when I'm using the AI tools to to write code. Mhmm.
我有个朋友同时管理九个Claude代码,这简直是个疯狂的数字。是啊。我不知道他是怎么做到的。我只能应付两个?所以这可能是我技能不足的问题。
And I have a friend who, like, manages nine Claude codes, which is just like a crazy number. Yeah. I don't know how he does that. I can I can only handle two? So it's maybe like a skill issue on my behalf.
不过好吧。那么SONNET 4.5号称是世界上最好的编程助手。嗯。也许可以为我们详细解释一下,包括它在SWE bench基准测试上的表现。是的。基准测试。
But alright. So Sonnet 4.5 is presented as the best coding agent in the world. Mhmm. So maybe unpack that for us, including performance on the SWE-bench Yeah. Benchmark.
具体数字是多少?有哪些事实依据?然后我们再探讨其运作原理。
What are the numbers? What are the facts? And then we'll go into how that works.
目前SWE-bench是业界衡量编码进展的通用基准,所有公司都用它来相互比较。虽然它在很多方面并不完美,比如其中约50%是基于某个特定网页框架的测试。但它确实采用了真实工作场景中的案例,比如提交一个拉取请求(pull request),对代码库做出修改。
So SWE-bench is the current benchmark for how we measure coding progress in the outside world, which all the companies use to evaluate against each other. It's an imperfect benchmark in many ways. Right? It is, like, fifty percent one particular web framework and this kind of thing. But what it does do is it takes real world scenarios of work that people have done, like submitting a pull request, so a change to a code base.
所以这些都是GitHub上的实际案例。
So that's stuff that's on GitHub.
没错,都是GitHub上的真实案例。这个基准会检验模型是否能完成相同的拉取请求并通过相同测试。嗯。这相当于是对软件工程师几小时工作量的一个不错的标准参照。
Stuff that's on GitHub. Right. And it checks whether or not the model is able to do that same pull request and pass the same tests. Mhmm. And this ends up being a pretty decent proxy for a couple hours of work from a software engineer.
明白吗?这些改动不算特别复杂,但具有合理的工作量,大约需要几小时完成。我们最近在SWE-bench的得分从72分左右提升到了78分,这是个相当大的进步。值得一提的是,就在一年前,整个行业的水平还不到20%。嗯。
Alright? These changes aren't incredibly complicated, but they're, you know, reasonable complexity, a couple hours of work. We moved recently from roughly 72 to roughly 78 on SWE-bench, which is a pretty substantial step up. I think it's worth pointing out that as recently as a year ago, I think we were under 20% or something like that as a field. Mhmm.
这显示出模型处理软件工程师基础工作的能力取得了巨大进步。虽然SWE-bench存在很多缺陷,而且可能已接近饱和点。观察AI基准测试有个有趣现象——当模型能力达到一定高度后,这些测试就失去了区分度。但我们的模型不仅是当前最先进的...
So there's been dramatic progress on the ability of models to do this, like, unit of work that a software engineer does. I think that SWE-bench is imperfect in a lot of ways, and it's probably pretty close to what we call saturated. You know, one interesting thing about AI benchmarks is you see they lose their utility past a point, because they no longer, like, disambiguate the differences between different models of high capability. But the models, one, are SOTA.
它们在SWE-bench上全球领先,这是个不错的参考指标。但更让我们兴奋的是客户和合作伙伴的热情反馈。比如Cognition团队发现模型如此实用,以至于他们不得不重构整个系统架构来适配。确实如此。
They're the best in the world on SWE-bench. It's a decent proxy. We're also, I think, more excited by the fact that a lot of our customers and partners are really excited by the model. So one example of this is the Cognition folks with Devin found the model so useful, they had to, like, rebuild their architecture around it. Yeah.
是的。他们在那里有一篇很棒的博客文章。
Yeah. They had a great blog post on there.
他们有一篇关于这个的精彩博客文章。发布了。是的。对吧?我认为衡量一个模型是否优秀,真正的标准在于它能否让人们做到以前无法做到的事情。
They had great blog post on this. Came out. Yeah. Right? And I think that's the real measure of whether or not people whether or not a model is good is whether or not it enables people to do things that they couldn't do before.
嗯。实际上,整个编程领域在过去一年里已经以这种方式发生了变革。让我们回顾一年前,看看3.5 Sonnet,这是第一个真正强大的代理式编程模型。第一个你可以要求它在眼前执行任务,并能与你的代码库和计算机互动的模型。在很多方面,正是这个模型促成了Cursor的产品市场契合。
Mhmm. And really coding as a whole has been transformed in this way over the last year. So let's roll back a year and we look at 3.5 Sonnet, which is the first really strong agentic coding model. The first model that you could ask to do something in front of you and it sort of was able to interact with your code base in your computer and do it. In many ways, this model is what caused the PMF for Cursor.
Cursor随着3.5 Sonnet像火箭般腾飞,因为他们占据了正确的位置,能够利用那个模型提供一种前所未有的编程体验。嗯。嗯。然后实际上,Cognition和Windsurf瞄准了更雄心勃勃的目标。基本上,这里存在一个代理能力的谱系,你可以让它完成三十秒的工作,或者几分钟的工作。
Cursor took off like a rocket with 3.5 Sonnet because they were in the right place and they were able to capitalize on that model as offering a coding experience that didn't previously exist. Mhmm. Mhmm. And then, actually, Cognition and Windsurf went for an even more ambitious target. So basically, there's like a spectrum of agency here, where you can ask it to do thirty seconds of work or a couple of minutes of work.
Windsurf作为一家公司,部分是通过更激进地押注3.5 Sonnet的代理能力而成立的。然后进入今年
Windsurf was made as a company in part by betting more aggressively on the agentic abilities of 3.5 Sonnet. Then roll into this year
是的。顺便说一句,这对创业界的任何人来说都是一个关键教训。在2025年,就是要押注模型六个月后能实现什么。
Yeah. Which, just as a quick aside, is one of the key lessons for anybody in the startup world Yes. in 2025, which is bet on what the models will be able to do six months from now.
对。完全正确。押注指数级发展。所以我认为很多编程初创公司现在都在问自己,面对这些能够独立追求目标、持续时间远超以往模型的系统,他们现在能做什么?要知道,以前你必须每三十秒监督一次模型。
Right. Exactly. Bet on the exponential. So I think something that a lot of coding startups are now asking themselves is what can they now do with models that are capable of independently pursuing goals for substantially longer than previous models? You know, before, you had to supervise the models every thirty seconds.
随着时间的推移,在接下来的几个月里,你可能会逐渐进入一种只需每隔十分钟、二十分钟左右监督一次模型的状态。根据任务的复杂程度,这将是一个相当巨大的变化。我们甚至有几个例子,我记得在博客文章中提到过,我们要求它构建一个类似聊天应用的东西,比如Slack之类的。结果模型就这样持续工作了三十个小时。
Over the next couple of months, you're probably gonna end up in a situation where you only need to supervise the models every, you know, ten minutes, twenty minutes or so. That's a pretty dramatic change, depending on the complexity of the task. We even have a couple of examples. I think it was mentioned in the blog post where we asked it to build something that looks roughly like a chat app, you know, something like Slack. And the model just worked for thirty hours.
就像是在电脑上连续运转了三十个小时,最终产出了一个非常完善、类似Slack或Teams的应用。这简直令人难以置信。目前市面上任何现有产品都远未达到这种程度。或许Cognition公司——我认为他们一直押注于更持久、更自主的智能代理系统——现在可能正是他们产品市场契合度爆发的时刻。
Like, it was just spinning there on a computer for thirty hours and came out with a really good working Slack like, you know, Teams like app. It's pretty incredible. Yeah. That is nowhere near built into any of the existing products. Maybe Cognition is Cognition, I think, has always bet on a longer running, more independent, like, you know, agentic suite, and maybe this is the moment that really hits PMF for them, for example.
没错。我们来详细探讨下这个三十小时运作的方面,这确实很吸引人。首先为了让听众有更具体的概念,这是计算机在...
Yeah. Yeah. Let's unpack the thirty hour aspect, which is fascinating. So first of all, to just ground it for people, is this a computer use
应该只是在编写代码。
Just coding, I think.
对,纯粹是编码。那么这个智能代理三十小时里具体在做什么?是在不断点击操作吗?
Just just yeah. Just coding. So what does the agent do for thirty hours? It's it's clicking on stuff?
是的。它在那里读取文件、编写代码、运行测试——完全模拟人类的工作方式。
Yeah. It is there. It's reading files and it's writing code Yep. and running tests, in exactly the same way that a human would.
基本上你可以把这个模型想象成在一个循环中运行。它能持续决定下一步行动。人们常提到'工具使用'这个概念——顾名思义,在这里它能够使用诸如读取文件、写入文件等工具,或者在终端运行代码。它就坐在计算机终端里循环运作,不断检查当前代码并判断:'这部分功能还不完善,接下来我要处理这个问题'。
Basically, you can think of the model as running in a loop Mhmm. where it can constantly decide what to do. People often mention something called tool use. And tool use is, well, it's in the name, but in this case, it can use tools like read file, write file, etcetera, or run code in the terminal. And it is sitting there in a terminal on a computer, in a loop, constantly looking at the current code, deciding, oh, well, it can't quite do this yet, so I'm gonna work on that next.
它经常制定计划,特别是持续运行三十小时的那种。我们对近期发布的产品感到相当满意的一点是,终于教会了模型使用所谓的记忆功能。我们已将此功能整合到AgenTic框架中,因此它能创建待办事项的Markdown文件,标记自认为重要的任务,逐一完成并核查进度。这几乎形成了一种自我验证的循环机制。
It's often making plans, particularly to run for, you know, thirty hours. One of the things that we're pretty happy with about the recent launches, we've we've finally taught the models to to use what's called memory. And so and we've built that into the AgenTic harness. So it's able to create a markdown file of to dos and things that it thinks are important to do, check them off and work on them and check whether they've been completed. There's almost this, like, self verification loop.
大约一年前左右,人们对语言模型的担忧之一是它们会偏离轨道,无法自我修正。嗯。这基本上会破坏其实用性。我认为当前一代智能体的显著特点或许在于它们能够自我修正。事实上,它们的自我修正能力惊人地出色。
One of the things that people were worried about with language models over, like, I think a year ago or so was that they would fall off track. Like, they wouldn't be able to self correct Mhmm. And that this would basically ruin the utility. I think one of the things that may be remarkable about the current generation of agents is that they can self correct. In fact, they're astonishingly good at self correcting.
嗯。这种涌现出的能力确实非常有用。
Mhmm. And this, like, emergent ability has been pretty helpful.
没错。这里有很多内容需要梳理。我记得你之前提到过两个维度,一是原始智能,二是智能体能持续运作的时间。那么用最简单的话说,根本性突破是否在于:只要运作时间更长,就等于拥有了一个可以持续工作的超级智能?是的。
Yep. So much to unpack on this. I think I heard you speak about two axes in the past, one being raw intelligence and the other one being how long an agent can operate. So is the fundamental breakthrough, in very simple terms, that if it can do it longer, you basically have a very smart AI Yeah. that can just work longer?
正是如此。如果能保持长期连贯性,模型就能完成那些在单次思维流中不可能实现的任务。比如让你一次性写出Slack或Microsoft Teams的可用版本,你肯定做不到。对吧?是的。
Yeah. Exactly. If you can maintain long term coherency, then the model is able to do things that it couldn't possibly have done otherwise. You know, if I asked you to, in a single stream of thought, write a working version of Slack or Microsoft Teams, you wouldn't be able to do it. Right? Yep.
你必须坐下来做笔记,建立这种闭环反馈系统。所以长期连贯性至关重要,我们认为这对智能体而言是核心要素。目前衡量这个的最佳方式可能是看METR评估结果。没错,它们应该是我现阶段最青睐的评估标准。
You have to sit there and take notes, like, do this closed loop feedback system. So long term coherency is really important, and it's something that we think is just really critical for this. I think a good way of measuring this is to look at the METR evals. Yep. They're probably my favorite eval at the moment.
这个评估的特别之处在于:研究者选取了人类执行的一系列任务(特别是机器学习或编程领域的),标注了人类达成优质表现所需时长,然后要求AI模型完成。他们发现任务完成进度与AI能持续工作的时间范围存在强关联。现在AI能处理任务的时间范围每隔几个月就会翻倍,增长速度惊人。
And what this eval is, is they've taken a bunch of tasks which humans do, particularly in, like, the machine learning or programming context, and they've annotated how long it takes a human to achieve strong performance on those tasks. And then they ask AI models to do them. And what they found is that there's this really strong relationship between progress and the time horizon over which the AIs are able to complete tasks. And so I think it's like every couple of months, the time horizon that the AIs are capable of is doubling or something like that. It's something crazy.
或许每六个月,时间范围就会翻倍,这简直疯狂到难以置信。是的,没错。不过和所有基准测试一样,这个也不完美。
Maybe maybe every six months, the time horizon doubles, which is just it's like it's utterly insane. Yep. Yep. Yeah. Now again, like all benchmarks, this one is imperfect.
对吧?它只衡量相当简单的任务,我记得成功率大概只有50%左右,而不是99%。但这是个很好的方向性指标,也符合我的亲身体验——最近使用这些模型时,我总觉得只要设置得当,完全可以让它通宵运行,第二天早上就能收获些有用的成果。
Right? It only measures pretty simple tasks. It only measures, I think, a 50% success rate or something like that at the task, not, like, a 99% success rate. But it's a good directional measure, and it certainly resonates with my own experience. You know, as I've been using the recent models, I start to feel that if I just set everything up right, I could leave this overnight, and it could just churn away, and it would probably have something pretty useful for me in the morning.
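The doubling claim is easy to sanity-check with back-of-envelope compounding. The 1-hour baseline and 6-month doubling period below are illustrative stand-ins, not METR's published figures:

```python
def horizon_after(months: float, baseline_hours: float = 1.0,
                  doubling_months: float = 6.0) -> float:
    """Task length (hours, at ~50% success) after `months` of progress,
    assuming the time horizon doubles every `doubling_months` months."""
    return baseline_hours * 2 ** (months / doubling_months)

# Starting from 1-hour tasks, three years of 6-month doublings:
print(horizon_after(36))  # prints 64.0
```

That compounding is the whole argument for "bet on the exponential": even a modest doubling period turns hour-scale tasks into multi-day tasks within a few years.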
嗯。有哪些任务是需要三十小时才能完成,而短时间运行做不到的?
Yeah. What are some examples of tasks that you can do with thirty hours that you could not do with shorter runs?
比如类似Slack这样的软件就是个好例子,这是个完整的端到端可运行软件,不是最小可行产品演示。机器学习实验这类也很有意思——需要系统能提出实验方案、写代码、运行初步测试,后续还能迭代优化。
Yeah. I think in this case, the the, like, Slack like thing is a pretty good example, where it's a a significant piece of software. It's really like an end to end working piece of software, which often takes a bit of time, like not an MVP demo. Other things I think are interesting is, like, your machine learning experiments and stuff like this are pretty interesting. You want something that's able to propose an experiment and write a bit of code, run, you know, some initial tests, come back later, etcetera.
这确实极大地拓展了可能性边界。关键区别在于产出的是真正可用的软件,而非演示原型。
Really, it opens up the world, yeah, pretty dramatically. Great. Basically, working software rather than demos, I think, is the is the key thing. Right.
太神奇了。
Fascinating.
当然不是说现在这些模型就能直接生成完整可用的软件,比如它不可能给你做出个Slack的竞品。
Now, I'm not saying that the models will, like, spin you up full working software right now. Right? Like, it's not gonna spin you up a Slack competitor.
是的。虽然你们团队制作的这个claude.ai演示相当...相当酷。确实。
Yeah. Although this claude.ai demo that you guys produced was pretty It's pretty cool. Yes.
对吧?我认为我们正在...
Right? I think what we're
对于没看过的人来说,这个演示展示了模型的演进过程,以及如何从几乎不可能...是的,最初只能画些简笔框架,到现在能完全复制出功能齐全的网站。没错,整个过程都是由AI自主完成的。
For people who haven't seen it, it shows the progression of the models, and how replicating the website Yeah. went from basically Impossible. Yes. caricaturing it, like, doing wireframes, to now doing a fully functional website Yeah. built autonomously by the
对。AI。它还包含了一些相当复杂的功能。比如'Artifacts'功能,模型可以编写代码,然后代码运行结果会直接显示在网页浏览器里。
Yeah. The AI. And it's got some pretty complex features. Like, you've got Artifacts. Artifacts is a feature where the model is able to write code, and then the results of that code are displayed in the web browser.
在这个案例中,模型完整复制了claude.ai网站,包括所有Artifacts功能和其他组件。
And in this case, the model replicated claude.ai with artifacts, with everything else.
我记不...
I can't
不太清楚具体花了多长时间,可能几个小时吧。但基本上,可以把这个看作是AI蹒跚学步的阶段。你知道的,有时候能成功,有时候会失败。
quite remember how long that one took, maybe a couple hours to do. But basically, regard this as the first halting steps of this. You know, it kind of works. Sometimes it won't work. Sometimes it will.
在接下来的六个月到一年里,预计这里会有巨大进展。是的。看看我们现在的位置与一年前相比,差异之大,我预期会有同样的飞跃。
Over the next six months, over the next year, expect dramatic progress here and Yeah. Like, look at where we are now versus where we were a year ago, and the difference is I expect the same jump,
基本上。让我们深入探讨这个突破性部分。Sonnet 4.1 我认为能运行长达七小时。而这次是三十小时,虽然我明白并非所有任务都能达到,但这是上限。你提到了记忆演化、上下文理解以及自我纠正能力。
basically. Let's double click on the breakthrough part of this. So Sonnet 4.1, I think, was able to run up to seven hours. In this case, it's thirty hours, which I realize is not across all tasks, but that's the upper limit. You alluded to some of this memory evolution, this context, this ability to self correct.
或许可以更详细解释一下实现三十小时飞跃的技术进步。
Maybe maybe explain in in greater detail the the advances that enabled that jump to thirty hours.
我想这里最关键的问题,或者说我们经常自问的是:什么阻止了模型更长时间工作?或者说何时需要干预?我挺喜欢以特斯拉为例的干预模型。因为目前你需要相当频繁地干预,但通常是在品味问题上,而非原始编程能力。
I mean, I think the biggest things here, or the question that we often ask ourselves, is what is preventing the models from working for longer, basically? Or when do you need to intervene? And I quite like the model of interventions in, like, a Tesla sense as an example. Because right now, you need to intervene quite frequently, but it's usually on questions of taste rather than questions of, like, raw programming ability.
并不是说模型在决定做正确事情时无法执行。但有时候模型会走捷径,有时会忘记它们正在处理的整体结构,在上下文中迷失。它们做出的局部合理修改,在全局目标背景下其实并不合理。因此我认为很多已实现和待实现的改进,都集中在品味和上下文理解上。
It's not like the model is unable, when it's decided to do the right thing, to do it. But, you know, sometimes the models take shortcuts, and sometimes the models forget the, like, overall structure of what they're doing, and they sort of lose themselves in the context. Right? They're making a locally sensible change, but it doesn't actually make sense in the global context of what they're trying to achieve. And so I think a lot of the improvements, both that we've made and that are still to go, are on this, like, taste and context, basically.
关键在于让模型更擅长决定程序整体结构的智能选择,避免走捷径,写出合理优质的代码。
It's on making the model better able to decide smart things about the overall structure of the program that it was gonna do and and, like, not take shortcuts and and, like, write sensible and good code.
那内存方面呢?
What about memory?
记忆同样至关重要,因为你知道,模型最终会耗尽上下文。因此,能够随着时间的推移管理和维护记忆,甚至可以说从经验中学习,这些都可能对此大有裨益。你不希望模型不断重新发现某个系统或代码库如何运作的事实。而这个问题实际上涉及品味或所谓'苦涩教训'的讨论——因为可以想象,我们可能会投入巨大努力来教授代码模型编程品味,这或许是解决品味问题的一种方式。
Memory is also very important because, you know, the models do eventually run out of context. And so being able to manage and maintain memory over time, and, I suppose, even learning from experiences, is something which would probably help this a lot. You don't want the model to be constantly rediscovering facts about how a particular system or code base works. And this is actually one of those areas where the question of taste, or, like, the bitter lesson, comes up, because you can imagine us going and launching a massive effort to teach the models coding taste. And that could be one way that you solve, you know, taste.
要知道,大量人类软件工程师会判定什么是好代码或坏代码。那么软件工程中的品味从何而来?我们所谓的品味通常体现在:它能让你轻松地在后期进行修改,或者便于多个智能体与你沟通协作。比如好的抽象设计往往意味着,你我可以共同开发一个代码库而不会产生冲突,对吧?
You know, heaps of human software engineers decide, well, no, this is good or this is bad or whatever. Where does taste come from in software engineering or what do we regard as taste? It's typically that it's able to, like it easily sets you up to make changes later on or so on and so forth, or it's easy for maybe multiple agents to communicate with you and collaborate. Like, often good abstractions are something that, you know, you or I could work together on a code base and and not conflict with each other. Right?
这就引出一个问题:你是应该通过让软件工程师判定优劣来教授模型编程品味,还是应该创建一个模型社会,让它们共同开发一个庞大的单体代码库——如果它们发生争执就说明设计不佳?你可以想象这个策略光谱,而选择正确的策略确实是个难题。
And so there is this question of how much do you focus on teaching the model coding taste via getting software engineers to decide what is good or bad, you know, or should you be creating, like, a society of models that all have to code a giant monolithic code base together, and, like, you know, if they're arguing, then it's bad. You can sort of imagine the spectrum of potential strategies, and picking the right one there is, like, a difficult thing.
回到去年模型甚至今年模型的性能飞跃,特别是从Sonnet 4.1到4.5的进步,再次印证了发展速度在加快的观点。有哪些关键突破...这部分我确实不便详谈。
Going back to the jump in performance from last year's models, or even this year's models, or even actually Sonnet 4.1 to 4.5, again, to the point about the pace of progress accelerating, what were some of the breakthroughs? That, I can't really talk about.
是的。我认为关键要认识到这并非某个单一突破,而是整个技术栈上众多环节持续改进的结果,很大程度上也是算力发展的函数。虽然存在个别突破,但根本上进步是相当平稳的。
Yeah. I mean, I think it's important to recognize that it's not one individual breakthrough, really. It is the continuous application of lots of different things across the entire stack by many people, and, like, mostly just a function of compute in many ways. There are obviously individual breakthroughs, but fundamentally, progress has been pretty smooth.
就像在METR评估指标上,过去两年的进展可以用直线拟合。这类似于过去的摩尔定律——即使摩尔定律本身也是由无数微小改进组成的,而非某个关键突破。这是在算力这个外生动力推动下,海量工作积累的成果。
Like, on the METR eval, if you look at progress over the last two years, you can plot it with a straight line. Right? And so, similar to, you know, Moore's Law of the past and this kind of thing, even Moore's Law is made up of lots of individual improvements. It's not any one critical breakthrough. It's more the accumulation of a huge amount of work in an environment where there's a sort of exogenous force of compute pushing progress forward.
嗯。
Mhmm.
嗯。好的。那么或许我们可以从更抽象的层面讨论进展,但要基于2025年的实际情况。讨论中很大一部分似乎是从预训练到强化学习(RL)的演变,这一点我们多次提及。谈谈RL的影响,为何RL如今成为讨论的重要部分?
Mhmm. Okay. So maybe let's talk about progress at a more abstract level, but grounded in 2025. So a big part of the discussion seems to have been the evolution from a focus on pretraining to RL, which we touched upon a couple of times. Talk about the impact of RL, and why is RL such a big part of the conversation today?
对于听众来说,理解预训练与RL的高层次区别,预训练就像浏览所有现存教材的概要,而RL则是动手解题并获取对错反馈。实际上有许多知识只能通过RL习得。举个典型例子——学会回答‘我不知道’这项技能。因为在预训练中,你只是在建模文本预测,试图推测接下来可能出现的词句,比如互联网上所有文本资料。
For those listening, a good way to understand, at a high level, the difference between pretraining and RL: pretraining is like skim reading every textbook in existence, and RL is like doing the worked problems and getting feedback on whether you were wrong or right. And there are actually a lot of things that you can only learn via RL. A good example of this is the skill of saying, I don't know, in response to a question. Because in pretraining, remember, you're trying to predict what text is gonna come next in, you know, all of these textbooks, the entire Internet in the world.
所以预训练模型说‘我不知道’的唯一理由,是它认为当前建模的文本角色可能会这么说——因为这属于合理续写。重点不在于模型是否真不知道,而在于它判断所模拟的文本角色是否会这样回答。
And so the only reason you would say I don't know as a pretrained model is if you think the character that you're modeling in the text would say I don't know. Like, if it's a likely completion. Yep. Right? Not whether you, in fact, don't know, but whether you think that the sort of character that you've pulled from this cast of characters that you could model would say, I don't know.
而在强化学习中,理论上可以设置测试集:包含模型知道和不知道的内容。当模型正确回答已知内容时给予奖励,对未知内容妄加回答则予以惩罚。这样模型将学会检索内部信息并评估自身置信度。因此解决‘幻觉’问题或学会说‘我不知道’,本质上需要依赖强化学习。这只是众多必须通过RL掌握的能力之一。
Whereas in reinforcement learning, you could, in theory, set up a battery of tests where there are things the model knows and things the model doesn't know, and you could reward it for correctly answering things it should know and penalize it for falsely answering when it doesn't know. And what it will then learn to do is look up information inside itself and assess its own confidence in whether it knows that information. So saying I don't know, or solving, you know, hallucinations, intrinsically requires reinforcement learning in many ways. So that's one example. There's a whole bunch of things you can't otherwise learn.
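The battery of tests Sholto describes can be mocked up as a toy reward function. The questions, the answer key, and the reward values are all invented for illustration; a real setup would score free-form model outputs, not exact strings.

```python
# Hypothetical question battery: facts the model should know vs. questions
# that have no knowable answer.
KNOWN = {"What is the capital of France?": "Paris"}
UNKNOWABLE = {"What number am I thinking of?"}

def reward(question: str, answer: str) -> float:
    if question in KNOWN:
        # Reward correct recall of facts the model should know.
        return 1.0 if answer == KNOWN[question] else -1.0
    if question in UNKNOWABLE:
        # Reward calibrated abstention; penalize confident guessing.
        return 1.0 if answer == "I don't know" else -1.0
    return 0.0  # out-of-battery questions carry no training signal
```

Trained against a signal shaped like this, the reward-maximizing policy is the calibrated one: answer when you know, abstain when you can't, which is exactly what next-token prediction alone never incentivizes.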
我认为在推理模型与语言模型RL结合的新纪元里,去年底发生的重大突破是:语言模型的RL终于开始见效。OpenAI发布的o1堪称首个严肃的RL+LLM成果,值得盛赞。这真正引发了深刻变革——它开辟了新的扩展维度:既有预训练扩展,现在又增加了测试时计算与RL扩展。
I think also an important change in, you know, this sort of era of reasoning models and RL on language models is that at the end of last year, RL on language models finally started to work. And I think OpenAI deserves a lot of credit for, you know, releasing the first serious RL plus LLMs release with o1. And I think this really kicked off a pretty substantial change, because it opened up a new axis of scaling. Right? There was pretraining scaling, and now there's, you know, test time compute and RL scaling.
其实所有研究机构早就在探索这个方向。例如DeepSeek能快速跟进的原因之一,就是他们此前已发布过语言模型RL相关论文。这个理念早已存在,但OpenAI的贡献在于将其具体化、公开化,并首次详细阐述了公开可验证的扩展规律。
I think this is something which all of the research labs were investigating already. And one of the reasons that DeepSeek was able to follow so fast was that they'd actually already released papers in the direction of doing RL on language models before, for example. So it was already an idea in the air. But OpenAI deserves the credit for, you know, crystallizing it, releasing it, and detailing, you know, the first public scaling laws.
为了继续这个极具教育意义的讨论,能否展开讲讲测试时计算与RL如何...
And maybe, to continue making this super educational, how do test time compute and RL
哦,对了。它们如何互动?如何重叠?
Oh, yeah. Interact? Overlap?
是的。
Yeah.
没错。理解这个问题的一种方式是,测试时的计算在进行大量推理,而强化学习则是关于这些推理正确与否的反馈信号。因此,测试时的计算是一种回答对你来说难以解答问题的方式。嗯。比如说,我问你一个你立刻就能回答的问题,就像你手背上的东西一样熟悉。
Yes. One way of thinking about this is that test-time compute is doing a lot of reasoning, and then RL is the feedback signal on whether or not that reasoning was right or wrong. And so test-time compute is a way of answering questions that are hard for you to answer. Mhmm. Let's say I ask you a question that you just know off the cuff, like the back of your hand, basically.
这不是来自你真正了解的领域或你已经掌握的某种启发式方法。你已经把这些内化成了所谓的肌肉记忆。但对于需要你真正思考和学习的,比如你刚开始学数学时,如果我现在问你一个基本的乘法表,你可以脱口而出。但如果你是个孩子,你得一步步计算,需要展开推理链来学习它。
It's not like it's from a field that you really know, or some heuristic that you've already worked out. You've already baked that into, like, your muscle memory, so to speak. But for something which requires you to really think and really learn, like when you're first doing math: if I ask you a basic times table right now, you can say it off just like that. But if you're a kid, you have to, like, do out the math and all this kind of stuff. You need to do the reasoning chain to learn it.
然后你会得到反馈,无论是对是错。所以,测试时的计算让你能解决比你目前能脱口而出的更困难的问题。而强化学习则让你能将这种能力提炼回模型中。这就像梯子一样,你可以不断解决稍微更难的问题,因为你正在学习解决越来越难问题的策略。
And then you get feedback on whether it's right or wrong. So test-time compute lets you do harder problems than you can currently do off the cuff. And RL then allows you to sort of distill that back into the model. It's almost like a ladder. You can, like, constantly do slightly harder problems because you're learning strategies to do harder and harder and harder problems.
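A minimal sketch of why spending test-time compute helps: in this toy simulation, each sampled reasoning chain independently solves a hard problem with low probability, and a verifier keeps any correct chain. The probabilities and trial counts are made up for illustration only.

```python
import random

def solve(p_success: float, n_chains: int, rng: random.Random) -> bool:
    """Sample n reasoning chains; a verifier accepts if any chain is correct."""
    return any(rng.random() < p_success for _ in range(n_chains))

rng = random.Random(0)
trials = 2000
p_hard = 0.1  # chance a single sampled chain solves the problem
one_chain = sum(solve(p_hard, 1, rng) for _ in range(trials)) / trials
many_chains = sum(solve(p_hard, 32, rng) for _ in range(trials)) / trials
print(one_chain, many_chains)  # many chains give a far higher verified success rate
```

With 32 chains, the expected verified success rate is 1 - 0.9**32, roughly 0.97, versus 0.1 for a single attempt. The "ladder" is then RL distilling those verified successes back into the one-shot policy.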
强化学习并非新概念。我们之前谈到理查德·萨顿,他在这个领域工作了几十年,还有其他研究者也是如此。然后还有AlphaGo,那一系列非常成功、令人印象深刻的基于强化学习的成就。那么,为什么到了2025年,似乎才出现了将这些方法应用于大语言模型的突破?
Reinforcement learning is not a new concept. So we were talking about Richard Sutton, who's been doing work in the field for decades, and others as well. And then there was, you know, AlphaGo, that whole line of very successful, impressive RL-based successes. So why is it that in 2025, there seems to be a breakthrough in applying those to LLMs?
是啊。某种程度上这挺有趣的。很多...好吧。嗯...我该怎么表达呢?
Yeah. In some ways, it's quite funny. A lot of the... okay. Yeah. How do I say it?
以DeepSeek论文为例,他们在论文中详细阐述了两点:一是确实可行的方法,二是许多无效的尝试。实际上,某些失败的方法恰恰是导致AlphaGo成功的关键。关于语言模型强化学习最令人惊讶的是——在已验证奖励机制下的RL,它几乎是最简单的实现方式,简单到让人觉得难以置信。
Let's take the DeepSeek paper, for example. In the DeepSeek paper, they detail, one, an approach that works, and two, a lot of approaches that don't work. Actually, you know, some of the approaches that didn't work were the approaches that led to AlphaGo's success. One of the craziest things about RL on language models in, like, the RL-from-verified-rewards regime is that it's almost the simplest possible thing. It's almost too simple to work.
这又回到了审美趣味的问题:很多人认为这种方法过于简单不可能奏效,于是他们尝试了更复杂的方法,结果反而更难成功。虽然那些复杂方法可能仍有潜力,但最重要的是先掌握这个简单的方法。嗯。
And this, again, comes down to that question of taste, where, really, I think a lot of people thought this was just too simple to work. And so they tried more complex methods that ultimately ended up being harder to get to work. And there may still be juice in those methods. But it was actually really important to nail the simple thing Mhmm. First.
我认为人们最初尝试的RL策略有些过于激进了。同时语言模型本身也需要达到最低质量门槛——只有当模型能真正解决有难度的编程和数学题时,才能形成'这些做对了那些做错了'的反馈循环。
And so I think people were, like, almost too ambitious with the RL strategies that they tried initially. I think there's also a minimum bar in LLM quality that is required. Like, you need the model to be able to solve meaningfully difficult coding and math problems before you can get that feedback loop of, well, you solved these ones right and you solved these ones wrong.
嗯。
Mhmm.
对吧?嗯。还有个反直觉的现象是token推理链:人们长期认为需要特殊设计才能保持模型的长期连贯性。别忘了两年前,8000token对语言模型来说已经是长上下文了。
Right? Mhmm. And I think also one of the, like, maybe unintuitive things is those reasoning chains of tokens. People for a long time thought that you'd need to do something clever to give the model long-term coherency. You have to remember that two years ago, 8,000 tokens was long context for a language model.
两年半前,8000token算很长的。而现在模型能用8000甚至30000token进行推理。这标志着认知转变:语言模型的底层智能已能解决合理难度的问题,其长上下文连贯性远超预期。
You know, two and a half years ago, 8,000 tokens was long. And now models are using 8,000 or, you know, 30,000 tokens to reason about something. Right? So there was this real phase shift of, oh, language models have smart enough underlying priors that they can solve sensibly difficult questions, and they're actually reasonably coherent at longer context than we thought they would be.
这种长token链推理能力可以在适当反馈信号下自然涌现。嗯。这有点反直觉——大多数人不会预见到推理能力能自然形成,过去普遍认为需要人工设计结构。
And this ability to reason in long chains of tokens can emerge naturally with the right feedback signal. Mhmm. And this is a little bit counterintuitive, I think. Most people wouldn't have expected off the bat that the ability to reason would emerge naturally. There was a lot of thought that you'd have to structure it.
你必须为其提供推理策略。你需要构建所有这些内容,比如提示、暗示之类的东西。但实际上发现,其实不需要。你只需给出数学题,告诉它答案对错,模型就能学会。这归根结底是关于规模和搜索的一课——只需让模型进行搜索,提供足够的计算资源运行实验,模型最终会找到非常有效且合理的策略。
You'd have to provide strategies for it to do reasoning. You'd have to build all these things, you know, prompts and hints and this kind of stuff. And actually it turns out, well, no. You just give it math questions, tell it whether it got them right or wrong, and the model will learn. This comes down to a bit of a lesson in scale and search: just allow the model to search, have enough compute to run the experiments, and the model actually ends up figuring out a really effective and sensible strategy.
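A heavily simplified sketch of that loop, assuming the "policy" is just a single probability of choosing a step-by-step strategy rather than a real language model; the only training signal is the verified right/wrong reward, and every name and number here is illustrative:

```python
import random

def verified_reward(answer: str, solution: str) -> float:
    return 1.0 if answer == solution else -1.0   # binary "did you get it right"

def train(questions, steps: int = 500, lr: float = 0.05, seed: int = 0) -> float:
    rng = random.Random(seed)
    p_reason = 0.5   # probability of using the step-by-step strategy
    for _ in range(steps):
        q = rng.choice(questions)
        reasoned = rng.random() < p_reason
        # Reasoning reliably solves the question; guessing is a coin flip.
        answer = q["solution"] if reasoned else rng.choice([q["solution"], "?"])
        r = verified_reward(answer, q["solution"])
        # Nudge the policy toward whichever strategy just got rewarded.
        delta = lr * r if reasoned else -lr * r
        p_reason = min(1.0, max(0.0, p_reason + delta))
    return p_reason

questions = [{"q": "2+2", "solution": "4"}, {"q": "3*5", "solution": "15"}]
p = train(questions)
print(p)  # the policy ends up strongly preferring the reasoning strategy
```

Nothing tells the toy policy to reason; the simple right/wrong signal alone pushes it there, which is the point being made about the simplest thing working.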
嗯。这就是现在正在发生的事,对吧?大型实验室基本上都在投入更多计算资源进行强化学习。
Mhmm. And that's what's happening now. Right? Like, the big labs are basically giving a lot more compute to RL.
对,对。需要最低限度的基础模型质量、最低限度的强化学习计算资源,以及对长期连贯性能力的信任,做那些简单有效的事。这些听起来都显而易见,但实际上有时会有点反直觉。
Yeah. Yeah. There's, like, minimum base model quality, a minimum amount of compute for RL, the sort of trust in the ability for long-term coherency, doing the simple thing that works. These all sound obvious, but they're actually, like, a little bit counterintuitive sometimes.
你之前提到了AGI这个词。那你个人认为,越来越强大的LLM加上强化学习能否让我们实现AGI?是的,我认为足够了。当然,附带问题是‘实现’具体指什么,以及AGI的定义本身。
Well, you mentioned the word AGI earlier. So is your personal sentiment that the combination of ever more powerful LLMs plus RL gets us there? Yeah. I think it's sufficient. With the, you know, obvious side question of what "there" actually means and what AGI means Yeah.
你知道,就目前而言。
You know, today.
对。有几种定义可以采用。我认为一个实用的定义是:在大多数计算机相关任务上比大多数人类更出色。因为这对世界来说是个重要时刻——我们会意识到,智力劳动可以通过这套算法来解决,这将彻底改变世界。当然也存在其他更严格的定义。
Yeah. There's a few definitions that one could use. I think a useful one is: better than most humans at most computer-facing tasks. Because I think that's a really, like, important moment for the world, where we go, okay, intellectual labor is addressable via, like, this set of algorithms, and that totally changes the world. I think there are other definitions that are stronger that you could use.
其中之一是
One of those is
更强。那个确实相当强。
Stronger. That was pretty strong.
是啊。抱歉。我是说,可能更难达到吧。对。
Yeah. Sorry. I mean, like, harder to meet, maybe. Yes.
对。因为即使拥有这个,它可能仍无法像人类那样高效学习。对吧?我们人类能从极少例子中学习并泛化。
Yes. Yeah. Because you could have this, and it could still not learn as effectively as humans. Right? We learn and generalize from very few examples.
我们拥有极高的所谓样本效率。而AI模型需要数百数千倍的经验,基本上相当于数十万次人生,才能学会我们掌握的东西。但经过这数千次人生,它们确实能以极高精度掌握我们的技能。我认为过去一年重要的变化在于,强化学习终于让我们拥有一种算法,能将反馈循环转化为至少在特定狭窄领域不逊于最优秀人类的模型。这在数学和竞赛编程领域尤为明显——这两个最适宜强化学习的领域里,模型正迅速成为极其出色的竞赛数学家和程序员。
We have incredibly high what's called sample efficiency. Whereas AI models need hundreds or thousands of times more experience, hundreds of thousands of lifetimes, basically, to learn the things that we learn. And over those thousands of lifetimes, they do learn the skills that we do, to an incredibly high degree of accuracy. I think one of the important changes over the last year has been that RL has finally meant that we have this algorithm that allows us to take a feedback loop and turn it into a model that is at least as good as the best humans at a given thing, in a narrow domain. And you're seeing that with mathematics, and you're seeing that with competition code, which are the two domains that are most amenable to this, where rapidly the models are becoming incredibly competent competition mathematicians and competition coders.
对吧?竞赛编程和数学本质上并无不同,只是它们特别适合强化学习。更重要的是,它们证明了模型不存在智力天花板。
Right? There's nothing intrinsically, like, different about competition code and math, it's just that they're really amenable to RL, and the same goes for any other domain. But importantly, they demonstrate there's no intellectual ceiling on the models. Right?
只要给予正确的反馈循环,它们就能进行真正艰难的推理。因此我们认为这种方法可推广至人类所有智力领域。只要反馈循环得当,这些模型将达到至少与人类顶尖水平相当的程度。而当你拥有某个领域不逊于最优秀人类的存在时,只需并行运行上千个实例或提速百倍,就能获得远超任何单个人类的智能。
They're capable of doing really tough reasoning given the right feedback loop. So we think that that same approach generalizes to basically all other domains of human intellectual endeavor. Given the right feedback loop, these models will get good enough that they are at least as good as the best humans at a given thing. And then once you have something that is at least as good as the best humans at a thing, you can just run it, you know, a thousand times in parallel or a hundred times faster, and you have something that's actually, even just with that condition, substantially smarter than any given human.
这还完全没考虑能否造出超越人类智能的可能性——这看起来完全可行。毕竟大脑终究是生物计算机,理应能造出更优越的版本。
And this is completely throwing aside whether or not it's possible to make something that is smarter than a human. It seems entirely plausible. Right? Like, you know, the brain is ultimately a biological computer. It seems possible to make a better one.
但其中的影响相当惊人,对吧?这意味着在未来两三年内,只要有正确的反馈循环、足够的计算资源和适当的努力,我们AI行业有望创造出至少在大多数计算机相关任务上与多数人类能力相当、甚至可能媲美各领域顶尖科学家的存在。这确实很疯狂,发展会迅猛且不均衡。
But the implications of this are pretty staggering. Right? Which is that in the next two or three years, given the right feedback loops, given the right compute, given the right, you know, elbow grease and this kind of stuff, we think that we, as the AI industry, are all on track to create something that is at least as capable as most humans on most computer-facing tasks, possibly as good as, you know, many of our best scientists at their fields. This is really wild. It'll be sharp and spiky.
你知道,总会有它做不到的事情之类的例子。但世界将因此改变。
You know, there'll be examples of things it can't do and this kind of stuff. But the world will change.
你对Rich Sutton或Yann LeCun的反向论点怎么看?他们似乎主张需要不同方法或仅依赖强化学习。你如何看待这场辩论?
What do you make of the, you know, the counter-thesis of, again, Rich Sutton or Yann LeCun, who seem to be saying that, like, a different approach is needed, or RL only? What do you make of that debate?
确实。我们的模型学习效率远不及人类,它们需要上千倍的时间来学习。但我觉得这没问题,因为它们可以在模拟环境中或同时为上千家企业工作等方式经历这些学习过程。
Yeah. I think that it's true that our models don't learn anywhere near as efficiently as humans do. Right? They take, you know, a thousand lifetimes to learn. But this is, I think, fine, because they can live those thousand lifetimes, whether in simulations or doing, you know, a job at a thousand firms and so on and so forth.
我认为可能需要区分两个论点:一是架构上Transformer存在不足,我不认同这点。目前尚未发现Transformer在足够数据和算力下无法建模的情况。
I think maybe I would disentangle two arguments. One is that, architecturally, transformers are insufficient. I don't think that's true. I think we haven't yet really found anything that transformers haven't been able to model, provided sufficient data and sufficient compute.
强化学习作为目标函数很有潜力。Rich Sutton其实是强化学习的拥护者,他只是认为我们通过预训练等方式注入了过多先验知识。
I think RL as an objective is a pretty powerful one. Rich Sutton is actually a big fan of RL as an objective. He just thinks we're encoding too many priors with pre-training and this kind of thing.
这并不能充分表征现实世界。
It's not an adequate representation of the world.
是的。这并不是对世界的充分描述。我认为迄今为止的证据表明,我们目前的方法尚未发现任何通过足够努力仍无法解决的问题领域。嗯。嗯。
Yeah. It's not an adequate representation of the world. I think so far, the evidence indicates that our current methods haven't yet found a problem domain that is not tractable with sufficient effort. Mhmm. Mhmm.
而且,是的,如果有某个领域我们投入了大量努力却毫无进展,比如目标标杆整整一年都停滞不前,那我可能会收回我的话,承认这里存在根本性限制。但现实是,每次我们设立衡量重要指标的基准时,相关进展都异常迅速。
And yeah, so the thing that would make me eat my words is if there was some domain that we put a lot of effort into that just didn't move, where the, you know, benchmarks just didn't move and we just couldn't make any progress for a year. Then I would be like, okay. Yeah. There's some fundamental limitation here. Yep. But instead, what I just constantly see is that every time we make a benchmark that measures something we care about, progress is incredibly rapid along it.
我觉得这值得大声宣扬一下——伙计们,任何可量化的指标似乎都在飞速提升。嗯。两三年后我们会达到什么高度?
And, yeah, I think this is, like, worth crying from the rooftops a little bit. Like, guys, anything that we can measure seems to be improving really rapidly. Mhmm. Where does that get us in two or three years?
我无法确定。但值得将其纳入各自的世界观:我们极有可能开发出通用人工智能(AGI)。
I can't say for certain. Yeah. But I think it's worth building into, like, you know, our respective worldviews that there's a pretty serious chance that we get something that is, you know, AGI.
所以你认为人们没有意识到...这很有趣对吧?最近三四个月网上充斥着'我们已触及天花板'的论调,而你本质上在说完全相反的观点。
So you think people don't realize you know, it's always interesting. Right? Because, you know, reading stuff online in the last, you know, three, four months, there's this theme of, you know, we've reached a plateau. But you're basically saying the opposite.
对吧?我们正处于指数曲线上,但许多人没意识到这一点。
Right? We are on an exponential curve, and many people don't realize that's the case.
完全正确。过去三年里,每个月都有人说我们遇到瓶颈了。但看看这三年的成就,简直不可思议。另一个让我确信远未触顶的原因是:观察这些模型的生成过程,每个环节都有巨大改进空间。
Exactly. And I mean, people have said that we're hitting a plateau every month for the last three years. And if you look at how far we've come over the last three years, it's incredible. I think one other thing that makes me think, god, we're not anywhere close to a plateau, is I look at how these models are produced. And every part of it could be improved so much.
就像,这是一个用胶带勉强拼凑起来的原始流程,全靠大家的辛勤努力、加班加点,天啊,这让我想起——不知道这个比喻是否恰当——几个月前我和几个朋友去航海。那艘船设计得如此精妙,明显是历经数千年,或者说数个世纪人类设计与智慧的结晶。我当时就想,哇,原来这就是站在人类集体智慧巅峰的感觉啊。
Like, it is a primitive pipeline held together by duct tape and best efforts and elbow grease and late nights. And, like, god, I remember, I don't know if this is a good analogy or whatever, but I went sailing with a couple of friends a few months ago. And the boat was so well designed. It was just, like, clearly the product of, you know, centuries of accumulated human design and effort. And I was like, wow. So this is what it feels like to sit at the accumulation of a lot of human effort.
对吧?当今最优秀的帆船设计确实很难被超越。嗯。但当我看着LLM训练流程时,那只是两年半时间里临时抱佛脚的产物。每个环节都还有巨大的提升空间。
Right? It's actually pretty hard to beat today's best sailboat designs. Mhmm. But when I look at an LLM training pipeline, it is two and a half years of best-effort, last-minute, desperate effort. And there's just so much room to grow on every part of it.
首先,号称全球最强编程模型的Sonnet 4.5,在经济学研究、金融等其他领域也表现优异。它已经...呃...就这一点来说...
So first of all, Sonnet 4.5, which is described as the best coding model in the world, also seems to be performing across a lot of different other domains, like economics research and finance. So it's already Yeah. Just to touch on one
最让我兴奋的是那个GDPval评估——对,就是OpenAI发布的那个。虽然Sonnet 4.5刚发布还没上榜,但4.1 Opus已经是榜首了。我认为这个评估非常有意思且有价值,因为它展示了模型在...对...
of the things I was really excited by, actually, was that GDPval eval that, yeah, OpenAI released. And I mean, Sonnet 4.5 was only just released, so it's not on there yet, but 4.1 Opus was the leading model there. I think that's, like, a really interesting and really good eval because it demonstrates such a breadth of tasks Yeah.
经济各领域的广泛能力。
Across all parts of the economy.
这是个覆盖经济各行业的评估体系对吧?制造业等领域,他们请专家定义了成功标准。现在模型不仅能测编程等单一任务,而是可以全面评估了。
It's an eval that's across the various sectors of the economy. Right? So manufacturing and so on, and, basically, they took a bunch of experts to describe what success looks like. Yes. And now the models are gonna be able to be measured not across just coding or some limited task, but across everything.
这样...这样算公平吗
Is that is that a fair way
描述它?好的。我很久以来就一直希望有人能这么做。你知道吗?是的。
of describing it? Okay. I've wanted someone to do this for a long time. You know? Yes.
对,对。以劳工统计局为例。我认为政策制定最重要的参考依据应该是:拿劳工统计局的数据,分解所有工作岗位的任务,看看AI模型能否完成这些任务,并衡量其随时间推移的进展。对吧?
Yeah. Yeah. Take the Bureau of Labor Statistics. I think the most important input into policy would be: take the Bureau of Labor Statistics, take all the jobs there, break them down into tasks, and see whether the AI models are able to do them, and measure progress over time. Right?
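A sketch of what such a measurement could look like, with made-up jobs, tasks, and a stand-in grader; a real version would use BLS occupation data and expert-written success criteria rather than these hypothetical names:

```python
# Hypothetical job -> task breakdown; in the real benchmark these would come
# from the Bureau of Labor Statistics and domain experts.
JOBS = {
    "accountant": ["reconcile a ledger", "prepare a tax summary"],
    "paralegal": ["summarize a deposition", "draft a filing checklist"],
}

def evaluate(passes_task) -> dict:
    """passes_task: callable (job, task) -> bool, e.g. an expert grading model output."""
    return {
        job: sum(passes_task(job, t) for t in tasks) / len(tasks)
        for job, tasks in JOBS.items()
    }

# A stand-in "model" that only handles summarization-style tasks:
scores = evaluate(lambda job, task: "summar" in task)
print(scores)
```

Rerunning the same harness on successive model generations would give exactly the progress-over-time measure described above.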
没错。这显然不会是完美的衡量标准。我们可能会在GDP评估上达到甚至超越人类水平,但这在经济层面不会带来实质改变,因为这些只是连接组织,就像那些上下文关联,实际任务并不具有代表性。但我们会找到更好的方法来衡量这些难点,不断推进基准测试。我期待有人做这件事已经很久了。
Yep. And this is obviously gonna be an imperfect measure. We'll probably reach, like, better-than-human on GDPval, and it won't change anything economically, because it'll be all the connective tissue, all the, like, you know, context, and the actual tasks won't be representative. But again, we'll then find better ways to measure these difficulties, and we'll keep pushing benchmarks. I've wanted someone to do this for a long time.
我真的很高兴他们这么做了。特别欣慰我们的模型具有通用性,整体表现强劲,在各个领域都名列前茅。政策制定者真应该研究这个方向,扩展它,真正投入资源来验证我们是否如我所宣称的那样走在正确的轨道上。明白吗?这是可量化的。
I'm really glad that they did it. I'm really glad that our models were general and, you know, generally strong, and sort of showed up top across all the areas. And I think policymakers should really look at this and extend it and, like, really invest in figuring out whether we are on track for what I've been claiming we're on track for. Right? We can measure this.
是的。我们也理应如此。
Yes. And we should be.
没错。那么,关于最后这个主题做个总结——太棒了,一切都令人振奋。我们该怎么做?
Yes. So, yeah, just to to close on that last theme. So awesome. All very exciting. What do we all do?
我们该如何为这个似乎近在眼前的世界做准备?是啊。
How do we prepare for this world that seems to be around the corner? Yeah.
我认为最具行动价值的建议是:持续为一个能让个人获得更大杠杆效应的世界做准备。对吧?现在,我可以用两个编码代理完成相当于之前两倍的工作量。如果编码代理按我预测的方向发展,一两年后,你基本上就能管理一个为你全天候工作的团队。我们应该预期在数字领域,个人将在未来几年获得显著增加的净杠杆效应。
I think the most actionable piece of advice is: keep planning for a world where you as an individual have more leverage. Right? Right now, I can use two coding agents to do, like, twice the work that I could have done before. If coding agents progress in the way I've been saying, in a year or two, you'll be able to manage a team, basically, that works twenty-four seven for you doing work. I think we should expect, in the digital domain, for individuals to get dramatically more net leverage over the next couple of years.
我认为还有诸多极其重要的问题需要关注——我们的世界在太多方面仍不完美。人们仍生活在极度贫困中,健康医疗问题尚未解决,住房问题也完全悬而未决。
I think there are then many, like, incredibly important problems you can attack. Like, our world is so imperfect in so many ways. People still live in dramatic poverty. You know, health and medicine is unsolved. Housing is, you know, completely unsolved.
这个世界在无数方面都有百万倍的改进空间。我希望人们能先利用模型获得对数字世界的掌控力,继而通过机器人技术获得对物理世界的杠杆效应,从而带来翻天覆地的改变。
Like, the world could be a million times better in so many different ways. And what I hope is that people take, you know, initially, models giving us leverage over the digital world, and then hopefully models giving us leverage over the physical one through robotics, to, like, dramatically improve it.
这个愿景正在实现吗?机器人技术似乎是另一个关键主题。但另一方面——用'手'这个词来说——人们似乎仍在努力让机械手实现精准动作。物理规律似乎成了制约因素。
Is that happening? Robotics, that's another thing that seems to be, like, one of the key themes. But on the other hand, to use actually the word hand, it seems like people are just still struggling to make hands move Right. the right way. So, like, the physics of it seems to be the limiting factor.
是的。这就是所谓的莫拉维克悖论:人类觉得容易的动作(比如抓取物品)对AI极难,而人类觉得困难的数学推理对AI反而简单。我个人认为这个悖论有点伪命题的意味,这本质上是数据可得性和强化学习信号的问题。观察机器人运动能力(行走平衡等)的进展就是个很好的例证。
Yeah. There's this thing called Moravec's paradox, right, which is that things which we find really easy, like manipulation, picking up objects, are really hard for AI, but things which we find hard, like reasoning through mathematical problems, are easy. I actually think Moravec's paradox is a little bit fake. I think this is mostly a question of data availability and, like, you know, RL signal and stuff. And one interesting place to look at this is robotic locomotion, so, you know, the ability of robots to walk around and balance and stuff.
看看现在Unitree机器人的视频,与两年前的差距令人震惊。它们灵活得不可思议——有个视频里机器人被踢倒后,居然能像《黑客帝国》那样酷炫地翻身而起。
And look at the videos of the Unitree robots. The difference now versus two years ago is crazy. These things are incredibly agile. Like, there's this video, I think, of someone kicking one over, and it literally does, like, a Matrix kind of get-back-up thing. It's crazy.
这是因为运动控制很容易获得强化学习信号。说实话,基础强化学习已基本解决了运动问题。物体操控更难些,但有几个迹象让我相信机器人技术会突破:首先,今年各机器人实验室取得了惊人进展。
This is because locomotion is a really easy RL signal. And right now, locomotion is pretty much kinda solved, to be honest, with basic RL. Manipulation is a bit harder. But there's a few things that make me think that robotics is gonna work. For starters, I've seen incredible progress from the robotics labs over this year.
确实,它们已经发展到能够完成相当有趣的基础物理任务了。二是存在巨大的生成器与验证器之间的差距,这也是改进模型困难的原因之一——我们需要不断找到能击败当前模型的人来提升它们。但在机器人领域,我们正在打造真正智能的通用模型,因此可以将其作为老师或裁判,判断机器人行为是否正确。比如我说'把红色积木叠在蓝色积木上',就可以询问语言模型:它是否正确地堆叠了积木?若是,则给予奖励。
Really, they've gotten to the point where they can do pretty interesting, you know, basic physical tasks. Two is the existence of a large generator-verifier gap. One of the things that makes improving our models hard is that we constantly need to find, you know, people who can beat the models at the things we wanna improve them on. But with robotics, we're making really smart general models, so you can actually have those as teachers or judges for whether or not the robot is doing the right thing. If I say, stack the red block on top of the blue block, we could then ask the language model: did it stack the blocks appropriately? If so, give it a reward.
若未完成,则不奖励。这样就能利用生成器与验证器之间的差距为模型提供反馈。最后,机器人学界长期认为必须解决长期连贯性和规划问题,而语言模型也简化了这个过程——它们能将任务分解为多个步骤。
If not, don't. So you can use the generator-verifier gap to give models feedback. And finally, for a long time in robotics, people thought they would have to solve long-term coherency and planning. And that's also something that language models have made easier. They can break things down into multiple steps.
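A minimal sketch of that judge-as-reward idea: here the "judge" is a stand-in function over a toy scene dictionary, whereas in practice it would be a general language or vision-language model scoring the robot's outcome; all names and the scene encoding are hypothetical.

```python
def judge(instruction: str, scene: dict) -> bool:
    """Stand-in for asking a general model: did the robot satisfy the instruction?"""
    if instruction == "stack the red block on the blue block":
        return scene.get("red_block_on") == "blue_block"
    return False

def rl_reward(instruction: str, scene: dict) -> float:
    # The judge's verdict becomes the RL reward for the motor policy.
    return 1.0 if judge(instruction, scene) else 0.0

good = rl_reward("stack the red block on the blue block", {"red_block_on": "blue_block"})
bad = rl_reward("stack the red block on the blue block", {"red_block_on": "table"})
print(good, bad)
```

The generator-verifier gap is the point: checking the outcome is much easier than producing the motor commands, so a general model can supply the training signal even for skills it cannot perform itself.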
所以所有机器人实验室都在全力研发卓越的运动策略,并取得了惊人进展。我认为这本质上是个数据与反馈循环的问题。
So all the robotics labs are focused really hard on making great motor policies, and they're making incredible progress. It's mostly just, I think, like, a data and feedback loop question.
好的,Sholto。这次对话太精彩了。我还能想到四十个问题想立刻请教你,但你已如此慷慨地分享了这么多时间。非常感谢,这次访谈太棒了。
Alright, Sholto. It's been fascinating. I can think of another 40 questions that I would wanna ask you right now, but you've been incredibly generous with your time. Thank you so much. This was terrific.
真心感激。
Really appreciate it.
非常愉快,衷心感谢。嗨。
It was a real pleasure. Thank you very much. Hi.
我是马特·特克。感谢收听本期MAD播客。若您喜欢,恳请订阅(若尚未订阅),或在您收听/观看的平台留下好评。这对我们打造优质播客、邀请杰出嘉宾至关重要。谢谢,下期再见。
It's Matt Turk again. Thanks for listening to this episode of the MAD podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing, if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode from. This really helps us build the podcast and get great guests. Thanks, and see you on the next episode.
关于 Bayt 播客
Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。