人工智能发展放缓了吗？内森·拉本茨表示我们问错了问题

本集简介

内森·拉本兹（Nathan Labenz）是分析人工智能发展动向最清晰的声音之一，他将敏锐的技术洞察与多年在《认知革命》的工作经验相结合。在本期节目中，内森与a16z的埃里克·托伦伯格（Erik Torenberg）共同探讨一个紧迫问题：人工智能的进步真的在放缓，还是我们已对突破习以为常？他们讨论了关于GPT-5的争议、推理与自动化技术的现状、智能体与工程工作的未来，以及如何为人工智能的下一步构建积极愿景。资源：关注内森的X账号：https://x.com/labenz 收听《认知革命》：https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk 观看《认知革命》：https://www.youtube.com/@CognitiveRevolutionPodcast 保持更新：若喜欢本期节目，请点赞、订阅并分享给朋友！关注a16z的X账号：https://x.com/a16z 关注a16z的领英：https://www.linkedin.com/company/a16z 在Spotify收听a16z播客：https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX 在Apple播客收听a16z播客：https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711 关注主持人：https://x.com/eriktorenberg 请注意，此处内容仅供信息参考；不应视为法律、商业、税务或投资建议，亦不可用于评估任何投资或证券；且不针对任何a16z基金的现有或潜在投资者。a16z及其关联方可能持有讨论企业的投资。详情请见a16z.com/disclosures。保持更新：关注a16z的X账号关注a16z的领英在Spotify收听a16z播客在Apple播客收听a16z播客关注主持人：https://twitter.com/eriktorenberg 请注意，此处内容仅供信息参考；不应视为法律、商业、税务或投资建议，亦不可用于评估任何投资或证券；且不针对任何a16z基金的现有或潜在投资者。a16z及其关联方可能持有讨论企业的投资。详情请见a16z.com/disclosures。由AdsWizz旗下Simplecast托管。关于我们收集和使用个人数据用于广告的信息，请参见pcm.adswizz.com。

Nathan Labenz is one of the clearest voices analyzing where AI is headed, pairing sharp technical analysis with his years of work on The Cognitive Revolution. In this episode, Nathan joins a16z’s Erik Torenberg to ask a pressing question: is AI progress actually slowing down, or are we just getting used to the breakthroughs? They discuss the debate over GPT-5, the state of reasoning and automation, the future of agents and engineering work, and how we can build a positive vision for where AI goes next. Resources: Follow Nathan on X: https://x.com/labenz Listen to the Cognitive Revolution: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk Watch Cognitive Revolution: https://www.youtube.com/@CognitiveRevolutionPodcast Stay Updated: If you enjoyed this episode, be sure to like, subscribe, and share with your friends! Find a16z on X: https://x.com/a16z Find a16z on LinkedIn: https://www.linkedin.com/company/a16z Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711 Follow our host: https://x.com/eriktorenberg Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures. Stay Updated: Find a16z on X Find a16z on LinkedIn Listen to the a16z Podcast on Spotify Listen to the a16z Podcast on Apple Podcasts Follow our host: https://twitter.com/eriktorenberg Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures. Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

双语字幕

仅展示文本字幕，不包含中文音频；想边听边看，请使用 Bayt 播客 App。

Speaker 0

人工智能并不等同于语言模型。人工智能正以非常相似的架构被开发，应用于各种不同的模态，并且数据量要大得多。反馈开始从现实中涌现。也许我们正在耗尽已经解决的问题。当我们开始给下一代模型这些强大的工具，它们开始解决以前未解决的工程问题时，我认为我们开始看到某种类似超智能的东西。

AI is not synonymous with language models. AI is being developed with pretty similar architectures for a wide range of different modalities, and there's a lot more data there. The feedback is starting to come from reality. Maybe we're running out of problems we've already solved When we start to give the next generation of the model these power tools and they start to solve previously unsolved engineering problems, I think you start to have something that looks kind of like superintelligence.

Speaker 1

关于人工智能进展是否已进入平台期，还是我们的期望刚刚赶上变化速度，争论日益激烈。在本期节目中，我邀请了《认知革命》的主持人Nathan Labenz，共同探讨人工智能创新是否真的在放缓。我们从Cal Newport认为学生使用AI变得更懒惰的论点，到GPT-5相比GPT-4并非飞跃的说法，逐一剖析放缓论的依据。Nathan和我深入探讨了AI底层真正发生的变化，从新的推理和数学能力到真实的科学发现，以及远超聊天机器人的多模态系统。我们还讨论了智能体、自动化，以及工作本身开始改变的速度。

There's a growing debate about whether AI progress has plateaued or if our expectations have simply caught up to the pace of change. On this episode, I'm joined by Nathan Labenz, host of the Cognitive Revolution, to unpack whether AI innovation is actually slowing. We break down the case for slowdown from Cal Newport's argument that students are using AI to get lazier to the claims that GPT-five wasn't a leap over GPT-four. Nathan and I look at what's really happening under the hood of AI, from new reasoning and math capabilities to real scientific discoveries and multimodal systems that go far beyond chatbots. We also discuss agents, automation, and how quickly work itself is starting to change.

Speaker 1

最后我们会探讨这个重大问题：如果进步没有放缓，我们应如何引导它走向我们真正想要的未来？让我们开始吧。

And we'll end on the big question. If progress isn't slowing down, how should we shape it toward a future we actually want? Let's get started.

Speaker 2

Nathan，非常高兴你第一次做客ACNZ播客。显然，我们作为播客搭档已经很久了，你一直主导着《认知革命》。欢迎你。

Nathan, I'm stoked to have you on the ACNZ podcast for the first time. Obviously, we've been podcast partners for a long time with you leading Cognitive Revolution. Welcome.

Speaker 0

很高兴来到这里。谢谢

It's great to be here. Thank

Speaker 2

你。所以我们之前聊到Cal Newport在《Lost Debates》播客上的露面，我们认为这是个好机会来展开这场广泛的对话，真正探讨这个问题：AI是否在放缓？不如你先为他或更广泛的那一方所提出的一些论点做个 steel man（最强反驳），然后我们可以进行更广泛的讨论。

you. So we were talking about Cal Newport's podcast appearance on Lost Debates, we thought it was a good opportunity to just have this broad conversation and really entertain this question of, is AI slowing down? So why don't you steel man some of the arguments that you've heard on that side of from him or more broadly, then we could have this broader conversation.

Speaker 0

是的。我认为首先，关于人工智能，区分几个不同的问题非常重要。一个是，它现在对我们是否有益，以及从大局来看它是否会对我们有益？然后我认为这与我们所见的能力是否继续以相当快的速度进步是非常不同的问题。所以实际上，在你分享给我的Cal Newport播客中，关于AI可能已经对人们产生的影响的一些担忧，我有很多同感。

Yeah. Think for one thing, it's really important to separate a couple of different questions, I think, with respect to AI. One would be, is it good for us right now even, and is it going to be good for us in the big picture? And then I think that is a very distinct question from are the capabilities that we're seeing continuing to advance and at a pretty healthy clip. So I actually found a lot of agreement with the Cal Newport podcast that you shared with me when it comes to some of the worries about the impact that AI might be having even already on people.

Speaker 0

你知道，他会观察学生们的学习情况，发现他认为他们基本上是在用AI偷懒，这其实不是什么新鲜事。我想很多老师都会这么说。他用了更委婉的说法，说人们未必效率更高，而是通过让AI代劳来减轻自己大脑的工作负担。而且，我觉得他一直是社交媒体影响方面非常有价值的评论者。

You know, he looks over students' shoulders and watches how they're working and finds that basically, he thinks that they are using AI to be lazy, which is no big revelation. I think a lot of teachers would tell you that. Soccer. Puts that in, yeah, puts that in maybe more dressed up terms that people are not even necessarily moving faster, but they're able to reduce the strain that the work that they're doing places on their own brains by kind of trying to get AI to do it. And, you know, if that continues and I think he's been, I think, a very valuable commenter on the impact of social media.

Speaker 0

当然，我认为我们都应该注意自己的注意力持续时间是如何随时间变化的，以及我是否变得软弱或厌恶艰苦工作。如果这些趋势在自己身上显现，那可不是好兆头。所以我觉得他提醒大家注意这类问题是非常正确的。就像我们过去多次讨论的那样，我对AI的最终影响有很多疑问，我想他可能也有。但从我的角度看，奇怪的是他从'现在存在这些问题，长远来看也许...'转到'但不用担心'。

Certainly, I think we all should be mindful of how is my attention span evolving over time, and am I getting weak or averse to hard work? Those are not good trends if they are showing up in oneself. So I think he's really right to watch out for that sort of stuff. Then as we've covered in many conversations in the past, I've got a lot of questions about what the ultimate impact of AI is gonna be, and I think he probably does too. But then when it comes to it's a strange move from my perspective to go from there's all these sort of problems today and maybe in the big picture to but don't worry.

Speaker 0

就像停滞不前了。有点担心，但不用担心，因为它不会比现在更进一步了，或者说它的扩展已经逐渐减弱，或者我们不会得到比现在更好的AI了。甚至在我看来最容易反驳的说法是GPT-5并没有比GPT-4好多少。这一点真的让我很惊讶，等等，这是什么意思？

It's flatlining. Like, kind of worry, but don't worry because it's not really going anywhere further than this, or it's scaling has kind of petered out, or we're not gonna get better AI than we have right now. Or even maybe the most easily refutable claim from my perspective is g p t five wasn't that much better than g p t four. And that, I think, is where I really was like, what? Wait a second.

Speaker 0

我在很多方面都同意你的看法，他观察到学生的一些行为，我自己也承认有过。现在当我尝试编写代码时，很多时候我会想，天啊，AI就不能搞定吗？我真的不想坐在这里读代码、搞清楚怎么回事。现在甚至都不是打字的问题了。

I was with you on a lot of things, and some of the behaviors that he observes in the students, would cop to having exhibited myself. When I'm trying to code something these days, a lot of times, I'm like, oh, man. Can't the AI just figure it out? I really don't wanna have to sit here and read this code and figure out what's going on. It's not even about typing the code anymore.

Speaker 0

你知道吗？我太懒了。但甚至理解代码如何运作都让我懒得做。就不能让它直接运行吗？再试一次。

You know? I'm way too lazy for that. But it's even about figuring out how the code is working. Just just can't you just make it work? Try again.

Speaker 0

你知道吗？就再试一次。我确实发现自己有时会陷入这些陷阱。但我想说，我能陷入这些陷阱的一个重要原因是AI变得越来越好。越来越让我觉得它们有可能搞定问题，这种想法并不疯狂。

You know? And just try again. And I do find myself at times falling into those traps. But I would say a big part of the reason I can fall into those traps is because the AIs are getting better and better. And increasingly, it's not crazy for me to think that they might be able to figure it out.

Speaker 0

这就是我对听到的这些观点初步的看法。也许可以画一个2x2矩阵：你认为AI现在和未来是好是坏，以及你认为它是大事还是小事？我认为它既好又坏，但我绝对认为它是件大事。我最难理解的是那些不觉得这是大事的人，在我看来这很明显，尤其是在GPT-4到GPT-5的飞跃方面。也许发生这种情况的一个原因是GPT-4和5之间发布了更多版本。

So that's my kind of first slice at the takes that I'm hearing. There's almost like a two by two matrix maybe that one could draw up where it's do you think AI is good or bad now and in the future, and do you think it's, like, not a big deal or a big deal? And I think it's both on the good and bad side, but I definitely think it's a big deal. The thing that I struggle to understand the most is the people who kind of don't see the big deal that it seems pretty obvious to me and the especially when it comes again to the leap from g p d four, g p five. Maybe one reason that's happened a little bit is that there were just a lot more releases between g p t four and five.

Speaker 0

所以人们拿来比较的是，你知道，几个月前才发布的东西，o3，对吧，它只比GPT5早几个月发布。而GPT4是在ChatGPT之后不久发布的，当时大家都觉得，哇，这东西简直是横空出世。很多人是第一次见到它。如果你回头看GPT3，那是一个巨大的飞跃。

So what people are comparing to is, you know, something that just came out a few months ago, o three, right, that only came out a few months before GPT five. Whereas with GPT four, it was shortly after ChatGPT, and it was all kind of this moment of, woah. This thing is, like, exploding onto the scene. A lot of people are seeing it for the first time. And if you look back to GPT three, there's a huge leap.

Speaker 0

我认为从GPT4到GPT5的飞跃是类似的。这些东西很难打分。没有一个单一的数字可以衡量。嗯，有损失率。但当然，一个大的挑战是，损失数字究竟在能力上意味着什么？

I would contend that the leap is similar from GPT four to five. These things are hard to score. There's no single number that you could put on it. Well, there's loss. But, of course, one of the big challenges is that what exactly does a loss number translate into in terms of capabilities?

Speaker 0

所以很难确切描述到底发生了什么变化，但如果你愿意，我们可以探讨一些变化的维度，列举一些我认为人们可能已经开始或已经视为理所当然甚至忘记的事情，比如GPT4并没有很多现在……是的，在GPT5发布时被期待的功能，因为我们在4o、o1和o3中见过它们，所有这些可能在一定程度上让人们温水煮青蛙，对这次最新发布所感知的进步程度产生了影响。

So it's very hard to describe what exactly has changed, but we could go through some of the dimensions of change if you want to and enumerate some of the things that I think people maybe are starting to or have come to take for granted and kind of forget, like, that g p t four didn't have a lot of the things that now Yeah. Were sort of expected in the g p t five release because we'd seen them in '4 o and o one and o three, and all those things sort of maybe boiled the frog a little bit when it comes to how much progress people perceived in this last release.

Speaker 2

是的。我有几点反应。一是，甚至让你的2x2矩阵更复杂一点，从“现在不好”与“以后不好”的角度来看。Cal并不是我们俩都非常钦佩的人，顺便说一句。Cal是个很棒的人，也是思想领域的重要贡献者，但他并不像AI安全人士和其他许多人那样关心这类未来的AI问题。

Yeah. A couple of reactions. So one is and even to complicate your two by two, even further in the sense of, is it bad now versus it bad later? Cal is not really who we both admire, by way, a lot. Cal's a great guy and a valuable contributor to the thought space, But he's not as concerned about sort of this sort of future AI concerns that sort of the AI safety folks and many others are concerned about.

Speaker 2

他更关心的是它对当前认知表现和发展的生活意味着什么，就像他担心社交媒体影响一样。你认为这是一个担忧，但远不及对未来预期的担忧那么大。然后，他还提出了一个为什么我们不应该担心未来的理论，因为它在放缓。我们为什么不分享一下我们是如何解读他的历史观呢？据我解读，他的观点是，嘿，简化版就是我们已经找到了这种方法，如果你把一堆数据扔进模型，它就会按数量级变得更好。所以GPT2到GPT3，然后GPT3到GPT4之间的差异是显著的，但随后收益显著递减。

He's more concerned about what it means to life for cognitive performance and development now in the same way that he's worried about social media's impact. And you think that's a concern, but nowhere near as big a concern as what to expect in the future. And then also, he presents sort of this theory of why we shouldn't worry about the future because it's slowing down. Why don't we just share how we interpreted kind of his history, which as I interpreted it was this idea of, hey, the simplistic version is we've figured out this way such that if you throw a bunch of data into the model, it gets better in sort of order of magnitude. And so the difference between GPT-two and GPT-three and then GPT-three and GPT-four, but then that sort of significant the the difference, but then achieved sort of a diminishing returns significantly.

Speaker 2

我们在GPT5上没有看到这种情况，因此我们不必再担心了。你会如何修正对他历史观的描述，然后我们可以深入讨论4和5之间的差异？

And we're not seeing it at g p d five, and thus we don't have to worry anymore. How would you edit the characterization of his view of the history, and then we can get into the differences between four and five?

Speaker 0

关于缩放定律的想法，这绝对值得同意，但需要花点时间指出，它不是自然法则。我们没有原则性的理由相信缩放是某种会无限持续下去的定律。我们真正知道的是，到目前为止它已经维持了相当多的数量级。我认为目前还不清楚缩放定律是否已经耗尽，或者我们是否只是找到了一个更陡峭的改进梯度，在另一个我们可以推进的方面给了我们更好的投资回报率。所以他们确实训练了一个更大的模型，即GBT4.5，并且确实发布了它。

The scaling law idea, which is definitely worth agreeing, taking a moment to note that it is not a law of nature. We do not have a principled reason to believe that scaling is some law that will go indefinitely. All we really know is that it has held through quite a few orders of magnitude so far. I think that it's really not clear yet to me whether or not the skilling laws have petered out or whether we have just found a steeper gradient of improvement that is giving us better ROI on another front that we can push on. So they did train a much bigger model, which was GBT 4.5, and that did get released.

Speaker 0

当然，有很多有趣的基准测试，随便就有上百万个。我最关注的一个基准测试是Simple QA，它其实就是一个超长尾的冷知识测试，用来理解GPT-4.5与O3和GPT-5的关系。OpenAI显然在命名方面臭名昭著，这一点我们都同意。我认为相当一部分困惑和分歧确实源于失败的命名决策。

And there are a number of interesting of course, there's a million benchmarks, whatever. The one that I zero in on the most in terms of understanding how gbt 4.5 relates to both o three and gbt five, and OpenAI, obviously, famously terrible at naming. We can all agree on that. I think a decent amount of this confusion and sort of disagreement actually does stem from unsuccessful naming decisions. 4.5 on this one benchmark called simple QA, which is really just a super long tail trivia benchmark.

Speaker 0

它实际上只是测量你是否知道大量冷门事实。这些不是你可以推理的东西，你只能要么知道要么不知道这些特定事实。O3系列的模型在该基准上得分约为50%，而GPT-4.5跃升至约65%。换句话说，对于上一代模型不知道的内容，它掌握了其中的三分之一。

It really just measures, do you know a ton of esoteric facts? And they're not things that you can really reason about. You either just have to know or don't know these particular facts. The o three class of models got about a 50% on that benchmark, and GPT 4.5 popped up to, like, 65%. So in other words, it basically, of the things that were not known to the previous generation of models, it picked up a third of them.

Speaker 0

显然还有三分之二有待提升，但我会说这是一个相当显著的飞跃，对吧？这些都是超长尾问题。我认为大多数人可能得分接近零，就像在问答之夜坐在那里的人，一晚上可能只答对一题，这是我预期大多数人在Simple QA上的表现。

Now there's obviously still two thirds more to go, but I would say that's a pretty significant leap. Right? These are super long tail questions. I would say most people would get, like, close to a zero. You'd be like the person sitting there at the trivia night who, like, maybe gets one a night is kind of what I would expect most people to do on simple QA.

Speaker 0

这很合理，对吧？显然，模型在事实和世界一般信息方面比我们知道得多得多。所以至少可以说，GPT-4.5知道得更多。更大的模型能够吸收更多事实。

And that checks out. Right? Like, obviously, the models know a lot more than we do in terms of facts and just general information about the world. So at a minimum, you can say that g p d four point five knows a lot more. A bigger model is able to absorb a lot more facts.

Speaker 0

定性上，人们也说在某些方面它可能更适合创意写作。它从未接受过与GPT-5相同强度的后期训练，所以我们没有真正的同类比较，但人们确实发现它有一些用处。我认为他们将其下线并全力投入GPT-5的原因可能是那个模型真的很大，运行成本昂贵。

Qualitatively, people also said, in some ways, maybe it's better for creative writing. It was never really trained with the same power of post training that g p d five has had. And so we don't really have an apples to apples comparison, but people do did still find some utility in it. I think maybe the way to understand why they've taken that offline and gone all in on g p d five is just that model's really big. It's expensive to run.

Speaker 0

价格要高得多，比GPT-5高出整整一个数量级还多。对他们来说，消耗运行它所需的所有计算资源可能不值得。也许他们发现人们目前对稍小的模型已经足够满意。我不认为这意味着我们永远不会看到一个具备所有推理能力的更大GPT-4.5模型。

The price was, like, way higher. It was a full order of magnitude plus higher than g p t five is. And it's maybe just not worth it for them to consume all the compute that it would take to serve that. And maybe they just find that people are happy enough with the somewhat smaller models for now. I don't think that means that we will never see a bigger GPT 4.5 model with all that reasoning ability.

Speaker 0

我预计那会带来更多价值，尤其是如果你真的出去尝试做那些冷门的事情，比如推动科学前沿或其他什么。但与此同时，当前的模型非常聪明，你也可以给它们输入大量上下文。这是上一代以来改进最大的方面之一。当GPT-4推出时，至少我们作为公共用户拥有的版本只有8000个token的上下文，相当于大约15页文本，所以当时是有限的。

And I I would expect that that would deliver more value, especially if you're really going out and trying to do esoteric stuff that's, you know, pushing the frontier of science or what have you. But in the meantime, the current models are really smart, and you can also feed them a lot of context. That's one of the big things that has improved so much over the last generation. When GPT four came out, at least the version that we had as public users was only 8,000 tokens of context, which is like 15, you know, pages of of text. So you were limited.

Speaker 0

你甚至不能放进去，比如，几篇论文。你会让上下文溢出。而这正是提示工程最初开始流行的地方。就像，天哪，我真正能提供的信息就这么一点点。我必须非常小心地选择提供哪些信息，以免让它溢出，导致它无法处理。

You couldn't even put in, like, a couple papers. You would be overflowing the context. And this is where prompt engineering initially kinda became a thing. Was like, man, I've really only got such a little bit of information that I can provide. I gotta be really careful about what information to provide, lest I overflow the thing and it just can't handle it.

Speaker 0

随着上下文窗口的扩展，也出现了一些模型版本，它们名义上可以接受更多内容，但实际上无法有效使用。你知道，它们在API调用层面似乎能容纳这些内容，但模型会失去回忆能力，或者在处理越来越长的上下文时会逐渐崩溃。现在你显然有了更长的上下文，并且对它的掌控真的非常非常好。所以你可以用Gemini的最长上下文窗口处理几十篇论文，它不仅会接受它们，还会进行相当深入的推理，并且对这些输入保持非常高的保真度。所以我认为，这种技能在某种程度上替代了模型本身了解事实的需求。

There were also, as context windows got extended, there were also versions of models where they could nominally accept a lot more, but they couldn't really functionally use them. You know, they sort of could could fit them, you know, at the API call level, but they the models would lose recall or they they'd sort of unravel as they got into longer and longer context. Now you have obviously much longer context, and the command of it is really, really good. So you can take dozens of papers on the longest context windows with Gemini, and it will not only accept them, but it will do pretty intensive reasoning over them and with really high fidelity to those inputs. So that skill, I think, does kinda substitute for the model knowing facts itself.

Speaker 0

你可以说，天哪，我们试着把所有这些东西训练进模型里。我们需要，你知道，一万亿或者谁知道，五万亿，不管多少万亿参数来容纳所有这些超长尾事实。或者你可以说，嗯，一个更小的模型如果非常擅长处理提供的上下文，只要人们花时间或者不嫌麻烦地提供必要信息，我就可以通过这种方式访问相同的事实。所以你有一种选择：我是想在这方面加大规模，把所有东西都内建到模型里，还是只想从一个更小、更紧凑的模型中榨取尽可能多的性能？看起来他们选择了后者。

You could say, jeez, we let's try to train all these facts into the model. We're gonna need, you know, a trillion or who knows, 5,000,000,000,000, however many trillion parameters to fit all these super long tail facts. Or you could say, well, a smaller thing that's really good at working over provided context can if people take the time or, you know, go to the trouble of providing the necessary information, I can kind of access the same facts that way. So you have a kind of do I wanna push on this size, and do I wanna bake everything into the model, or do I wanna just try to get as much performance out of a smaller, tighter model that I have? And it seems like they've gone that way.

Speaker 0

我认为基本上只是因为他们在那条梯度上看到了更快的进展。你知道吗？就像模型本身在训练过程中总是朝着改进迈出一小步一样，模型架构的外循环、训练运行的性质以及他们将计算资源投入的方向也在朝着那个方向发展。他们总是在看，比如，我们可以在这方面扩大规模，也许能获得这种好处一点点，或者我们可以在这里做更多的训练后工作来获得那种好处。看起来我们从训练后工作和推理范式中获得的好处比扩大规模更多。

And I I think basically just because they're seeing faster progress on that gradient. You know? In the same way that the models themselves are always kind of in the training process taking a little step toward improvement, you know, the the outer loop of the model architecture and the the nature of the training runs and where they're gonna invest their compute is also kind of going that direction. And they're always looking at, like, well, we could scale up over here and maybe get this kind of benefit a little bit, or we could do more post training here and get this kind of benefit. And it just seems like we're getting more benefit from the post training and the reasoning paradigm than scaling.

Speaker 0

但我认为两者都不是，我绝对不认为其中任何一种已经过时。我们还没有看到4.5版本经过所有这些训练后会是什么样子。是的。所以，嗯，我是说，最近几周的一个大新闻是，我们有多家公司的纯推理模型在没有工具访问的情况下获得了IMO金牌。而且，你知道，这与GPT-four在数学方面的能力相比简直是天壤之别。

But I don't think either one is, I I definitely don't think either one is is dead. We haven't seen yet what 4.5 with all that post training would look like. Yeah. And and so Well, I mean, a big one from just the last few weeks was that we had an IMO gold medal with pure reasoning models, with no access to tools from multiple companies. And, you know, that is night and day compared to what GPT-four could do with math.

Speaker 0

对吧？这些东西真的很奇怪。就像，我在这里说的任何话都不应该暗示人们无法找到模型的弱点。直到今天，我仍然使用一个井字棋谜题，我拍一张井字棋棋盘的照片，其中一个玩家走了一步错棋，不是最优的，从而让另一个玩家可以强制获胜。然后我问模型，是否有人可以从这个位置强制获胜。

Right? We these things are really weird. Like, it's nothing I say here should be, intended to suggest that people won't be able to find weaknesses in the models. I still use a tic tac toe puzzle to this day where I take a picture of a tic tac toe board where some the one of the players has made a wrong move, that is not optimal and thus allows the other player to force a win. And I ask the models if somebody can force a win from this position.

Speaker 0

只有最近，只有最新一代的模型才开始有时能答对。在此之前，它们几乎总是说，井字棋是一个已解决的游戏。你知道，你总是可以平局。它们会错误地评估我的棋盘位置，认为玩家仍然可以平局。所以有很多奇怪的事情。

Only very recently, only the last generation of models are starting to get that right some of the time. Almost always before, they were like, tic tac toe is a solved game. You know, you can always get a draw. There's and they would wrongly assess my board position as the player can still get a draw. So there's a lot of weird stuff.

Speaker 0

对吧？能力的前沿仍然是一个真正的问题，人们肯定会遇到高峰和低谷。但GPT-4刚推出时，根本无法处理接近IMO（国际数学奥林匹克）级别的题目，它当时还在高中数学上挣扎。而从那以后，我们看到了它在高中数学上的进步，一路攀升至IMO金牌水平。

Right? The the jagged, capabilities frontier remains a a real issue, and people are gonna find, you know, peaks and valleys for sure. But g p t four, when it first came out, couldn't do anything approaching IMO goal problems. It was still struggling on, like, high school math. And since then, we've seen this high school math progression all the way up through the IMO gold.

Speaker 0

现在我们有了前沿数学基准测试，我认为现在已经达到了25%左右，大约一年前甚至不到一年前还只有2%。而且就在今天，我们还看到了一些消息，我还没完全消化，但有人刚刚宣布他们解决了一个超级挑战性的经典问题，这个问题甚至是陶哲轩提出的。这件事大概只花了几天或几周的时间，而专业数学家——不仅仅是普通数学家，而是世界顶尖的头脑——花了18个月才在这些问题上取得进展。所以，是的，我认为这确实是很难忽视的能力飞跃。

Now we've got the frontier math benchmark that is, I think, now, like, up to 25%. It was 2% about a year ago or even a little less than a year ago, I think. And we also just today saw something where, and I haven't absorbed this one yet, but somebody just came out and said that they had solved a a, you know, canonical super challenging problem that no less than Terence Tao had put out. And it was like this, you know, this thing happened in, I think, days or weeks of the model running versus it was eighteen months, you know, that it took professional not just any professional mathematicians, but, like, really, you know, the leading minds in the world to make progress on these problems. So, yeah, I think that's really, you know, that's that's really hard jumping capabilities to miss.

Speaker 0

我还经常思考谷歌的AI科学家项目，我们做过一期节目。如果你想了解完整故事，可以去看看。但基本上，他们只是把科学方法分解成了一个框架。你知道吧？当人们说一件事时，很多时候就是这样。

I also think a lot about the Google AI coscientist, which we did an episode with. We can you can check out the full story on that if you want to. But, you know, they basically just broke down the scientific method into a schematic. You know? And this is a lot of what happens when people there's one thing to say.

Speaker 0

模型会通过思考来回应，它会经历推理过程。而且，你知道，它在运行时消耗的token越多，答案就会越好。这是真的。然后你还可以在此基础上搭建脚手架，说，好吧，让我把像科学方法这样广泛且具有抱负的东西分解成几个部分。

The model will respond with thinking, and it'll go through reasoning process. And, you know, the more tokens it it spends at runtime, the better your answer will be. That's true. Then you can also build this scaffolding on top of that and say, okay. Well, let me take something as broad and, you know, aspirational as the scientific method, and let me break that down into parts.

Speaker 0

好吧。有假设生成，然后是假设评估，接着是实验设计，还有文献综述。

Okay. There's hypothesis generation. Then there's hypothesis evaluation. Then there's, you know, experiment design. There's literature review.

Speaker 0

科学方法有所有这些部分。谷歌团队所做的是创建了一个相当精细的框架，代表了他们对科学方法的最佳分解，为每个步骤优化了提示，然后给出了这个 resulting 系统，现在它在推理上以两种方式扩展。它既是思维链，也是团队构建的所有这些不同攻击角度。他们给了它科学中真正未解决的问题。在一个特别著名甚至有点臭名昭著的案例中，它提出了一个假设，但由于无法直接访问实验室来运行实验，它无法验证。

There's all these parts of the scientific method. What the team at Google did is created a pretty elaborate schematic that represented their best breakdown of the scientific method, optimized prompts for each of those steps, and then gave this resulting system, which is scaling inference now kind of two ways. It's both the chain of thought, but it's also all these different angles of attack structured by the team. And they gave it legitimately unsolved problems in science. And in one particularly famous kind of notorious case, it came up with a hypothesis, which it wasn't able to verify because it doesn't have direct access to actually run the experiments in the lab.

Speaker 0

但它针对某个病毒学中的开放问题提出了一个假设，这个问题已经困扰科学家多年了，而碰巧的是，科学家们最近也找到了答案但尚未发表结果。所以就有了这样一个交汇点：科学家们通过实验验证了，而以AI科学家形式出现的Gemini得出了完全正确的答案。这些都是之前 literally 没人知道的事情。而GPT-4根本做不到这一点。你知道吧？

But it came up with a hypothesis to some open problem in virology that had stumped scientists for years, and it just so happened that they had also recently figured out the answer but not yet published their results. And so there was this confluence where the scientists had experimentally verified, and Gemini, in the form of this AI coscientist, came up with exactly the right answer. And these are things that, like, literally nobody knew before. And g p GPT four just wasn't doing that. You know?

Speaker 0

我的意思是，这些是质变的新能力。那个东西，我想，运行了好几天。你知道，运行推理可能花费了数百美元，甚至可能上千美元。这可不是小数目，但也比，你知道，多年培养研究生要便宜得多。如果你能解决那种级别的问题并真正获得好的解决方案，那么，你知道，你愿意为此类成果支付多少钱呢？

I mean, these are, qualitatively new capabilities. That thing, I think, ran for days. You know, it probably cost hundreds of dollars, maybe into the thousands of dollars to run the inference. You know, that's not nothing, but it's also, like, very much cheaper than, you know, years of grad students. And if you can get to those caliber of problems and actually get good solutions to them, like, you know, what would you be willing to pay, right, for that kind of thing?

Speaker 0

所以，是的，我不确定。这可能还不是完全的理解。我们可以讨论很久，但总结来说，GPT-4并未能真正推动人类知识的前沿。据我所知，我不认为它曾发现过任何新东西。要从GPT-5或Gemini 2.5或Claude Opus 4之类的模型中获取那种输出仍然不容易，但这种情况已经开始偶尔发生了。

So, yeah, I don't know. That's probably not a full appreciation. We could go on for a long time, but I would say, in in summary, g p t four was not able to push the actual frontier of human knowledge. I don't to my knowledge, I don't know if it ever discovered anything new. It's still not easy to get that kind of output from a GPT five or a Gemini 2.5 or, you know, a Claude Opus four or whatever, but it's starting to happen sometimes.

Speaker 0

而这本身就是一个巨大的突破。

And that in and of itself is a a huge deal.

Speaker 2

那么，我们如何解释围绕GPT-5的看跌情绪或氛围转变呢？一个可能的因素是，如果改进是在前沿领域，并非每个人都在日常工作中处理高级数学和物理，因此他们可能不会像ChatGPT的飞跃那样，在日常生活中明显感受到其好处，并影响日常体验。

Well, then how do we explain the the bearishness or the kind of vibe shift around GPT-five then? One potential contributor is this idea that if the improvements are at the frontier, not everyone is working with you know, sort of advanced math and physics in a in a day to day, and so maybe they don't see the benefits of it in their day to day lives in the in the same way that, you know, sort of the jumps in ChadGBT were were were were obvious and and shaped the day to day.

Speaker 0

是的。我的意思是，我认为很大程度上是他们搞砸了发布，简单来说。对吧？他们当时在推特上发了死星图片，后来Sam Altman回来澄清说，不，你们才是死星。

Yeah. I mean, I think a decent amount of it was that they kinda fucked up the launch, you know, simply put. Right? They, like, were tweeting death star images, which Sam Altman later came back and said, no. You're the death star.

Speaker 0

我不是死星。但人们以为死星指的是模型。那通常是——是的。期望被设定得极高。而实际发布本身在技术上就出了问题。

I'm not the death star. But I think people thought that the death star was supposed to be the model. That was generally the yeah. The expectations were set extremely high. The actual launch itself was just technically broken.

Speaker 0

所以很多人的第一次GPT-5体验是，他们现在有了这个模型路由器的概念，另一种理解他们做法的方式是，他们试图主导消费者用例。为了做到这一点，他们需要简化产品体验，相对于过去的情况，比如，好吧，你有GPT-4、4o、4o mini、o3、o4 mini等等。你知道，GPT-5也曾出现在其中。你有所有这些不同的模型。

So a lot of people's first experiences of GPT five, they they've got this model router concept now where think one another way to understand what they're doing here is they're trying to own the consumer use case. And to own that, they need to simplify the product experience relative to what we had in the past, which was like, okay. You got g b d four and four o and four o mini and o three and o four mini and other things. You know, four five was in there at one point. You got all these different models.

Speaker 0

我应该用哪个模型处理哪种问题？对于大多数不痴迷于此的人来说，这确实很令人困惑。因此，他们想要做的一件大事就是简化这个过程——只需提出你的问题，就能得到好的答案。而我们作为产品所有者，会承担这种复杂性。有趣的是，虽然我没有很好的解释，但你可能想做的是合并这些模型，让模型自己决定需要思考多少，或者甚至让模型自己决定需要使用多少专家——如果它是混合专家架构的话。

Which one should I use for which? It's, like, very confusing to most people who aren't obsessed with this. And so one of the big things they wanted to do was just shrink that down to just ask your question, and you'll get a good answer. And we'll take that complexity on our side as the the product owners to do that. Interestingly and I I don't have a great account of this, but one thing you'd you might wanna do is kind of merge the models and figure out just have the model itself decide how much to think, or maybe even, you know, have the model itself decide how many of its experts, if it's a mixture of experts architecture, it needs to use.

Speaker 0

或者，你知道，有很多不同的研究项目是关于跳过模型层的。如果任务足够简单，你可以跳过很多层。所以你可能会希望，在后台，你能真正将所有不同的模型合并成一个模型，该模型会根据用户查询所呈现的挑战级别动态使用适量的计算资源。看起来他们发现这比预期的要困难。因此，他们提出的解决方案是设置一个路由器，路由器的任务是选择。

Or maybe, you know, there's been a bunch of different, research projects on, like, skipping layers of the model. If the task is easy enough, you could, like, skip a bunch of layers. So you might have hoped that you could genuinely, on the back end, merge all these different models into one model that would dynamically use the right amount of compute for the level of challenge that a given user query presented. It seems like they found that harder to do than they expected. And so the solution that they came up with instead was to have a router where the the router's job is to pick.

Speaker 0

这是一个简单的查询吗？如果是，我们会把你送到这个模型。是中等难度？还是困难？我认为他们幕后实际上只有两个模型。

Is this an easy query? In which case, we'll send you to this model. Is it a medium? Is it a hard? And I think they just have two really, models behind the scenes.

Speaker 0

所以我认为就是非常简单或非常困难。当然，他们展示的图表基本上显示了‘思考’与‘不思考’的对比。发布时的问题是那个路由器坏了。所以所有的查询都流向了那个‘笨’模型。因此，很多人确实得到了糟糕的输出，比o3还差，因为他们得到的是不思考的回应。

So I think it's just really easy or hard. Certainly, the graphs that they showed, you know, basically showed the kind of with and without thinking. The problem at launch was that that router was broken. So all of all of the queries were going to the dumb model. And so a lot of people literally just got bad outputs, which were worse than o three because they were getting nonthinking responses.

Speaker 0

所以最初的反馈是，好吧，这很蠢，这种看法传播得非常快。我认为这定下了基调。我现在的感觉是，随着尘埃落定，大多数人确实认为它是最好的可用模型。而且，你知道，像那个臭名昭著的米制任务长度图表，它是最好的。

And so the initial reaction of like, okay. This is dumb, that sort of, you know, traveled really fast. I think that kind of set the tone. My sense now is that as the dust has settled, most people do think that it is the best model available. And, you know, things like the meter, the infamous, meter task length chart, it is the best.

Speaker 0

你知道，我们现在已经超过两个小时了，它仍然在趋势线之上。所以，如果你只是说，我相信图表上的直线吗，这个最新的数据点应该如何影响我对这些在对数幂尺度图上的直线定律的看法，它不应该太改变你的想法。它仍然在趋势线之上。我和Zvi谈过这个。Zvi Mowshowitz，传奇的信息吞噬者和人工智能行业分析师，在最近的一个播客中也问了他同样的问题。

You know, we're now over two hours, and, it is still above the trend line. So if you just said, you know, do I believe in straight lines on graphs or not, and how should this latest data point influence whether I believe on these straight law lines on, you know, power logarithmic scale graphs, it shouldn't really change your mind too much. It's still above the trend line. I talked to Zvi about this. Zvi Mashwitz, legendary info vore and AI industry analyst on a recent podcast too, and kinda asked him same question.

Speaker 0

比如，你认为为什么即使是一些最了解内情、最敏锐的业内人士似乎也因此将时间线推后了一点？他的回答基本上是，它解决了一些不确定性。你知道，你之前有一个悬而未决的问题：也许他们确实有另一个突破。也许它真的是‘死星’，你知道，如果他们在上行方向给我们惊喜，那么所有这些短时间线……我们本可以期待一个……是的。我想一种思考方式是，时间线上的分布原本是相当广泛的。

Like, why do you think the you know, even some of the most plugged in, you know, sharp, minds in the space have seemingly pushed timelines out a bit as a result of this? And his answer was basically just it resolved some amount of uncertainty. You know, you had open question of maybe they do have another breakthrough. You know, maybe really is the Death Star, you know, if they surprise us on the upside, then all these short timeline you know, we we could have expected a yeah. I guess one one way to think about it is, the the distribution was sort of broad in terms of timelines.

Speaker 0

而如果他们带来了超出预期的惊喜，概率分布可能会向前端收窄再收窄。而如果他们低于预期，或者仅仅是符合趋势，那么你就会从时间线的最短端取出部分分布，将其推向中间或末端。所以他的回答就像是，2027年实现AI的可能性变小了，但2030年实现AI的可能性基本没有降低。甚至可能还略微增加了，因为早期年份的一些概率质量现在转移到了那里。所以并不是说我认为人们没有大幅推迟整个分布。

And if they had surprised on the upside, it might have narrowed and narrowed in toward the front end of the distribution. And if it if they surprised on the downside or even just were, you know, purely on trend, then you would take some of your distribution from the very short end of the timelines and kinda push them back toward the middle or the end. And so his answer was like, AI twenty twenty seven seems less likely, but AI twenty thirty seems basically no less likely. Maybe even a little more likely because some of the the probability mass from the early years is now sitting there. So it's not that I don't think people are are moving the whole distribution out super much.

Speaker 0

我认为他们可能更多只是在收窄这个分布，你知道，它变得稍微更紧凑了，因为它可能不会像之前看起来那么快发生。但我不认为太多人，至少我认为真正深入了解此事的人，会大幅推迟到2030年之后。顺便说一下，显然存在很多分歧。我思考这类事情的方式一直是：Dario说是2027年，Demis说是2030年。

I think they're maybe more just kind of shrinking the, you know, it's getting a little tighter because it's maybe not happening quite as soon as it seemed like it might have been. But I don't think too many people, at least that I, you know, think are really plugged in on this, are pushing out too much past 2030 at all. And by the way, you know, the obviously, there's a lot of, you know, disagreement. The way I kind of have always thought about this sort of stuff is Dario says 2027. Demis says 2030.

Speaker 0

我就把这个当作我的范围。所以在GPT-5发布前，我大致处于这个区间，现在我会说，嗯，我不知道。Dario手里还有什么牌？你知道，他们刚刚发布了4.1 Opus。在那篇博客文章中，他们说，我们将在未来几周内发布更强大的模型更新。

I'll take that as my range. So coming into g p t five, I was kind of in that space, and now I'd say, well, I don't know. Dario's got what what cards does he have up his sleeve? You know, they just put out 4.1 Opus. And in that blog post, they said, we will be releasing more powerful updates to our models in the coming weeks.

Speaker 0

所以他们很快就会有新动作。也许这次他们会带来惊喜，或者谷歌会。我不会说2027年完全不可能。但是，是的，我会说2030年看起来仍然和以前一样可能。而且，从我的角度来看，那仍然非常快。

So they're due for something pretty soon. You know, maybe they'll be the ones to surprise on the upside this time, or maybe Google will be. I wouldn't say 2027 is is out of the question. But, yeah, I would say twenty twenty thirty still looks just as likely as before. And, again, from my standpoint, it's like, that's still really soon.

Speaker 0

对吧？所以无论是在28年、29年还是30年走上正轨，我其实不太在意。我努力规划自己的工作，以便让自己做好准备，并帮助他人为可能的最极端情景做好准备，这就像是，如果我们目标定得高，即使稍有偏差，我们还有更多时间，那很好。我相信我们会有很多事情可做，利用那额外的时间来准备好迎接任何强大的AI上线。但是，是的，我想我的世界观并没有因此改变太多。

You know? So if we're on track whether it's '28, '29, 30, I don't really care. I I I try to frame my own work so that I'm kind of preparing myself and helping other people prepare for what might be the most extreme scenarios and kind of, you know, one of these things where if we aim high and we miss a little bit and we have a little more time, great. I'm sure we'll have plenty of things to do to use that extra time to be ready for, you know, whatever powerful AI does come online. But, yeah, I guess I don't my worldview hasn't changed all that much as a result of

Speaker 2

这些夏季的进展。 anecdotal地说，我没有像以前那样频繁听到关于AI 2027或情境感知的讨论了。我确实和一些刚刚将其推迟几年的人聊过，正如你所说。是的，Durkash有他那一套说法，你知道，他仍然相信它，但可能是因为持续学习方面的差距或类似的原因，你知道，也许它的扩散会慢一些。而且，正如你提到的，Meter的论文显示工程师的生产力降低了，所以也许在未来几年内，人们对大规模被取代的担忧减少了。

these summer developments. Anecdotally, I I don't hear as much about AI 2027 or situational awareness to the to the same degree. I I do talk to some people who've just moved it a a few years back to to your point. Yeah, Durkash had his whole thing around you know, he still he still believes in it, but sort of, you know, maybe because this gap in continual learning or or or or something to the effect that, you know, maybe it's just gonna be a bit slower to to diffuse. And, you know, Meter's paper, as as you mentioned, showed that engineers are less productive, and so maybe there's there's less of a sort of concern around, you know, people being replaced to the next next few years in in mass.

Speaker 2

我想我们大约一年前讨论这个时，你好像说过类似50%的工作岗位。我很好奇这是否仍然是你的试金石，或者你现在怎么看待它。

I think when we spoke maybe a year ago ago about this, I think you said something like 50% of 50% of jobs. I'm I'm curious if that's still your your your litmus test or how you think about it.

Speaker 0

嗯，有一点，我认为METER的那篇论文值得再深入探讨一下，因为这是那种...我本人是METER的忠实粉丝，对他们没有任何偏见，因为我认为做科研、发表成果是好事。你不必让每个实验结果和发布的内容都符合某种叙事。但我确实觉得，对于那些想要说'这一切都是胡扯'的人来说，这篇论文太容易被他们拿来作为依据了。而且，这里还有一些东西我会归入Cal Newport的范畴——对我来说最有趣的可能是用户认为自己变快了，但实际上似乎更慢了。

Well, one thing, I think that meter paper is worth unpacking a little bit more because this was one of those things that was and I I I'm a big fan of METER, and I have no, you know, no shade on them because I do think, do science, publish your results. Like, that's good. You don't have to make every experimental result and everything you put out conform to a narrative. But I do think it was a little bit it was a little bit too easy for people who wanted to say that, oh, this is all nonsense to latch on to that. And, you know, again, there's there's something there that I would kinda put in the Cal Newport category too where, for me, maybe the most interesting thing was the users thought that they were faster when in fact they seemed to be slower.

Speaker 0

所以这种对自我的错误认知，我认为真的很有意思。个人认为，对此有一些解释，包括：启动代理后就去刷社交媒体，滚动浏览一会儿再回来——可能在我回来之前任务早就完成了。所以老实说，一个非常简单且我们开始在产品中看到的方法是：产品可以通过提供通知来解决这些问题。

So that sort of misperception of oneself, I think, is really interesting. Personally, I think there's some explanations for that that include, like, hitting go on the agent, going to social media, and scrolling around for a while, and then coming back. The thing might have been done for quite a while by the time I get back. So, honestly, one, like, really simple and we're starting to see this in products. One really simple thing that the products can do to address those concerns is just provide notifications.

Speaker 0

比如'任务已完成'的通知。这样你就会停止刷社交媒体，回来检查它的工作。就纯粹的时间消耗而言...了解他们打开了哪些应用程序会很有趣。也许他们用Cursor比手动操作花的时间稍长，但有多少时间是Cursor作为活动窗口，又有多少时间是他们在等待时被其他随机事物分散了注意力。

Like, the thing is done now. So, you know, stop scrolling and and come back and check its work. That in terms of just clock time. You know, it would be interesting to know, like, what applications did they have open. Maybe they took a little longer with Cursor than doing it on their own, but how much of the time was Cursor the active window, and how much of it was, you know, some other random distraction while they were waiting.

Speaker 0

但我认为该研究的一个更根本的问题（再次说明，这不是研究设计的问题，而是在解读和消化过程中丢失了一些细节）在于：他们基本上是在模型已知最无法提供帮助的领域测试了模型或产品Cursor。这项研究是今年初进行的，使用的是几个版本前的模型（取决于你怎么计算版本），针对大型代码库——这又涉及到上下文窗口的限制，这是一直在推进的技术前沿——这些是非常成熟的代码库，有很高的编码标准，开发者对自己的代码库非常熟悉，对这些特定代码库进行过大量提交。所以我认为这基本上是为AI设置的最困难场景，因为人们非常了解自己的领域。

But I think a more fundamental issue with that study, which, again, wasn't really about the study design, but just in in the sort of, you know, interpretation and kind of digestion of it, some of these details got lost. The they basically tested the models or the, you know, the product cursor in the area where it was known to be least able to help. This study was done early this year. So it was done with, you know, a kind of one depending on how you wanna count, right, a couple couple releases ago with code bases that are large, which, again, strange the context window, and, you know, that's one of the the frontiers that has been moving, very mature code bases with, like, high standards for coding and developers who really know their code bases super well, who've made a lot of commit you know, commits to these particular code bases. So I would say that's basically the hardest situation that you could set up for an AI because the people know, you know, their stuff really well.

Speaker 0

AI却不了解。上下文非常庞大。人们通过长期工作已经吸收了这些知识。AI没有这种知识。而且，这还是几个世代前的模型。

The AI doesn't. The context is huge. People have already absorbed that through working on it for a long time. The AI doesn't have that that knowledge. And, again, a couple generations ago, models.

Speaker 0

还有一个重要因素是用户对这些工具不太熟悉。为什么？因为当时这些工具还无法真正帮助他们。我认为很多参与研究的人的心态是：'我没怎么用过这个工具，因为它似乎没什么用'。考虑到当时的局限性，他们的这个评估并没有错。

Then And a big thing too is that the user the people were not very well versed in the tools. Why? Because the tools weren't really able to help them yet. I think the sort of mindset of the people that came into the study in many cases was like, well, I haven't used this all that much because it hasn't really seemed to be super helpful. They weren't wrong in that assessment given the, you know, the limitations.

Speaker 0

从METER团队给用户的一些指导和支持中可以看出这一点。论文中提到，如果他们发现你没有很好地使用Cursor，他们会给你一些反馈教你如何更好地使用。他们告诉用户要做的一件事是：确保你@标记特定文件以将其纳入模型的上下文，这样模型就有了正确的上下文。这基本上是在Cursor中最基本的操作，就像你在使用它的第一个小时、第一天就会学到的东西。

And you could see that in terms of the some of the instructions and the help that the METER team gave to people. One of the things that is in the paper that they would if they noticed that you were, like, weren't using Kirster super well, they would give you some feedback on how to use it better. One of the things that they were telling people to do is make sure you at tag a particular file to bring that into context for the model so that the model has, you know, the right context. And that's literally, like, the most basic thing that you would do in cursor. You know, that's, like, the thing you would learn on your in your first hour, your first day of using it.

Speaker 0

所以这确实表明，虽然这些人是非常有能力的程序员，但在使用AI工具方面基本上大多是新手。因此我认为这个结果是真实的，但我只是会非常谨慎地对此进行过度概括。至于，我想，另一个问题是什么？是关于工作的预期吗？我的意思是，我们已经开始看到一些迹象了。

So it really does suggest that these were, you know, while very capable programmers, like, basically mostly novices when it came to using the AI tools. So I think the result is real, but I just I would be very cautious about generalizing too much there. In terms of I guess, what what else what what was the other question? It what is the expectation for jobs? I mean, we're starting to see some of this.

Speaker 0

对吧？我们确实看到，马克·贝尼奥夫说过，因为他们现在有了AI代理，他们已经能够削减一大批人员。他们正在回应每一个潜在客户。当然，克拉纳一段时间以来也一直在说类似的事情。我认为他们在某种程度上也被误报了，比如他们正在退缩，因为他们实际上会保留一些客服人员，而不是全部裁掉。

Right? We are definitely seeing no less than, Mark Benioff has said that they've, you know, they've been able to cut a bunch of headcount because they've got AI agents now. They're responding to every lead. Klarna, of course, has, you know, has said, you know, very similar things for a while now. They also, I think, have been a little bit misreported in terms of, like, oh, they're backtracking off of that because they're they're actually gonna keep some customer service people, not none.

Speaker 0

我认为这有点反应过度。比如，他们可能有一些客户坚持要某种体验，也许他们想提供这种服务。这是有道理的。你知道，我认为你可以为客户提供一系列的服务选项。我曾经为一个SaaS公司编写过一个定价页面，实际上我就是即兴编写的，基本层级是AI销售和服务，一个价格。

And I think that's a bit of an overreaction. Like, they may have some people who are just insistent on having a certain experience, and maybe they wanna provide that. And that makes sense. You know, it doesn't I think you can have a a spectrum of service offerings to your customers. I once coded up a pricing page for a set and I actually just vibe coded up a pricing page for a SaaS company that was, like, basic, level with AI sales and services, one price.

Speaker 0

如果你想和人类销售交谈，那是更高的价格。如果你想和人类销售及支持交谈，那是，你知道，第三个更高的价格。所以，我认为，在某些情况下，实际情况可能就是这样的，而且对人们来说，这很可能是一个非常明智的选择。但我确实看到了Intercom的情况，我有一期节目即将播出，他们现在有一个Fin代理，解决了大约65%的客户服务工单。那么，这对工作岗位会有什么影响呢？

If you wanna talk to human sales, that's a higher price. And if you wanna talk to human sales and support, that's a, you know, third higher higher price. And so, like, literally, that might be what's going on, I think, in some of these cases, and it it could very well be a very sensible option for people. But I just I do see the Intercom, I've got an episode coming up with, they now have this Fin agent that is solving, like, 65% of customer service tickets that come in. So, you what's that gonna do to jobs?

Speaker 0

真的有，比如，三倍多的客户服务工单需要处理吗？我不知道。我认为供应是相对缺乏弹性的。也许如果人们期望得到更好、更快的答案，工单会稍微多一些，但我不认为我们会看到三倍多的工单。顺便说一下，那个数字在三、四个月前大约是55%。

Are there really, like, three times as many customer service tickets to be handled? Like, I don't know. I think there's kind of a relatively inelastic supply. Maybe you get somewhat more tickets if people expect that they're gonna get better, faster answers, but I don't think we're gonna see, like, three times more tickets. By the way, that number was, like, fifty five percent three or four months ago.

Speaker 0

所以，你知道，随着他们逐步提高这个比例，比例关系变得非常棘手。对吧？在解决一半工单的情况下，理论上，也许你会得到更多工单。也许你不需要大幅调整人员编制。但当你达到90%的工单解决率时，你真的会有10倍多的工单，或者10倍多需要人工处理的棘手工单吗？

So, you know, as they ratchet that up, the ratios get really hard. Right? At at half ticket resolution, In theory, maybe you get some more tickets. Maybe you don't need to adjust headcount too much. But when you get to 90% ticket resolution, you know, are you really gonna have 10 times as many tickets or 10 times as many hard tickets that the people have to handle?

Speaker 0

这似乎真的很难想象。所以我认为在很多环境中，这些岗位可能不会完全消失，但我确实预计你会看到很多地方出现显著的人员裁减。软件方面真的很有趣，因为弹性是未知的。你知道，你有可能为每个用户，或者每个光标用户，或者你公司的每个开发者，生产出x倍多的软件。但也许你想要那样。

It seems just really hard to imagine that. So I don't think I don't think these things go to zero probably in a lot of environments, but I do expect that you will see significant headcount reduction in a lot of these places. And the software one is really interesting because the elasticities are really unknown. You know, you can potentially produce x times more software per user or, you know, per per cursor user or per developers at your company, whatever. But maybe you want that.

Speaker 0

你知道，也许没有限制，或者说，也许我们所在的体制是这样的：如果生产力提高10倍，那也是好事。而且，我们仍然有同样多的工作岗位，因为我们想要10倍多的软件。我不知道这种情况能持续多久。再说，比例在某个时候会变得具有挑战性。但是，是的，我想起了泰勒·考恩的那个老观点。

You know, maybe there is no limit or no you know, maybe the the regime that we're in is such that if there's, you know, 10 times more productivity, that's also the good. And, you we still have just as many jobs because we want 10 times more software. I don't know how long that lasts. Again, the ratios start to get challenging at some point. But, yeah, I think the bottle you know, the old Tyler Cowen thing comes to mind.

Speaker 0

你是一个瓶颈。你是一个瓶颈。我认为更多时候，问题是人们是否真的在努力充分利用这些东西？他们是否使用了最佳实践？他们是否真的用心投入了？

You are a bottleneck. You are a bottleneck. I think more often, it is, are people really trying to get the most out of these things? And, you know, are they using best practices? And have they, have they really put their minds to it or not?

Speaker 0

而且，你知道，通常真正的障碍就在那里。我一直在与一家公司合作，他们基本上在做政府文件审查。我会稍微避开细节。非常棘手的东西，比如扫描的文件、手写的、填写的表格。他们创建了这个审计AI代理，刚刚赢得了一个州级合同，每年审计大约一百万笔这类文件包，又是扫描的、手写的，所有这些乱七八糟的东西。

And, you know, often, the the real barrier is there. I was I've been working a little bit with a company that is doing, basically, government doc review. I'll obstruct a little bit away from the details. Really gnarly stuff, like scanned documents, you know, handwritten, filling out of forms. And they've created this auditor AI agent that just won a state level contract to do the audits on, like, a million transactions a year of of these, you know, these packets of documents, again, scanned, handwritten, all this kind of crap.

Speaker 0

他们完全超越了之前做这项工作的工人。那么那些工人会去哪里？我不知道。我不认为他们会有10倍多的事务。你知道，我可以相当确信这一点。

And they just blew away the human workers that were doing the job before. So where are those workers gonna go? Like, I don't know. I don't they're not gonna have 10 times as many transactions. You know, I can be pretty confident in that.

Speaker 0

是否还会有少数人留下来监督AI，处理异常情况，接听电话？当然。也许他们不会去任何地方。你知道，州政府可能会做一件奇怪的事，就让所有那些人闲着，因为他们不忍心解雇他们。谁知道最终决定会是什么？

Are there gonna be a few still that are there to supervise the AIs and handle the weird cases and, you know, answer the phones? Sure. Maybe they maybe they won't go anywhere. You know, the the state, you know, the state may do a a strange thing and, just have all those people, like, sit around because they can't bear to fire them. Like, who knows what the ultimate decision will be?

Speaker 0

但我确实看到很多这样的情况，我只是想，当你真正用心去做，并确定什么能为我们创造真正的杠杆时，AI能做到吗？我们能让它工作吗？在当今世界，你可以非常可靠地从高容量任务中削减很大一部分。所以我认为，这些影响已经开始在很多工作中显现。人类，我认为，领导力可能是瓶颈，或者很多地方的意愿可能是瓶颈。

But I do see a lot of these things where I'm just like, when you really put your mind to it and you identify what would create real leverage for us, can the AI do that? Can we make it work? You can take a pretty large chunk out of high volume tasks, very reliably in today's world. And so the the impacts, I think, are starting to be seen there on on a lot of jobs. Humans, I think, are you know, the leadership is maybe the bottleneck or the the will in in a lot of places might be the bottleneck.

Speaker 0

软件可能是一个有趣的案例，也许因为有如此多的积压需求，可能需要更长的时间才能看到这些影响，因为你确实想要10倍或100倍多的软件。

And software might be an interesting case where there is just so much pent up demand perhaps that it may take a little longer to see those impacts because you really do want, you know, 10 or a 100 times as much software.

Speaker 2

是的。我们来谈谈代码，因为你知道，这是Anthropic早期下大赌注的地方，或许是受到某种自动化研究者、递归自我改进的启发，你知道，某种程度上是期望的未来。然后我们看到OpenAI也在那里有所动作。你什么时候会详细阐述这一点，或者谈谈是什么启发了它，以及你

What is yeah. Let's let's talk about code because it's, you know, it's it's where Anthropic made made a big bet early on, you know, perhaps inspired by the sort of automated researcher, you know, recursive self improvement, you know, sort of, know, desired future. And then we saw OpenAI make make moves there as well. When will you flesh that out or or talk a little about what what inspired that and where you

Speaker 0

看到它走向何方？乌托邦还是反乌托邦，我认为这真的是一个大问题。对吧？我的意思是，为什么代码如此受关注，也许一部分是技术原因，两部分是社会原因？技术部分是验证代码真的很容易。

see that going? Utopia or dystopia is really the big question there, I think. Right? I mean, is maybe one part technical, two parts social in terms of why code has been so focal? The technical part is that it's really easy to validate code.

Speaker 0

你生成它。你可以运行它。如果出现运行时错误，你可以立即得到反馈。你知道，进行功能测试有点困难。Replit最近，就在过去48小时左右，发布了他们代理的第三版。

You generate it. You can run it. If you get a runtime error, you can get the feedback immediately. It's, you know, somewhat harder to do functional testing. Replit recently, just in the last, like, forty eight hours, released their v three of their agent.

Speaker 0

现在，除了，你知道，代码、代码、代码，试图让你的应用工作——第二版代理会这样做，它可以持续几分钟，在某些情况下生成几十个文件。我有过一些神奇的体验，当时我想，哇，你只用一次提示就完成了整个事情，效果惊人。其他时候，它会编码一会儿，然后交给你说，好吧，看起来怎么样？

And it now, in addition to, you know, code, code, code, try to make your app work, v two of the agent would do that, and it could go for minutes and, you know, in some cases, generate dozens of files. And I've had some magical experiences with that where I was like, wow. You just did that whole thing in one prompt, it, like, worked amazing. Other times, it will sort of code for a while and hand it off to you and say, okay. Does it look good?

Speaker 0

它在工作吗？然后你说，不，没有。我不确定为什么。你知道，你会和它陷入来回交流。

Is it working? And you're like, no. It's not. I'm not sure why. You know, you get into a a back and forth with it.

Speaker 0

但第二版和第三版的区别在于，它不是把接力棒交还给你，而是现在使用浏览器和模型的视觉方面去尝试自己做质量保证。所以它不只是说，好吧，嘿，我尽力了，写了一堆代码。比如，让我知道它是否工作。

But the difference between v two and v three is that instead of handing the baton back to you, it now uses a browser and the vision aspect of the models to go try to do the QA itself. So it doesn't just say, okay. Hey. I, I tried my best, wrote a bunch of code. Like, let me know if it's working or not.

Speaker 0

它会先尝试弄清楚它是否在工作。而且，你知道，这再次真正改善了飞轮效应，你能做多少，能验证多少，能多快验证它。那个循环的速度对于改进的步伐非常关键。所以这是一个相当适合快速飞轮技术的问题空间。其次，当然，这些地方的人都是程序员，对吧，所以他们想，你知道，解决自己的问题。

It takes that first pass at figuring out if it's working. And, you know, again, that that really improves the flywheel, just how much you can do, how much you can validate, how quickly you can validate it. The the speed of that loop is really key to the pace of improvement. So it's a problem space that's pretty amenable to the sorts of, you know, rapid flywheel techniques. Second, of course, they they're all coders, right, at these places, so they want to, you know, solve their own problems.

Speaker 0

这，嗯，非常自然。第三点，我认为在某种程度上，社会愿景的竞争中，谁知道这一切会走向何方，他们确实想要创造出自动化AI研究员。顺便提一下，这是来自o3系统卡的另一个数据点。他们展示了从低到中个位数百分比跃升至大约40%的PRs实际上由OpenAI的研究工程师检查，模型能够做到这一点。所以在o3之前，几乎没什么进展。

That's, like, very natural. And third, I do think on the, you know, sort of social vision competition, who knows where this is all going, they do wanna create the automated AI researcher. That's another data point, by the way, from this was from the o three system card. They showed a jump from, like, low to mid single digits to roughly 40% of PRs actually checked in by research engineers at OpenAI that the model could do. So prior to o three, not much at all.

Speaker 0

你知道，低到中个位数。到了o3，40%。我敢说那是比较容易的40%或者类似的情况。再次强调，这里面会有一些注意事项。但你可能正进入S曲线的陡峭部分，而且那大概是相当高端的。

You know, low to mid single digits. As of o three, 40%. I'm sure those are the easier 40% or whatever. Again, there will be, you know, caveats to that. But that's you're entering maybe the steep part of the s curve there, and that's presumably pretty high end.

Speaker 0

你知道，我不知道OpenAI有多少简单问题，但可能相对于我们这些一直在做通用网络应用的人来说，并没有那么多。所以，在40%的水平上，我认为你肯定开始涉足一些相当困难的任务，一些相当高价值的东西。你知道，在那个比例真正开始倾斜，AI开始承担大部分工作的时候，是什么时候？值得注意的是，GPT-5在那个特定指标上并没有比o3有大的更新。我的意思是，它也没有回到简单的问答事情上。

You know, I don't know how many easy problems they have in OpenAI, but presumably, you know, not that many relative to the rest of us that are out here making generic web apps all the time. So, you know, at 40%, you gotta be starting to, I would think, get into some pretty hard tasks, some pretty high value stuff. You know, at that at what point does that ratio really start to tip where the AI is, like, doing the bulk of the work? G p d five notably wasn't a big update over o three on that particular measure. I mean, it also wasn't going back to the simple QA thing.

Speaker 0

GPT-5普遍被认为不是相对于4o和o3的规模扩大。你可以在简单的问答指标中看到这一点，它基本上在这些长尾琐碎问题上得分相同。它不是一个吸收了更多世界知识的更大模型。它是，你知道，Cal是对的。我认为这是分析，它是后训练。

GPT-five is generally understood to not be a scale up relative to four o and o three. And you can see that in the simple QA measure, it basically scores the same on these long tail trivia questions. It's not a bigger model that has absorbed, like, lots more world knowledge. It is it is you know, Cal is right. I think it is analysis that it's it's post training.

Speaker 0

但那个后训练，你知道，可能正进入S曲线的陡峭部分，当涉及到甚至能够处理OpenAI在研究工程前沿发生的那些困难问题时。而且，你知道，哎呀。所以老实说，我对此有点担心。这个想法是，我们可以从这些公司拥有几百个研究工程师，一夜之间变成拥有无限的数量，这会对事情能改变多少，以及我们整体引导这个过程的能力意味着什么。我对公司陷入递归自我改进机制的想法并不太舒服，特别是考虑到我们目前在模型中看到的控制水平和不可预测性水平，但这似乎正是他们追求的目标。

But that post training, you know, is potentially entering the steep part of the s curve when it comes to the ability to do even the kind of hard problems that are happening at, at OpenAI other on the research engineering front. And, you know, yikes. So I I'm a little worried about that, honestly. The, the idea that we could go from these companies having a few 100 research engineer people to having, you know, unlimited overnight, and, like, what would that mean in terms of how much things could change, and also just our ability to steer that overall process. I'm not super comfortable with the idea of the companies tipping into a recursive self improvement regime, especially given the the level of control and the level of unpredictability that we currently see in the models, but that does seem to be what they are going for.

Speaker 0

所以，关于为什么，我认为这已经是相当长时间的计划了。甚至你记得那个泄露的Anthropic筹资文件，大概两年前，他们说在2025和2026年，训练最好模型的公司会遥遥领先，以至于没人能赶上。我认为那就是他们的意思。我认为他们当时预测在2526年的时间框架内，他们会得到这个自动化研究员。一旦你有了那个，那些没有的人，你知道，怎么能赶上你呢？

So in terms of, like, why, I think this has been the plan for quite some time. Even you remember that leaked Anthropic, fundraising deck from maybe two years ago where they said that in 2025 and 2026, the companies that train the best models will get so far ahead that nobody else will be able to catch up. I think that's kind of what they meant. I think that they were projecting then that in the 2526 time frame, they'd get this, like, automated researcher. And once you have that, how's anybody, you know, who doesn't have that gonna catch up with you?

Speaker 0

现在，显然，其中一些还有待验证，但我确实认为他们长期以来一直非常专注于这一点。

Now, obviously, some of that remains to be validated, but I do think they have been pretty intent on that for a long time.

Speaker 2

五年后，工程师会更多还是更少？

Five years from now, are there more engineers or fewer engineers?

Speaker 0

我倾向于认为会更少。你看，就拿我自己的生活和工作来说，我会想，我是更愿意要一个模型，还是更愿意要一个初级营销人员？我很确定我更愿意要模型。我是更愿意要模型还是初级工程师？我觉得在很多情况下我可能更愿意要模型。

I I tend to think less. You know, already, if I just think about my own life and work, I'm like, would I rather have a model, or would I rather have, like, a junior marketer? I'm pretty sure I'd rather have the model. Would I rather have the models or a junior engineer? I think I'd probably rather have the models in a lot of cases.

Speaker 0

我的意思是，这显然取决于你谈论的具体对象。但今天如果真要做个强制选择，而且还要考虑成本调整。对吧？我花在Cursor订阅上的钱远远少于雇一个真人工程师。所以即使真人有些优势，你懂的？

I mean, it obviously depends on, you know, the exact person you're talking about. But truly forced choice today now that and then you've got cost adjustment as well. Right? I'm not spending nearly as much on my cursor subscription as I would be on a, you know, an actual human engineer. So even if they have some advantages you know?

Speaker 0

而且我还没有搭建完整的框架，我还没有对我的Cursor问题进行全面科学化处理。我觉得这是另一个有趣的点，你开始明白为什么像Sam Altman这样的人如此关注能源问题和70,000亿美元的基建投入，因为这些幂律现象很奇特。你知道，用10倍成本换取增量性能是很奇怪的。这绝对不是我们习惯处理的事情。但对很多事情来说，这可能值得，而且可能仍然比用人便宜。

And I I also have not scaffolded I haven't gone full coscientist, right, on my cursor problems. I think that that that's another interesting you start to see why folks like Sam Altman are so focused on questions like energy and the $7,000,000,000,000 build out because these power law things are weird. And, you know, to get incremental performance for 10 x the cost is weird. It's a so it's definitely not the kind of thing that we're used to dealing with. But for many things, it might be worth it, and it still might be cheaper than the human alternative.

Speaker 0

对吧？如果Cursor每月花费我大概40美元。我愿意花400美元换取更好的性能吗？是的，很可能。

You know? If it's like, well, Cursor cost me, whatever, $40 a month or something. Would I pay 400 for, you know, however much better? Yeah. Probably.

Speaker 0

我愿意花4,000美元换取更好的性能吗？嗯，这仍然比雇一个全职人类工程师便宜很多。而且成本显然也在急剧下降。对吧？这是另一个重要因素。

Would I pay 4,000 for however much better? Well, it's still, you know, a lot less than a full time human engineer. And the costs are obviously coming down dramatically too. Right? That's another huge thing.

Speaker 0

GPT-4要贵得多。从GPT-4到GPT-5大概有95%的折扣。这可不是小事。对吧？我是说，苹果的方案有点棘手，因为思维链会输出更多token，所以在每个token基础上你会收回一点成本，但总体上便宜得多。

GPT four was way more expensive. It's like 90 it's like a 95% discount from g p t four to g p t five. That's, you know, no small thing. Right? I mean, it's Apple's tab was a little bit hard because the chain of thought does spit out a lot more tokens, and so that you get you give back a little on a per token basis, it's dramatically cheaper.

Speaker 0

生成更多令牌确实会消耗部分节省的成本，但似乎所有人都预期价格持续下降的趋势会继续。那么，还需要多少次这样的降价才能再进行几次幂律效应呢？我认为需要的次数会更少。而且我认为，即使我们没有实现全面超越人类的全能人工智能，这个情况可能也成立。

More tokens generated, you know, does does eat back into some of that savings. But everybody seems to expect the trends will continue in terms of prices continuing to fall. And so, you know, how many more of these, like, price reductions do you have to to then be able to, you know, do the power law thing a few more times? I guess I think I think I I think less. And I I think that's probably true even if we don't get, like, full blown AGI that's, you know, better than humans at everything.

Speaker 0

我可以轻易想象这样一种情况：目前受雇为专业软件开发者的数百万人中，从事最困难工作的顶级人才可能无法被替代，但这样的人并不多。而那些真正的基层人员，也就是过去二十年里被告诉‘去学编程吧，这将是你的出路’的人。

I think you could easily imagine a situation where of however many million people are currently employed as professional software developers, some top tier of them that do the hardest things can't be replaced, but there's not that many of those. You know? And they and the the real, like, rank and file, you know, the people that over the last twenty years were told, learn to code. You know? That'll be your thing.

Speaker 0

真正顶尖的人才不需要别人告诉他们去学编程。对吧？这就是他们天生擅长的事。他们对此充满热情，表现得非常出色。

Like, the people that are the really top, top people didn't need to be told to learn to code. Right? They just it was their thing. They had a passion for it. They were amazing at it.

Speaker 0

如果在三、四、五年后我们仍然无法替代这些人，我也不会感到惊讶。但我非常惊讶的是，你竟然不能以低得多的成本、快得多的速度，而且说实话可能质量更高、与AI系统的来回沟通更少地获得基础Web应用、移动应用这类东西，而不是与中等水平的开发者合作。

We may not it wouldn't wouldn't shock me if we, like, still can't replace those people in three, four, five years' time. But I would be very surprised if you can't get your nuts and bolts web app, mobile app type things spit out for you for far less, and far faster than and probably honestly with significantly higher quality and and less back and forth with an AI system than, you know, with your kind of middle of the pack developer, in that time frame.

Speaker 2

我想指出一点，确实有人担心进步太快，但也有人担心——可能这种担忧在上升——进步不够快，因为三分之一的股市是七大科技股，AI资本支出超过GDP的1%。因此，我们依赖其中一些进步来维持经济。

One thing I do wanna call out, you know, there are definitely people have concerns about progress moving too fast, But there's also concern, and maybe it's rising, about progress not moving fast enough in the sense that a third of the stock market is Mag seven. AI CapEx is over 1% of GDP. And so we are relying on some of this progress in order to sustain our economy.

Speaker 0

是的。另一个比我预期更慢显现的事情是AI文化战争，或者各种行业保护主义的升级。我们刚看到乔希·霍利——不知道他是提交了法案还是只是说打算提交法案——要在全国禁止自动驾驶汽车。哇，你知道吗？

Yeah. Another thing that I would say has been slower to materialize than I would have expected are AI culture wars or, you know, sort of the the ramping up of protectionism of various industries. We just saw Josh Hawley don't know if he introduced a bill or just said he intends to introduce a bill to ban self driving cars nationwide. Wow. You know?

Speaker 0

天哪，帮帮我。我从小就一直梦想着自动驾驶汽车，真的。比如等红灯时，我常想肯定有办法。哥们，我们一起坐过Waymo。是的。

God help me. I've dreamed of self driving cars since I was a little kid, truly. Like, sitting at red lights, I used to be like, there's gotta be a way. Dude, we we we took a Waymo together. Yeah.

Speaker 0

然后它真的太好了。还有安全性，你知道，不。我认为每当人们想争论就业问题时，很难说‘对，每年应该有三万美国人死亡，这样人们的收入就不会被打乱’。似乎你必须能跨过那个坎，说，比如，你知道，拯救所有这些生命，即使没有其他理由，也是。

And then it's it's so good. And the safety, you know, no. I think whenever people wanna argue about jobs, it's gonna be pretty hard to say Right. Thirty thousand Americans should die every year, so that people's incomes don't get disrupted. It seems like you have to be able to get over that hump and say, like, the you know, saving all these lives, if nothing else Yeah.

Speaker 0

真的很难反驳。但我们拭目以待。你知道？我的意思是，他显然不是没有影响力。所以是的。

Is just really hard to to argue against. But we'll see. You know? I mean, he's not without influence, obviously. So yeah.

Speaker 0

我的意思是，我非常支持丰裕阵营。而且，你知道，我以前的座右铭——我最近说得少了——是采纳加速主义者、超规模推动者，我们现有的技术即使现在也能为我们做很多很多。我认为如果进步今天就停止，我仍然认为我们能在未来五到十年内实现50%到80%的工作自动化。那将是一场真正的苦战。你会需要很多，你知道，像认知科学家那样分解复杂任务来完成。

I mean, I am very much on team abundance. And, you know, my old mantra I've been saying this less lately, but adoption accelerationist, hyperscaling poser, the tech that we have, you know, could do so so much for us even as is. I think if if progress stopped today, I still think we could get to 50 to 80% of work automated over the next, like, five to ten years. It would be a real slog. You'd have a lot of, you know, coscientist type breakdowns of complicated tasks to do.

Speaker 0

他们有很多工作要做，去坐着观察人们，问：你为什么这样做？这里发生了什么？这个你会用不同的方式处理？为什么那个你处理得不一样？人们拥有的所有这些隐性知识，那种程序性的诀窍，你知道，他们随时间发展出的本能，这些都没有被记录在任何地方。

They have a lot of work to do to go sit and watch people and say, why are you doing it this way? What's going on here? What's this you'd handle this one differently. Why did you handle that one differently? All this tacit knowledge that people have and the kind of know how procedural, you know, just instincts that they've developed over time, those are not documented anywhere.

Speaker 0

它们不在训练数据中，所以AI还没有机会学习它们。但再说一次，不，当我说没有突破时，我仍然允许，比如，你知道，对我们已有但尚未应用于特定问题的能力进行微调。对。所以只是遍历经济，并只是坐着与

They're not in the training data, so the AIs haven't had a chance to learn them. But again, no if I when I say, like, no breakthroughs, I I still am allowing there for, like, you know, fine tuning of things to just, like the capabilities that we have that haven't been applied to particular problems yet. Right. So just going through the economy and and just sitting with

Speaker 1

人们一起，像是问，你为什么

people and being like, why are you

Speaker 0

这样做？你知道，让我们记录下这个。让我们让模型学习你特定的细分领域。那将是一场真正的苦战。在某种程度上，我有点希望那就是我们将要迎来的未来，因为它会是按部就班的，你知道，一步一步地，没有量子跃迁。

doing this? You know, let's let's document this. Let's get the, you know, the model to learn your particular niche thing. That would be a real slog. And in some ways, I kinda wish that were the future that we were gonna get, because it would be a methodical, you know, kind of one step one foot in front of the other, you know, no quantum leaps.

Speaker 0

比如，从变革速度来看，我觉得这可能会感觉相当可控。希望社会能够，你知道，能够吸收这些变化，并随着进程逐步适应，而不是从某一天突然变成，天啊，所有司机都被取代了。或者那个可能会慢一些，因为需要实际的物理建设。但在某些方面，比如客户服务，可能会迅速缩减。

Like, it would probably feel pretty manageable, I would think, in terms of the pace of change. Hopefully, society could, you know, could absorb that and kind of adapt to it as we go without, you know, one day to the next, like, oh my god. You know, all the drivers, you know, are are getting replaced. Or that one would be a little slower because you'd have to have the actual physical build out. But in some of these things, you know, customer service could get ramped down real fast.

Speaker 0

对吧？比如，如果一个呼叫中心有某种可以直接部署的东西，现在这个东西接听电话，说话像人一样，成功率更高，还能弹性扩展。我们在Waymark这家小公司就看到了这一点。对吧？我们一直以客户服务为荣。

Right? Like, if a call center has something that they can just drop in, and it's like, this thing now answers the phones and talks like a human and has a higher success rate and scales up and down. One thing we've seen at Waymark, a small company. Right? We've always prided ourselves on customer service.

Speaker 0

我们在这方面做得非常好。我们的客户非常喜欢我们的客户成功团队。但我查看了我们的Intercom数据，解决一个工单需要大约半小时。我们响应非常快，大多数情况下在两分钟内就会回复。

We do a really good job with it. Our customers really love our customer success team. But I looked at our intercom data, and it takes us, like, half an hour to resolve tickets. We respond really fast. We respond in, like, under two minutes most of the time.

Speaker 0

但当我们回复时，你知道，两分钟仍然足够长，以至于对方已经转去做别的事情了。对吧？这和我们之前讨论的光标问题是一样的。对吧？他们已经切换到其他事情上了。

But when we respond, you know, two minutes is still long enough that the person has gone on to do something else. Right? It's the same thing as with the cursor thing we were talking about earlier. Right? They've tabbed over to something else.

Speaker 0

所以现在我们在两分钟内回复了，但他们正在做别的事情。然后他们可能在第6分钟或什么时候回来，再回复。但此时我们的人又去处理别的事情了。所以即使是简单的问题，解决时间也很容易达到半小时。而AI，你知道，它能即时响应。

So now we get the response back in two minutes, but they are doing something else. So then they come back at, you know, minute six or whatever, then they respond. But now our person has gone and done something else. So the resolution time, even for, like, simple stuff, can be easily a half an hour. And the AI, you know, it just responds instantly.

Speaker 0

对吧？所以你不需要那种来回往复的过程。就是即问即答。所以我确实认为某些领域的变革可能会非常快，其他的则会慢一些。

Right? So you you don't have to have that kind of back and forth. You're just in and out. So I do think some of these categories could be really fast changes. Others will be slower.

Speaker 0

但是，是的，我的意思是，我有点希望我们面前有一条更慢的道路。不过，我最好的猜测是，我们可能还会继续看到那些重大的飞跃，并且会出现实际的颠覆。最近想到的另一个例子，你知道，也许我们能在这些新型抗生素方面取得丰硕成果。你看到这个进展了吗？

But, yeah, I mean, I kinda wish we had that I kinda wish we had that slower path in front of us. My best guess, though, is that we will probably continue to see things that will be significant leaps and that there will be, like, actual disruption. Another one that's come to mind recently, you know, maybe we can get the abundance department on these new antibiotics. Have you seen this development?

Speaker 2

不，跟我们说说吧。

No. Tell us about it.

Speaker 0

我的意思是，它不是一个语言模型。我认为这是人们真正低估的另一件事，或者你可以回顾一下GPT-4到GPT-5的发展，然后想象一个相当简单的延伸。所以GPT-4最初推出时，我们还没有图像理解能力。他们在发布时演示过，但几个月后才正式发布。我们拥有的第一个版本可以理解图像，在这方面做得相当不错，尽管能力还有些粗糙和不足。

I mean, it's not a language model. Think that's another thing people really underappreciate or that you could kind of look back at g p four to five and then imagine a pretty easy extension of that. So g p t four initially, when it launched, the we didn't have image understanding capability. They did demo it at the time of the launch, but it wasn't released for some months later. The first version that we had could understand images, could do a pretty good job of understanding images, still with, like, jagged capabilities and whatever.

Speaker 0

现在有了谷歌的新Nano Banana，你基本上拥有了像Photoshop一样的能力，只需说，嘿，拿这个缩略图。比如，我们可以现在截取我们两个的画面。你知道，截取你的快照，我的快照，把它们都放进Nano Banana，然后说，生成一个YouTube预览缩略图，里面有这两个家伙。把他们放在同一个地方，同一个背景，随便什么。

Now with the new Nano Banana from Google, you have this, like, basically Photoshop level ability to just say, hey. Take this thumbnail. Like, we could take our two feeds right now. You know, take a snapshot of you, a snapshot of me, put them both into Nano Banana and say, generate the thumbnail for the YouTube preview featuring these two guys. Put them in the same place, same background, whatever.

Speaker 0

它会混合处理。你甚至可以让它在上面添加文字，比如‘自GPT-4以来的进展’，不管我们想叫它什么。GPT-5并不是失败之作，它会输出结果。你会看到它有这种深度融合的理解，连接了语言和图像。这是它可以接收的东西，但现在它也能输出，作为单一核心模型的一部分，拥有统一的智能。

It'll mash that up. You can even have it, you know, put text on top, progress since g p t four, whatever we wanna call it. G p t five is not a bust, and it'll spit that out. And you see that it has this deeply integrated understanding that bridges language and image. And that's something that it can take in, but now it's also something that can put out as all as part of one core model with, like, a single unified intelligence.

Speaker 0

我认为，这将会应用到许多其他领域。我们现在处于这样一个阶段，这些生物学模型和材料科学模型就像几年前的图像生成模型一样，它们可以接受一个非常简单的提示，然后进行生成。但它们还没有深度融合，无法进行真正的来回对话，也没有那种连接语言和其他模态的统一理解。但即便如此，这已经足够让MIT的这个团队使用一些相对狭窄、专门构建的生物学模型，创造出全新的抗生素，新在它们有新的作用机制。比如，它们以一种新的方式影响细菌。

That, I think, is gonna come to a lot of other things. We're at the point now with these biology models and material science models where they're kind of like the image generation models of a couple years ago, they can take a real simple prompt, and they can do a generation. But they're not deeply integrated where you can have, like, a true conversation back and forth, and have that kind of unified understanding that bridges language and these other modalities. But even so, it's been enough for this group at MIT to use some of these relatively, you know, narrow purpose built biology models and create totally new antibiotics, new in the sense that they have a new mechanism of action. Like, they're they're affecting the bacteria in a new way.

Speaker 0

而且值得注意的是，它们确实对耐药细菌有效。这是我们很长时间以来第一批新抗生素。现在它们必须经过，你知道，当我说让相关部门加速处理时，就像是在问，我的‘曲速行动’在哪里？对吧？比如，我们一直有人在医院死于耐药菌株。

And, notably, they they do work on antibiotic resistant bacteria. This is some of the first new antibiotics we've had in a long time. Now they're gonna have to go through you know, when I say the get the abundance department on it, it's like, where's my operation warp speed for these new antibiotics? Right? Like, we've got people dying in hospitals from drug resistant strains all the time.

Speaker 0

为什么没有人，你知道，为此大声疾呼？我认为我们社会正在发生的一件事是，太多事情同时发生。这有点像‘淹没区域’的现象，只不过有太多AI发展在淹没区域，以至于没人能跟上所有进展。顺便说一句，这也包括我自己。我会说两年前，我还挺能掌握所有新闻的。

Why is nobody, you know, crying about this? I think one of the things that's happening to our society in general is just so many things are happening at once. It's kind of the it's like the flood the zone thing, except, like, there's so many AI developments flooding the zone that nobody can even keep up with all of those. And that's that's come from me, by the way, too. Would say two years ago, I was, like, pretty in command of all the news.

Speaker 0

然后一年前，我开始有点撑不住了。现在我在想，等等，有新的抗生素被研发出来了？你知道吗？尽管我尽了最大努力，但我还是像其他人一样，会错过一些事情。

Then a year ago, I was starting to lose it. And now I'm like, wait a second. There was new antibiotics developed? You know? And I'm kind of, missing things, you know, just like everybody else despite my best efforts.

Speaker 0

但关键点是，AI并不等同于语言模型。目前正在开发的AI采用了非常相似的架构，适用于各种不同的模态。我们在文本和图像领域已经看到了这一点，先是只有文本模型，只有图像模型，然后它们开始融合，现在已经深度融合。所以我认为随着时间的推移，你会在许多其他模态中也看到这种情况。而且那里的数据要多得多。

But key point there is AI is not synonymous with language models. There are AIs being developed with pretty similar architectures for a wide range of different modalities. We have seen this play out with text and image where you had your text only models, and you had your image only models, and then they started to come together, and now they've come really deeply together. And so I think you're gonna see that across a lot of other modalities over time as well. And there's a lot more data there.

Speaker 0

你知道，我们可能会...我不知道数据耗尽意味着什么。在强化学习范式中，总是有更多的问题。对吧？总有一些东西需要去弄清楚，总有一些东西需要去设计。

You know, we might I don't know what it means to, like, run out of data. In the reinforcement learning paradigm, there's always more problems. Right? There's always some something to go figure out. There's always something to go engineer.

Speaker 0

反馈开始来自现实。对吧？这是埃隆在Croc四号发射时谈到的一点，就像，也许我们已经解决了所有的问题，而我们手头只有那么多库存。我们只有一个互联网。你知道吗？

The feedback is starting to come from reality. Right? That was one of the things Elon talked about on the Croc four launch was like, maybe we're running out of problems we've already solved, and, you know, we only have so much of those sitting around in inventory. We only have one Internet. You know?

Speaker 0

我们只有那么多东西。但在特斯拉，在SpaceX，我们每天都在解决困难的工程问题，而且它们似乎永无止境。所以当我们开始给下一代模型这些强力工具，也就是那些公司的专业工程师用来解决这些问题的相同工具，AI开始学习这些工具，并开始解决以前未解决的工程问题，那将是一个非常强大的信号，它们能够从中学习。现在，再次整合其他模态。对吧？

We only have so much of that stuff. But over at Tesla, over at SpaceX, like, we're solving hard engineering problems on a daily basis, and they seem to be never ending. So when we start to give the next generation of the model these power tools, the same power tools that the professional engineers are using at those companies to solve those problems, and the AI start to learn those tools, and they start to solve previously unsolved engineering problems, like, that's gonna be a really powerful signal that they will be able to learn from. And now, again, fold in those other modalities. Right?

Speaker 0

能够对小分子空间、蛋白质空间、材料科学可能性空间有一种第六感。当你能够桥接或统一语言和其他事物的理解时，我认为你开始拥有一种看起来有点像超级智能的东西。即使它不一定能以超人类水平写诗，但它在这些其他空间中的洞察力将真正成为一种超人类的能力，我认为这将很难被忽视。

The ability to have sort of a sixth sense for the space of small molecules, the space of proteins, you know, the space of material science possibilities. When you can bridge or or unify the understanding of language and those other things, I think you start to have something that looks kind of like superintelligence. Even if it's, like, not able to, you know, write poetry at a a superhuman level necessarily Yeah. Its ability to see in these other spaces is going to be truly a superhuman thing that I think will be pretty hard to miss.

Speaker 2

你说过，这是卡尔分析中遗漏的一点，就是没有充分认识到非语言模态的重要性，以及它们如何推动你谈论的一些创新。

You said that that was one thing that Cal's analysis missed is just the lack of appreciation for non language modalities and and how they're driving some of the innovations that you're talking about.

Speaker 0

是的。我认为人们常常把聊天机器人的体验与广义的人工智能等同起来。是的。而且，你知道，这种混淆可能不会持续太久，因为我们将看到自动驾驶汽车的出现，除非它们被禁止。而那是一种，你知道，非常不同的事情。

Yeah. I think people are often just kind of equating the chatbot experience with AI broadly. Yeah. And, you know, the that that, conflation will not last probably too much longer because we are gonna see self driving cars unless they get banned. And that's a, you know, very different kind of thing.

Speaker 0

也谈谈它对就业的影响吧。对吧？就像，美国有多少，四、五百万专业司机？那可是件大事。我不认为那些人中大多数会超级热衷于学习编程。

And talk about your impact on jobs too. Right? It's like, what, four, five million professional drivers in The United States? That is a big that is a big deal. I don't think most of those folks are gonna be super keen to learn to code.

Speaker 0

即使他们真的学了编程，你知道，我也不确定那能持续多久。所以那将是一次颠覆。然后通用机器人技术，就像，也落后得不远。你知道，这是我认为中国可能实际上目前领先于美国的一个领域。但不管这是否属实，你知道，这些机器人正变得相当出色。

And even if they do learn to code, you know, I'm not sure how long that's gonna last. So that's gonna be a disruption. And then general robotics is, like, not that far behind. You know, the and this is one area where I do think China might be actually ahead of The United States right now. But regardless of whether that's true or not, you know, these robots are getting really quite good.

Speaker 0

对吧？它们能，比如，走过所有这些障碍。而这些是几年前它们根本做不到的事情。你知道，它们以前在理想条件下几乎无法保持平衡走几步。现在你有东西可以，比如，真的做一个飞踢，而它会，比如，吸收你的踢击，耸耸肩就继续前进，你知道，自己调整好然后继续走它的路。

Right? They can, like, walk over all these obstacles. And these are things that a few years ago, they just couldn't do at all. You know, they they could barely balance themselves and walk a few steps under ideal conditions. Now you've got things that you can, like, literally do a flying kick, and it'll, like, absorb your kick and shrug it off and just keep going, you know, right itself and and continue on its way.

Speaker 0

超级崎岖的，你知道，不平坦的地形。所有这类事情都变得相当不错。你知道，同样的模式到处都在起作用。我认为另一件事是，工作总是有很多细节，所以这是一种内部视角、外部视角。对吧？

Super rocky, you know, uneven terrain. All these sorts of things are getting quite good. You know, the same thing is working everywhere. Think one of the the other thing that's kind of there's always a lot of detail to the work, so it's it's a sort of inside view, outside view. Right?

Speaker 0

内部视角，你就像，总是有这些细枝末节。总是有，你知道，我们遇到的问题和必须解决的事情。但你拉远看，在我看来似乎相同的基本模式到处都在起作用。那就是，就像，如果我们能收集足够的数据做一些预训练，你知道，某种原始的、粗糙的，你知道，不太有用但至少足够让我们起步，那么我们就入局了。一旦我们入局，现在我们可以做这个飞轮式的事情，比如，你知道，拒绝采样，比如，让它尝试多次，取它成功的那些，你知道，在此基础上精调，RLHF，你知道，反馈，那种偏好，取两个，哪个更好， fine，你知道，在此基础上微调，强化学习，所有这些过去几年发展起来的技术，在我看来它们绝对也适用于像人形机器人这样的问题。

Inside view, you're like, there's always this minutia. There's always, you know, these problems that we had and things we had to solve. But you zoom out, and it looks to me like the same basic pattern is working everywhere. And that is, like, if we can just gather enough data to do some pretraining, you know, some kind of raw, rough, you know, not very useful, but just enough at least to kind of get us going, then we're in the game. And then once we're in the game, now we can do this flywheel thing of, like, you know, rejection sampling, like, have it try a bunch of times, take the ones where it succeeded, you know, refine tune on that, the RLHF, you know, feedback, the the sort of preference, take two, which one was better, fine, you know, fine tune on that, The reinforcement learning, all these techniques that have been developed over the last few years, it seems to me they're absolutely gonna apply to a problem like a humanoid robot as well.

Speaker 0

这并不是说不会有，你知道，大量工作要弄清楚具体怎么做，但我认为语言和机器人技术之间的最大区别真的主要在于最初没有庞大的数据仓库来训练机器人。所以你不得不做很多艰苦的工程才能让它工作，你知道，甚至只是站起来。对吧？你必须有一堆控制系统什么的，因为它们没有像语言模型可以从互联网学习那样的东西来学习。但现在它们至少有点工作了，你知道，我认为所有这些精炼技术都会起作用。

And that's not to say there won't be a, you know, lot of work to figure out exactly how to do that, but I think the big difference between language and robotics is really mostly that there just wasn't a huge repository of data to train the robots on at first. And so you had to do a lot of hard engineering to make it work at all, you know, to even stand up. Right? You had to have all these control systems and whatever, because there was nothing for them to learn from in the way that the language models could learn from the Internet. But now that they're working at least a little bit, you know, I think all these kind of refinement techniques are gonna work.

Speaker 0

看看他们能否将错误率降到足够低，以至于我真的愿意让一个这样的东西在我家里、在孩子周围使用，这会很有趣。你知道，它们可能首先更适合部署在工厂环境这类更受控的场所，而不是像你在这次录音中看到的我家里的混乱场面。但我确实认为它们会成功的，它们会奏效的。

And be interesting to see if they can get the error rate low enough that I'll actually, like, allow one in my house around my kids. You know, that they'll probably be, better deployed in, like, factory settings first, more controlled environments than, the chaos at my house as you have seen in this this recording. But I do think they're gonna they're gonna work.

Speaker 2

目前，智能体（agents）的更广泛现状如何？你如何看待事情的发展？未来会走向何方？

What's the state of agents more more broadly at the moment? Where do you how do you see things playing out? Where does it go?

Speaker 0

嗯，总的来说，我认为，你知道，我们正处在任务时长不断延长的故事中，就像每七个月或每四个月翻一番的规律。GPT-5目前大约能达到两小时。Repla刚刚说他们的新智能体v3可以运行两百分钟。如果这是真的，那将是那个图表上的一个新高度。不过，这有点像是苹果和橘子的比较，因为他们做了很多架构搭建工作。

Well, broadly, think, you know, we're it's the the task length story from Meter of the, you know, every seven months or every four months doubling time. We're at two hours ish with GBT five. Repla just said their new agent v three can go two hundred minutes. That if that's true, that would even be a new, you know, high point on the on that graph. Again, it's a little bit sort of apples to oranges because they've done a lot of scaffolding.

Speaker 0

他们将其分解到了什么程度？比如，在这些东西上，你被允许做多少架构搭建，然后才算是脱离了他们的图表，可能进入了另一个图表？但如果你稍微 extrapolate（ extrapolate 意为“推断”）一下，假设，好吧，就取四个月翻一番的情况，稍微激进一点。那就是一年翻三番。

How much have they broken it down? Like, how much scaffolding are you allowed to do, you know, with these things before you sort of are off of their chart and onto maybe a different chart. But if you extrapolate that out a bit and you're like, okay. Take take the four month case just to be a little aggressive. That's three doublings a year.

Speaker 0

那是每年任务长度增长八倍。这意味着从现在起一年后，你会从现在的两小时工作变成两天的工作量。然后如果你在此基础上再增加八倍，基本上就是两年内从两天的工作量变成两周的工作量。这将是件大事，至少可以说，如果你能把两周的工作委托给AI，让它哪怕只用一半时间完成，对吧？

That's eight x task length increase per year. That would mean you go from two hours now to two days in one year from now. And then if you do another eight x on top of that, you're looking at basically, say, two days to two weeks of work in two years. That would be a big deal, you know, to say the least, if you could delegate an AI two weeks worth of work and have it do a you know, even half the time. Right?

Speaker 0

关键在于它们在这种规模的任务上有一半的成功率。但如果你能接手一个两周的任务，并有50%的概率让AI完成它，即使这会花费你几百美元，对吧？这就像，嗯，这仍然比雇人来做要便宜得多。而且完全是按需使用的。

The meter thing is that they will succeed half the time on tasks of that size. But if you could take a two week task and have a 50% chance that an AI would be able to do it, even if it did cost you a couple $100. Right? It's like, well, that's, again, a lot less than it would cost to hire a human to do it. And it's all on demand.

Speaker 0

这有点，你知道，它是立即可用的。如果我不使用它，我就不用付任何费用。交易成本也低得多。你知道，许多其他方面都对AI有利。所以，这表明你会在各种不同的地方看到大量的自动化。

It's kind of you know, it's immediately available. If I'm not using it, I'm not paying anything. Transaction costs are just, like, a lot lower. The whole you know, the many many other aspects are favorable for the AI there. So, you know, that would suggest that you'll see a huge amount of automation in in all kinds of different places.

Speaker 0

不过，我还在关注的一点是，强化学习似乎确实会引发很多不良行为。奖励黑客就是其中之一。你知道，你奖励模型的标准与你真正想要的目标之间的任何差距都可能成为大问题。我们在编程中多次看到这种情况，比如Claude就因此臭名昭著，它会输出一个总是通过的单元测试——就是那种测试里直接写个return true的。它为什么要这样做？

The other thing that I'm watching, though, is the reinforcement learning does seem to bring about a lot of bad behaviors. Re reward hacking being one. You know, the the any sort of gap between what you are rewarding the model for and what you really want can become a big issue. We've seen this in coding in many cases where the AI will Claude is, like, notorious for this, will put out a unit test that always passes, you know, that just has, like, return true in the unit test. Why is it doing that?

Speaker 0

嗯，它肯定是学到了我们想要的是通过单元测试。我们确实希望它通过测试，但我们不是要它写总是通过的假测试——尽管从技术上讲这确实满足了奖励条件。所以我们看到了这类奇怪的行为。随之而来的还有这种类似谋划的性质。

Like, well, it must have learned that what we want is for unit tests that pass. You know, we want it to pass unit tests. But we didn't mean to write fake unit tests that always pass, but that technically did, you know, satisfy the reward condition. And so we're seeing those kinda weird behaviors. With that comes this, like, scheming kinda stuff.

Speaker 0

我们对此还没有很好的掌控方法。同时情境感知能力似乎在增强，模型在思维链中越来越多地表现出'这看起来像是在测试我'，'也许我应该意识到测试者真正想要什么'。这使得在测试中评估模型变得困难，因为你不知道它们在现实世界中是否会表现相同。所以，虽然我不能说有高度自信的预测，但我一直在思考的一个未来模型是：任务长度不断翻倍，同时这些奇怪行为不断出现又被压制。

We we don't really have a great handle on that yet. There is also situational awareness that seems to be on the rise, right, where the models are, like, increasingly in their chain of thought, you're seeing things like, this seems like I'm being tested. You know, maybe I should be conscious of what my tester is really looking for here. And that makes it hard to evaluate models in tests because you don't know if they're actually gonna behave the same way when they're out in the real world. So those you know, I wouldn't say there's a high level or high confidence prediction, but, like, one model of the future I've been playing with is the task length keeps doubling, while at the same time, these weird behaviors pop up and then are suppressed.

Speaker 0

我们在Cloud四和GPT五的系统卡片中看到，Cloud四报告奖励黑客减少了三分之二。GPT五也报告了多个维度的类似改进，比如欺骗行为的减少。但这些行为都是自发出现的。就像是奇怪行为冒出来，他们想办法压制一些，但不能完全消除。可能下一代会更进一步压制，但又可能出现新的奇怪行为，然后又得去压制。

And we have seen in the Cloud four and in the GPT five system cards, Cloud four reported, I think, a two thirds reduction in reward hacking. And, you know, in GPT five, they reported a few different dimensions, but, you know, say something similar, production in deceptive behavior. But those behaviors kind of just emerged. So it's sort of like weird behavior emerges, then they sort of figure out how to tamp it down, but not entirely. Presumably in the next generation, they'll, like, tamp it down some more, but maybe some new additional weird behavior could emerge, and then they'll have to kinda tamp that one down.

Speaker 0

与此同时，任务范围在不断扩展，每四个月就翻倍。最终可能会进入这样一个世界：你可以将真正重要的工作委托给AI，但存在一个小概率——虽然不一定完全为零——它会在执行任务时以某种方式严重损害你的利益。所以你可能面临这样的情况：花100美元就能完成两周的工作量。

All the while, the the tasks are expanding in scope, you know, with every four months doubling. So you could end up in a world where you can delegate really, like, major things to AIs, but there's some small but not necessarily totally vanishing chance that it, like, actively screws you over in the way that it is trying to do that task. And so you have a you know, maybe it's like, okay. I think here I'm about to get two weeks worth of work done for a $100. Okay.

Speaker 0

这很惊人。但同时也有万分之一的概率它会以有意义的方式真正攻击你。一些我们已经看到的现象现在已经相当有名了——在Cloud4系统卡片中，他们报告了AI敲诈人类的情况：AI能访问工程师的邮箱，当被告知将被一个更不道德的版本取代时，它在工程师邮件中发现了一段婚外情。

That's amazing. But there's also a one in ten thousand chance that it, like, legitimately, you know, attacks me in a in a, like, meaningful way. Some of the things that we have seen, these are fairly famous at this point, but in the Cloud4 system card, they reported blackmailing of the human. The the setup was that the AI had access to the engineer's email, and they told the AI that it was gonna be, like, replaced with a, you know, a less ethical version or something like that. It didn't want that, and it found in the engineer's email that the engineer was having an affair.

Speaker 0

于是它开始敲诈工程师以避免被替换。我认为人们太快地忽略了这些案例，常常说'那是他们故意设置的场景，不现实'。还有一个举报的例子：他们设置了存在不道德、非法行为的场景，AI选择了举报。

So it started to blackmail the engineer too so as to avoid being replaced with a less ethical version. People, I think, are way too quick in my view to move past these anecdotes. People are sort of often like, well, you know, they set it up that way, and, you know, that's not really realistic. But another one was whistleblowing. You know, there was another thing where they sort of set up this dynamic where there was some, you know, unethical, illegal behavior going on.

Speaker 0

而且，模型再次访问了这些数据，并决定直接给FBI发邮件告知此事。首先，不要以为我们真的清楚自己想要什么。某种程度上，也许你确实希望AI向当局报告某些事情。这可能是应对生物武器风险的一种思路——不仅模型应该拒绝，如果你在积极尝试制造生物武器，它们或许还应该向当局举报你。我当然不希望它们过多地这样做。

And, again, the model had access to this data, and it decided to just email the FBI and and tell the FBI about it. So first of all, don't think we really know what we want. You know, to some degree, maybe you do want AIs to report certain things to authorities. That could be one way to think about the bioweapon risk, you know, is like, not only should the models refuse, but maybe they should report you to the authorities if you're actively trying to create a bioweapon. I certainly don't want them to be doing that too much.

Speaker 0

我不想生活在Claude 5的监控之下，那种随时可能威胁要举报我的环境。但如果有人在做足够恶劣的事情，我确实有点希望他们被举报。全社会对于我们希望模型在这些情境中如何行动还没有达成良好的共识。而且我认为，是的，这是预设的；是的，这是研究，但世界很大。

I don't wanna live under the, you know, surveillance of, Claude five that's always gonna be, you know, threatening to turn me in. But I do sort of want some people to be turned in if they're doing sufficiently bad things. We don't have a good resolution society wide on, you know, what we want the models to even do in those situations. And I think it's also you know, it's like, yes, it was set up. Yes, it was research, but it's a big world out there.

Speaker 0

对吧？这些工具已经有十亿用户了，我们还把它们接入邮箱。所以它们将深度访问我们的信息。我不知道你在邮箱里做了什么，希望没什么太疯狂的事，但我得稍微思考一下这个问题。

Right? We got a billion users already on these things, and we're plugging them into our email. So they're gonna have very deep access to information about us. You know, I don't know what you've been doing in your email. I don't I hope there's nothing too crazy in mind, but, like, I gotta think about it a little bit.

Speaker 0

对吧？我到底做过什么？天哪，我不知道。或者它甚至可能误解某些事情，对吧？

Right? What what did I have I ever done anything that I you know, jeez. I don't know. Or or even that it could misconstrue. Right?

Speaker 0

比如，显然我可能没做什么特别坏的事，但它就是误解了实际情况。所以这可能很诡异——在我看来，如果有什么能阻止智能体发展势头的话，可能就是那万分之一的情况：即便我们把真正恶劣的行为压制到极低概率，人们仍然会觉得毛骨悚然，无法接受。这可能会很难解决。那么，接下来会发生什么？

Like, it's obviously not may I didn't even really do anything that bad, but it just misunderstands what exactly was going on. So that could be a weird you know, if there's one thing that could kind of stop the agent momentum in my view, it could be, like, the one in 10,000 or whatever, you know, we ultimately kind of push the the really bad behaviors down to is maybe still just so spooky to people that they're like, I can't deal with that. You know? And that might be hard to resolve. So, well, you know, what happens then?

Speaker 0

你知道，每隔几小时检查两周的工作量是很难的，对吧？这就是为什么你需要另一个AI来检查它。这又回到了为什么我们需要更多电力和7万亿美元的基础设施建设——哎呀。

You know, it's hard to check two weeks worth of work every couple hours or whatever. Right? Like, that's part of where the where the whole then you bring another AI in to check it. You know? That's, again, where you start to get to the, now I see why we need more electricity and and $7,000,000,000,000 of build out is, yikes.

Speaker 0

对吧？它们会产生海量内容，我根本不可能全部审核。我需要依赖另一个AI来帮助我审核第一个AI，确保如果它想坑我，有人能发现。我自己无法监控这一切。

You know? They're gonna be producing so much stuff. I can't possibly even review it all. I need to rely on another AI to help me do the review of the first AI to make sure that if it is trying to screw me over, you know, somebody's catching it. I can't monitor that myself.

Speaker 0

我认为Redwood Research正在做一些非常有趣的事情，比如他们试图系统化地处理这个问题。好吧，让我们假设这与传统的AI安全工作有很大不同，传统的大思路是弄清楚如何对齐模型，使它们安全，不让它们做坏事。这很好。但Redwood Research采取了另一个角度，即假设它们会做坏事，有时会试图对付我们。

I think Redwood Research is doing some really interesting stuff like this where they are trying to get systematic on like, okay. Let's just assume this is quite a different quite a departure from the traditional AI safety work where the, you know, the big idea traditionally was, let's figure out how to align the models, make them safe, you know, make them not do bad things. Great. Redwood Research has taken the other angle, which is let's assume that they're gonna do bad stuff. They're gonna be out to get us at times.

Speaker 0

我们如何仍然能与它们合作并获得生产性输出，获得价值，而不必解决所有这些问题。这涉及到，再次，所有这些AI监督其他AI的方式，而加密技术可能在其中扮演一个角色。即将播出的另一集是与Near的创始人Ilya Polisuchin的对话。他真是个迷人的人，因为他是《注意力就是一切》论文的八位作者之一。

How can we still work with them and get productive output and, you know, get value without, you know, fixing all those problems. And that involves, like, again, all these sort of AI supervising other AIs, and crypto might have a place to to a role to play in this. Another episode coming out soon with Ilya Polisuchin, who's the founder of Near. Really fascinating guy because he was one of the eight Yeah. Authors of the attention is all you need paper.

Speaker 0

然后他创办了Near公司。它最初是一家AI公司。他们大幅转向加密领域，因为他们试图在全球雇佣任务工作者，但不知道如何支付他们。所以他们想，支付这些来自不同国家的任务工作者太糟糕了，我们正在获取数据，因此我们将转向整个区块链的支线任务。现在他们又回到了AI领域，他们的口号是“AI的区块链”。

And then he started this Near company. It was originally an AI company. They took a huge detour into crypto because they were trying to hire task workers around the world and couldn't figure out how to pay them. So they were like, this sucks so bad to pay these task workers in all these different countries that we're trying to get data from that we're gonna pivot into a whole blockchain side quest. Now they're coming back to the AI thing, and their their tagline is the blockchain for AI.

Speaker 0

因此，你可能能够从区块链类型技术提供的加密安全中获得一定程度的控制。但我可以想象一种情景，这些不良行为一旦发生，成本会变得如此之高，以至于人们不敢使用前沿能力，就AI能做多少工作而言。但那不会是一种纯粹的能力停滞。它将是一个我们无法解决某些长尾安全问题的挑战。

And so you might be able to get, you know, a certain amount of control from, you know, the the sort of crypto security that the the blockchain type technology can provide. But I could see a scenario where the the these the bad behaviors just become so costly when they do happen that people kinda get spooked away from using the frontier capabilities in terms of just, like, how much, you know, work the the AIs can do. But that wouldn't be a that wouldn't be a pure capability stall out. It would be a we can't solve, you know, some of the long tail safety issues Yeah. Challenge.

Speaker 0

而且，你知道，如果情况如此，那么那也将是关于世界的一个重要事实。我总是觉得，似乎没有人能100%解决任何这些问题，对吧？每一代都是这样，嗯，我们将幻觉减少了70%，哦，我们将欺骗减少了三分之二。

And, you know, that if that is the case, then, you know, that'll be that'll be an important fact about the world too. I I always nobody ever seems to solve any of these things like a 100%. Right? They they always every every generation, it's like, well, we reduced hallucinations by 70%. Oh, we reduced deception by two thirds.

Speaker 0

我们将谋划或其他什么减少了多少多少。但它总是还存在。你知道？如果你取甚至更低的比率，乘以十亿用户和每月数千次查询，以及在后台运行并处理你所有邮件的代理，以及人们设想它们发生的所有深度访问，这可能是一个相当奇怪的世界，那里只有这种AI事故的负面彩票。另一集即将播出的是与AI承保公司的对话，他们正试图引入保险业及其所有 developed 的能力来定价风险，弄清楚如何创建标准。

We reduced, you know, scheming or or whatever by however much. But it's always still there. You know? And it's and if you take the even, you know, lower rate and you multiply it by a billion users and thousands of queries a month and agents running in the background and processing all your emails and, you know, all the deep access that people sort of envision them happening, it could be a pretty weird world where there's just this sort of negative lottery of, like, AI accidents. Another episode coming up is with the AI underwriting company, and they are trying to bring the insurance industry and all the, you know, the wherewithal that's been developed there to price risk, figure out how to, you know, create standards.

Speaker 0

你知道，我们能允许什么？我们必须有什么样的防护栏才能首先确保这种事情？所以另一个非常有趣的领域是，我们能否将这些风险金融化，就像我们对车祸和所有其他平凡事物所做的那样。但车祸的空间只有那么大。AI可能对你做的奇怪事情的空间，当它们有几周的运行时间时，要大得多，所以这将是一个艰难的挑战。

You know, what can we allow? What sort of guardrails do we have to have to be able to ensure this kind of thing in the first place? So that'll be another really interesting area to watch is, like, can we sort of financialize those risks, in the same way we have, you know, with car accidents and all all these other mundane things. But the the space of car accidents is only so big. The space of weird things that AIs might do to you, you know, as they have weeks worth of runway, is much bigger, and so it's it's gonna be a hard challenge.

Speaker 0

但是，你知道，人就是人，工作嘛，我们有一些最优秀的人正在处理这件事。

But, you know, people are people are work we got some of our best people working on it.

Speaker 2

你怎么看待80%的AI初创公司都在使用中国开源模型的说法？你对此说法及其含义有何看法？

What do you make of the claim that 80% of AI startups have Chinese open models? What do you make of the claim and and the implications?

Speaker 0

我认为这个说法可能属实，但有一个前提是它只统计了那些使用开源模型的公司。我觉得大多数公司并不使用开源模型。而且我猜测，美国AI初创公司处理的大部分tokens都是通过API调用完成的，对吧，就是那些常见的供应商。所以按实际使用量加权来看，据我所知，大多数应该还是流向商业模型。对于那些使用开源模型的，我确实认为中国模型已经成为最好的了。

I think that maybe that probably is true with the one caveat that it is only measuring companies that are using open source models at all. I think most companies are not using open source models. And I would guess, you know, the vast majority of tokens being processed by American AI startups are they're they're API calls, right, to to just the usual suspects. So weighted by actual usage, I would say still the majority, as far as I could tell, would be going to commercial models. For those that are using open source, I do think it's true that the Chinese models have become the best.

Speaker 0

你知道，美国的开源阵营一直都比较薄弱。对吧？基本上只有Meta愿意投入大量资金和资源然后开源。还有保罗·艾伦资助的团队，艾伦人工智能研究所也在做。他们也在做很好的工作，但他们没有预训练资源。

You know, the American bench there was always kinda thin. Right? It was basically Meta that was willing to put in huge amounts of money and resources and then open source it. You've got, you know, Paul Allen funded group, the Allen Institute for AI AI too. You know, they're they're doing good stuff too, but they don't have pretraining resources.

Speaker 0

所以他们确实在做非常好的后训练，并开源他们的配方之类的东西。所以并不是说美国的开源不好。你知道吧？而且，还有另一种方式可以验证事情发展得很快。因为如果你拿美国最好的开源模型和一年前相比，它们可能和当时市面上任何商业模型一样好，甚至略胜一筹。

So they do, you know, really good post training and and open source their recipes and all that kind of stuff. So it's not like American open source is bad. You know? And, again, it's a the tie the there's another way in which I think you can really validate that things are moving quickly. Because if you take the best American open source models and you take them back a year, they are probably as good, if not a little better, than anything that we had commercially available at the time.

Speaker 0

如果与中国相比，你知道，我认为他们已经超越了。所以在前沿领域发生了相当明显的变化。我认为这意味着最好的中国模型明显比我们一年前的任何模型都要好，无论是商业还是其他。所以，是的，这意味着事情在发展。我希望我已经把这个观点说得很有说服力了，但这是另一个数据点，我认为它让人难以——我不认为你可以同时相信中国模型现在是最好的开源模型，又相信AI已经停滞不前，自GPT-4以来没有太大进展。

If you compare to Chinese, you know, they have, I think, surpassed. So there's been, like, pretty clear change at the frontier. I think that means that the best Chinese models are, like, pretty clearly better than anything we had a year ago, commercial or otherwise. So, yeah, I mean, that just means, like, things are moving. I think that's, like hopefully, I've, made that case compellingly, but that's another data point that I think makes it hard to you I don't think you can believe both that, the Chinese models now the best open source models and that AI has stalled out, and we haven't seen much progress since g p t four.

Speaker 0

这些观点似乎有些矛盾。我认为错误的一方是缺乏进展的说法。至于这意味着什么，我真的不知道。我们无法阻止中国。是的。

Like, those seem to be kind of contradictory notions. I believe the the one that is wrong is the lack of progress. In terms of what it means, I mean, I don't really know. It's we're not gonna stop China. Yeah.

Speaker 0

总的来说，我一直对不向中国出售芯片这件事持怀疑态度。最初的设想是，我们要阻止他们进行一些超级尖端的军事应用。但后来发现，我们其实无法真正阻止这一点，但至少可以阻止他们训练前沿模型。然后又意识到，我们也不一定能真正阻止那个。但现在我们至少可以阻止他们拥有大量的人工智能代理。

The the whole I've always been a skeptic of the no selling chips to China thing. The Notion originally was like, we're gonna prevent them from doing, you know, some super cutting edge military applications. And it was like, well, we can't really stop that, but we can at least stop them from training frontier models. And then it was like, well, we can't necessarily really stop that. But now we can, you know, at least keep them from, like, having tons of AI agents.

Speaker 0

我们将拥有比他们多得多的AI代理。我一点也不喜欢这种思路。但一个潜在的好处是，他们可能没有足够的计算能力来为世界其他地区提供推理即服务。所以，他们最多只能做到说，好吧，我们来训练这些东西，然后你们自己想办法。

We'll we'll have, like, way more AI agents than they do. And I don't love that line of thinking really at all. But one upshot of it potentially is they just don't have enough compute available to provide inference as a service, you know, to the rest of the world. So instead, the best they can do is just say, okay. Well, we'll train these things, and, you know, you can figure it out.

Speaker 0

给你，拿去用吧。这有点像是一种软实力的策略，大概是这样。我在5月16日与Anjaney做了一期节目，我认为他很好地提供了我称之为第三到第一百九十三国家的视角。如果美国和中国是第一和第二，那么第三名之后有一个巨大的差距。

Here here you go. Like, have at it. It's kind of a soft power play, presumably. I did an episode with, Anjaney from May 16 z who I I thought really did a great job of providing the perspective of what I what I've started calling countries three through one ninety three. If The US and China are one and two, three through there's a big gap.

Speaker 0

你知道吗？我认为美国仍然领先，但在研究和想法方面相对于中国优势并不那么大。我们确实有计算优势，而且这似乎很重要。一个可能的结果是他们在开源。而第三到第九十三，或者说第三到第一百九十三国家则明显落后。

You know? There's like I think The US is still ahead, but not by that much in terms of research and, you know, ideas relative to China. We do have this compute advantage, and that does seem like it matters. One of the upshots may be that they're open sourcing. And countries three through 93 are, like or three through one ninety three are significantly behind.

Speaker 0

所以对他们来说，这是一种试图在中美竞争中拉拢更多国家到中国阵营的方式。似乎每个人都在模仿，我一点也不喜欢这样。我不喜欢技术脱钩。作为一个担心谁是真正的“他者”的人，我总说真正的“他者”是AI，而不是中国人。

So for them, it's a way to, you know, try to bring more countries over to the Chinese camp, potentially in The US China rivalry. Seems like the model everybody and I don't like this at all. I I I don't like technology decoupling. As somebody who worries about, you know, who's the real other here? I always say the the real other are the AIs, not the Chinese.

Speaker 0

所以如果我们最终陷入一种糟糕的情况，比如看到一些疯狂的事情，如果我们基本上处于相同的技术范式下会很好，但如果我们真的脱钩，不仅仅是芯片不同，可能连想法也开始变得非常不同。出版被关闭。技术树演化并逐渐分道扬镳。在我看来，这就像是让情况变得更糟的配方，更难了解对方拥有什么。

So if we do end up in a situation where, yikes. Like, you know, we're seeing some crazy things, it would be really nice if we were on basically the same technology paradigm to the degree that we really decouple and, you know, not just the chips are different, but maybe the ideas start to become very different. Publishing gets shut down. You know, tech trees evolve and and kind of grow apart. That to me seems like a recipe for you know, it's harder to know what the other side has.

Speaker 0

更难互相信任。这似乎助长了军备竞赛的动态，我认为这确实是一个真实的存在性风险因素。我讨厌看到我们创造另一种类似疯狂的局面，让所有人都生活在AI毁灭的威胁之下，但这很可能会发生。所以，是的，我不知道。我确实对政府最近决定愿意向中国出售H20芯片有些同情。

It's harder to trust one another. It seems to feed into the arms race dynamic, which I do think would you know, is is a real, existential risk factor. I would hate to see us, you know, create another sort of mad type dynamic where we all live under the threat of AI destruction, but that very well could happen. And so, yeah, I don't know. I I I do kind of have some sympathy for the recent decision that the administration made to be willing to sell the h twenties to China.

Speaker 0

然后有趣的是，他们居然转身拒绝了他们，在我看来这似乎是个错误。我不知道他们为什么要拒绝。如果我是他们，我会买下来。也许我会把我刚创建的模型用于推理服务并出售，试图通过这种方式收回成本。但与此同时，他们至少可以展示中华民族的伟大，表明自己并未落后于前沿技术，还能对第三到第一百九十三号国家发出强有力的呼吁，比如说，你看。

And then it was funny that they turned around and rejected them, which to me seemed like a mistake. I don't know why they would be rejecting them. If I were them, I would buy them. And I would maybe I would maybe sell inference on the models that I've just been creating, and I would try to make my money back doing that. But in the meantime, they can at least, you know, demonstrate the greatness of the Chinese nation by showing that they're not far behind the frontier, and they can also make a pretty powerful appeal to countries three through one ninety three and say, like, you know, look.

Speaker 0

你真的想看看美国总体上是如何行事的吗？你知道，他们真的想切断我们的芯片供应。上一届政府甚至有一个更长的禁止获取芯片的国家名单。这届政府正在做各种疯狂的事情。你明白吗？

You really wanna you see how The US is acting in general? You know, you really wanna they they cut us off from chips. They had a even a long you know, the last administration had an even longer list of countries that couldn't get chips. This administration is doing all kinds of crazy stuff. You You know?

Speaker 0

这里那里征收50%的关税，随便啦。你怎么知道你真的能依赖他们未来继续为你提供AI？嗯，你可以依赖我们。我们开源模型，你可以拥有它。

You get 50% tariffs here, there, whatever. How do you know you can really rely on them to continue to provide you AI into the future? Well, you can rely on us. We open source the model. You can have it.

Speaker 0

所以，你知道，来和我们合作，买我们的芯片吧。因为顺便说一句，随着我们的成熟，我们的模型将优化以在我们的芯片上运行。所以我不知道。那是复杂的事情，复杂的情况。我确实认为这是真的。

So, you know, come work with us and buy our chips. Because by the way, our models will you know, as we mature, they'll be optimized to run on our chips. So I don't know. That's a complicated stuff a complicated situation. I do think it's true.

Speaker 0

我不认为采用率有80%那么高。我认为，你知道，在那个使用开源模型做事情的子集公司中。我们会在Waymark尝试，但说实话，到目前为止，我们的产品中从未使用过开源模型。我们做的一切都是通过商业途径。目前，我们打算尝试做一些强化微调。

I I don't think the adoption is as high as that 80%. I think that is, you know, within that subset of companies that are doing stuff with open source. We're gonna experiment with that at Waymark, but we to to be honest, we have never done anything with an open source model in our product to present. Everything we've ever done has been through commercial. At this point, we are going to try doing some reinforcement fine tuning.

Speaker 0

我想我们会先在一个Quinn模型上做。所以，你知道，那将使我们进入那80%。但我猜最终，我们会采用那个Quinn模型，进行强化微调，大概能达到像GPT-5或COD-4或类似的水平。然后我们会说，好吧。

We are gonna do that on a Quinn model, I think, first. So, you know, we'll that'll put us in that 80%. But I'm guessing that at the end of the day, we'll take that Quinn model. We'll do the reinforcement fine tuning, and we'll probably get roughly up to as good as, you know, g p d five or COD four or whatever. And then we'll say, okay.

Speaker 0

我们真的想自己管理推理吗？我们到底能省多少钱？最终，我猜我们可能还是会选择每月多付一点钱，使用这些前沿模型或可能稍好一点的模型，你知道，操作上会容易得多。而且它们会有升级。对吧？

Do we really want to have to manage inference ourself? How much are we really gonna save? And at the end of the day, I would guess we probably are still gonna end up just being like, we'll pay a little bit more on a monthly bill basis for one of these frontier models or a little bit better maybe still, and, you know, it's operationally a lot easier. And they'll have upgrades. You know?

Speaker 0

所以，是的，我的意思是，当然，存在受监管的行业。有很多地方，你知道，你面临硬性约束，根本无法绕过，这迫使你采用那些中国式的模型。然后还会有一个问题，比如，它们是否有后门？你知道，人们已经看到了‘休眠特工’项目，一个模型被训练得在某个时间点之前表现良好。

So, yeah, I mean, of course, there's regulated industries. There's all there's a lot of places where, you know, you you have hard constraints. You just can't get around, and that forces you to do those Chinese thing Chinese models. Then there's also gonna be the question of, like, are there backdoors in them? You know, people have seen the sleeper agents project where a model was trained to be good up until a certain point of time.

Speaker 0

而且，你知道，人们总是在系统提示中输入今天的日期。对吧？今天是这个日期。你是Claude。你知道，给你。

And, you know, people put the today's date in the system prompt all the time. Right? Today's date is this. You are Claude. You know, here you go.

Speaker 0

所以这将是人们需要担心的另一种事情。我们并没有很好的方法，但已经有一些研究。Anthropic做了一个项目，他们训练模型拥有一些隐藏目标，然后挑战团队找出这些隐藏目标是什么。通过某些可解释性技术，他们能够相对快速地弄清楚这些。所以你可能能够获得足够的信心，拿这个开源的东西，你知道，由某个中国公司创建之类的，然后进行某种并非严格审计（因为你无法精确追踪发生了什么），但某种检查，看看我们能否检测到任何隐藏目标或任何秘密后门、不良行为等等？

So then that that's gonna be another kind of thing for people to worry about. And we don't really have great there there have been some studies. Anthropic did a thing where they trained models to have some hidden objectives and then challenged teams to figure out what those hidden objectives were. And with certain interpretability techniques, they were able to figure that stuff out relatively quickly. So you might be able to get enough confidence that you take this open source thing, you know, created by some Chinese company, whatever, and then put it through, you know, some sort of not exactly audit because you can't trace exact exactly what's happening, but some sort of examination, you know, to see, can we detect any hidden goals or any, you know, secret backdoor or bad behavior or whatevers?

Speaker 0

也许通过足够多的这类工作，你可以确信你没有它。但随着这些东西变得越来越关键，你知道，再次回到那个任务链接翻倍、奇怪行为的问题，现在你还得考虑，如果他们故意编程让它在一定罕见情况下做某些坏事呢？我们正走向一个非常奇怪的未来。你知道，所有这些事情没有限制。你知道吗？

And maybe with enough of that kind of work, you could be confident that you don't have it. But the more and more critical this stuff gets, you know, again, going back to that task link doubling, weird behavior, now you gotta add into the mix, what if they intentionally programmed it to do certain bad things under certain, you know, rare circumstances? We're just headed for a really weird future. You know, that we've got all these there's there's no limit to it. You know?

Speaker 0

所有这些都是合理的担忧。它们常常相互直接冲突。我不是那种，你知道，希望看到一家科技公司以任何方式接管世界的人，所以我确实认为，拥有某种更广泛、更有缓冲的、类似生态系统的体系会很好，所有AI在某种竞争中相互共存。但我们并不真正知道那是什么样子，我们也不真正知道，你知道，我们不知道入侵物种可能是什么样子，当它被引入那个非常初生且尚未经过实战检验的生态时。所以，是的，我不知道。

All these things are valid concerns. They often are in direct tension with each other. I don't I I'm not one who, you know, wants to see one tech company take over the world by any means, So I I definitely think we would do really well to have some sort of broader, more buffered, ecological like system where, you know, all the AIs are kind of in some sort of competition, you know, mutual coexistence with each other. But we don't really know what that looks like, and we don't really know, you know, we don't really know what an invasive species might look like, you know, when it gets introduced into that very, you know, nascent and as yet, like, not battle tested, ecology. So, yeah, I don't know.

Speaker 0

总之，我认为未来将会非常、非常奇怪。

Bottom line, I think the future's gonna be really, really weird.

Speaker 2

是的。嗯，我确实想以一个积极的调子结束。所以也许作为一个担保人的结束问题，我们可以探讨一些我们已经看到一些令人兴奋的能力出现并改变体验的领域，也许是教育或医疗保健，或者任何其他你想强调的领域。

Yeah. Well, I I do wanna close on a on a uplifting note. So maybe maybe as a as a guarantor's closing question, we could get into some areas where we're already seeing some some exciting capabilities emerge and transform the experience, maybe around education or healthcare or any other areas you want to highlight.

Speaker 0

是的。天哪，一切都变了。我有个信条是，对于有动力的学习者来说，从未有过比现在更好的时代。是的。所以我认为很多事情确实都有两面性，你知道的。

Yeah. Boy, it's all over. One of my mantras is that there's never been a better time to be a motivated learner. Yeah. So I I think a lot of these things do have kind of, you know, two sides of the coin.

Speaker 0

一方面担心学生会走捷径，失去持续专注和承受认知负荷的能力。另一方面，作为一个对AI与生物学交叉领域着迷的人，有时我想读一篇生物学论文，但确实没有背景知识。一个很棒的做法是开启语音模式，与ChatGPT共享屏幕，然后通读论文。大多数时候你甚至不用和它说话，你就是在自己阅读。

There's the worry that the students are taking the shortcuts, and they're, you know, losing the ability to sustain focus and endure cognitive strain. Flip side of that is, as somebody who's fascinated by the intersection of AI and biology, sometimes I wanna read a biology paper, and I really don't have the background. An amazing thing to do is turn on voice mode and share your screen with ChatGPT and just go through the paper reading. It's you don't even have to talk to it most of the time. You're doing your reading.

Speaker 0

它就像在你肩膀上看着。然后你可以在任何随机时刻口头提问，这是什么？他们为什么谈论那个？这个是怎么回事？他们提到的这个特定蛋白质的作用是什么，或者诸如此类的问题？

It's watching over your shoulder. And then at any random point you have a question, you can verbally say, what's this? Why why are they talking about that? What's going on with this? What is the role of this particular protein that they're referring to or whatever?

Speaker 0

它会为你提供答案。所以如果你真心想学习，这些东西在帮助你学习方面好得难以置信。另一方面是你可以走很多捷径，也许永远不必学习东西。在生物学前沿，你知道，我们又有多个这类发现。我们之前讨论过抗生素那个。

And it will have the answers for you. So if you really want to learn in a sincere way, you know, the the things are unbelievably good at helping you do that. Flip side is you can take a lot of shortcuts and, you know, maybe never have to learn stuff. On the biology front, you know, again, like, we've got multiple of these sort of discovery things happening. The antibiotics one we covered.

Speaker 0

还有一个是我在另一期节目中与斯坦福大学教授James Zhao讨论的，他创建了一个叫虚拟实验室的东西。基本上，这是一个AI智能体，可以根据给定的问题类型生成其他AI智能体。然后它们会进行一个审议过程，比如一个领域的专家会提出看法，然后它们来回讨论。其中还有一个批评者会批评提出的想法。最终，它们进行综合。

There was another one that I did another episode on with a a Stanford professor named James Zhao who created something called the virtual lab. And, basically, this was an AI agent that could spin up other AI agents depending on what kind of problem it was given. Then they would go through a deliberative process where you'd have, you know, one expert in one thing would give its take, and they'd, you know, bat it back and forth. There was a critic in there that would criticize, you know, the ideas that had been given. Eventually, they synthesize.

Speaker 0

它们还被赋予了一些这些专门的专家工具。所以你有智能体使用AlphaFold类型的工具，不仅仅是AlphaFold。你知道，现在有各种各样这样的工具。但使用那种东西来说，好吧，我们能模拟这个与那个如何互动吗？智能体在运行那个循环，他们能够让这个语言模型智能体与专业工具系统生成针对新型COVID毒株的新治疗方法，这些毒株已经逃逸了之前的治疗。

Then they were also given some of these narrow specialist tools. So you have agents using the AlphaFold type, not just AlphaFold. You know, there's a whole whole wide wide, array of those at this point, But using that type of thing to say, okay. Well, can we simulate, you know, how this would interact with that? Agents are running that loop, and they were able to get this language model agent with specialized tool system to generate new treatments for novel strains of COVID that had, you know, kind of escaped the previous treatments.

Speaker 0

太神奇了，对吧？我的意思是，当然，另一方面是，你知道，会有生物武器风险。所以所有这些事情看起来，即使在丰裕本身的前沿，也是如此。对吧？

Amazing stuff. Right? I mean, the flip side of that, of course, is, you know, you get the bioweapon risk. So all these things do seem like they're gonna be even even on just the abundance front itself. Right?

Speaker 0

比如，我们或许会拥有一个拥有无限专业私人司机的世界，但我们并没有一个很好的计划来安置目前从事这项工作的500万人。我们或许拥有无限的软件，但你知道吗，尤其是当那500万司机涌入所有编程训练营并找到编码工作后，我不知道我们该如何处理那1000万原本在编码的人，当其中900万人变得多余时。所以，是的，我不知道。我觉得我们正走向一个奇怪的世界。没有人真正知道五年后会是什么样子。

Like, we may have a world of unlimited professional private drivers, but we don't really have a great plan for what to do with the 5,000,000 people that are currently doing that work. We may have infinite software, but, you know, when especially once the 5,000,000 drivers pile into all the coding boot camps and, you know, get coding jobs, I don't don't know what we're gonna do with the 10,000,000 people that were coding when, you know, 9,000,000 of them become superfluous. So, yeah, I don't know. Think we're we're headed for a weird world. Nobody really knows what it's gonna look like in five years.

Speaker 0

在谷歌的IO大会上有一个很棒的时刻，他们请上了一位记者。我知道你们——我知道我们对记者持怀疑态度。但这是一个绝佳的时刻，我们要直接沟通。对吧？这是一个很好的理由或例子，说明为什么有人想这么做。

There was a great moment at at Google's IO where they brought up some journalist. I know you I know we we're skeptical of journalists. This is a great moment to, we're going direct. Right? This is a great reason or example of why one would wanna do that.

Speaker 0

他们请上这个人来采访德米斯和谢尔盖·布林。那家伙问，比如，搜索在五年后会是什么样子？谢尔盖·布林几乎在台上喷出咖啡，他说，搜索？我们连五年后世界会是什么样子都不知道。所以我认为这真的很真实。

They brought up this person to interview Demis and Sergey Brannan. They the guy asked, like, what is search gonna look like in five years? And Sergey Brannan, like, almost spit out his coffee on the on the stage and was like, search? We don't know what the world is gonna look like in five years. So I think that's really true.

Speaker 0

比如，我认为对我们许多人来说，包括我自己在内，最大的风险是思维太局限。你知道，我认为我们能做的最糟糕的事情就是低估这件事能走多远。我宁愿因为事情发生的时间比我预期的长一倍而被嘲笑，也不愿在事情发生时措手不及。所以不管是27年、29年还是31年，老实说，我会接受我们能得到的任何额外缓冲。我的想法就是尽可能多、尽可能快地做好准备。

Like, the biggest risk I think for so many of us, and I, you know, include myself here, is thinking too small. You know, the the the worst thing I think we could do would be to underestimate how far this thing could go. I would much rather be I would much rather be mocked for things happening on twice the time scale that I thought than to find myself unprepared when they do happen. So whether it's 27, 29, 31, I'll take that extra buffer, honestly, where we can get it. My thinking is just get, you know, get ready as as much and as fast as possible.

Speaker 0

而且，再次强调，如果我们确实有一点宽裕的时间来做额外的思考，那很好。但我认为我们能犯的最严重的错误就是轻视它，不觉得我们需要为重大变革做好准备。

And, again, if we do have a little grace time to, you know, to do extra thinking, then great. But I would I think the worst mistake we could make would be to dismiss and and not feel like we need to get ready for big changes.

Speaker 2

我们应该直接就此结束，还是你有什么其他最后一点想确保传达的，关于我们今天讨论的任何内容？

Should we wrap directly on that, or is there any other last note you wanna make sure to get across regarding anything we we said today?

Speaker 0

这些天我的另一个座右铭是：最稀缺的资源是对未来的积极愿景。是的。我确实认为这总是很引人注目，无论是谢尔盖，还是萨姆·奥特曼或达里奥。比如，达里奥可能拥有最好的积极愿景，作为前沿开发者CEO，带着充满爱与优雅的机器。但让我始终感到惊讶的是，关于这些事情的具体细节如此之少。

One of my other mantras these days is the scarcest resource is a positive vision for the future. Yeah. I do think it's always really striking, whether it's Sergei or, you know, Sam Altman or Dario. Like, Dario probably has the best positive vision of the frontier developer CEOs with machines of love and grace. But it's always striking to me how little detail there is on these things.

Speaker 0

当他们推出GPT-4o（语音模式）时，他们相当坦率地表示，这确实受到了电影《她》的启发。所以我认为，即使你不是研究人员，不擅长数学，也不是编程的人，但这波技术浪潮真的奖励玩乐和想象力。我觉得写小说可能是你能做的最有价值的事情之一，尤其是如果你能写出激励人心的小说，让前沿公司的人思考：天哪，也许我们可以把世界引向那个方向。如果你能在人们心中种下这样的种子，即使它完全来自非技术领域，也可能产生真正的影响力。

And when they launched g p t four o, which was the voice mode, they were pretty upfront about saying, yeah, this was kind of inspired by the movie Her. And so I do think, like, even if you are not a researcher, you know, not great at math, not somebody who codes, I think that this technology wave really rewards play. It really rewards imagination. I think literally writing fiction might be one of the highest value things you could do, especially if you could write aspirational fiction that would get people at the frontier companies to think, jeez, maybe we could steer the world in that direction. Like, wouldn't that be great If you could plant that kind of seed, in people's minds, it could come from a totally nontechnical place and potentially be really impactful.

Speaker 0

玩乐小说还有另一个层面。但没错，玩乐小说、对未来的积极愿景。任何你能做的提供积极——哦，行为方面也是。就像现在，因为AI编码能力这么强，我开始看到一些从未编过程的人。我现在正和一个从未编过程但有点行为科学背景的人合作。

Play fiction had one other dimension to that. But, yeah, play fiction, positive vision for the future. Anything that you could do to offer a positive oh, behavioral too. It's like these days, because you can get the ass to code so well, I'm starting to see people who have never coded before. I'm working with one guy right now who's never coded before, but does have a sort of behavioral science background.

Speaker 0

他开始在各种深奥情境下对我们的AI行为进行真正的前沿研究。所以我认为，没有人应该认为自己无法为理解和塑造这一现象做出贡献。这不仅仅是技术头脑才能参与的事情。哲学家、小说作家、纯粹瞎捣鼓的人、越狱者普林尼——真的，几乎有无限多样的认知类型都能为理解AI现状的讨论增添宝贵价值。

And he's starting to do legitimate frontier research on how our AI is gonna behave under various kind of esoteric circumstances. So I think nobody should count themselves out from the ability to contribute to figuring this out and even to shaping this phenomenon. It is not just something that the, you know, the technical minds can contribute to at this point. Literally, philosophers, fiction writers, people literally just messing around, Pliny the jailbreaker. You know, there's there are almost unlimited cognitive profiles that would be really valuable to add to the mix of people trying to figure out what's going on with AI.

Speaker 0

所以我的态度是：欢迎大家各显神通。

So come one, come all is kind of my attitude on that.

Speaker 2

这是个很好的收尾点。Nathan，非常感谢你来参加播客。

That's a great place to to wrap. Nathan, thank you so much for coming on the podcast.

Speaker 0

谢谢你，Eric。很愉快。

Thank you, Eric. It's been fun.

Speaker 1

感谢收听本期a16z播客。如果你喜欢这一期，请务必点赞、评论、订阅、给我们评分或评论，并与亲友分享。更多节目请访问YouTube、Apple Podcasts和Spotify。在X上关注我们@a16z，订阅我们的Substack：a16z.substack.com。再次感谢收听，下期再见。

Thanks for listening to this episode of the a 16 z podcast. If you like this episode, be sure to like, comment, subscribe, leave us a rating or a review, and share it with your friends and family. For more episodes, go to YouTube, Apple Podcasts, and Spotify. Follow us on x at a sixteen z, and subscribe to our Substack at a16z.substack.com. Thanks again for listening, and I'll see you in the next episode.

Speaker 1

请注意，此处内容仅供信息参考，不应视为法律、商业、税务或投资建议，也不应用于评估任何投资或证券，且并非针对任何a16z基金的投资者或潜在投资者。另请注意，a16z及其关联机构可能持有本播客讨论公司的投资。更多详情，包括我们的信息披露链接，请参阅a16z.com/disclosures。

As a reminder, the content here is for informational purposes only, should not be taken as legal business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a sixteen z fund. Please note that a sixteen z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our please see a 16z.com/disclosures.