本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
大家好,欢迎收听《Latent Space》播客。我是Alessio,Decibel的合伙人兼首席技术官,今天和我一起的是Smol AI的创始人swyx。
Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by swyx, founder of Smol AI.
大家好,大家好。我们非常高兴地再次欢迎来自AI2的Nathan Lambert。欢迎你。
Hello, hello. And we're excited to welcome back Nathan Lambert from AI2. Welcome.
谢谢,很高兴来到这里。
Thanks. Fun to be here.
我觉得我还得提一下Interconnects、播客之类的,还有你在AI Engineer世界博览会上的事。过去一年半里你真是做了不少事情。
I feel like I also have to mention Interconnects, the podcast, and you at the AI Engineer World's Fair. You've just done a lot in the last year and a half.
其实没那么多。我还是拒绝了不少邀约的。
Not that many. Still say no to plenty of things. Yeah.
你第一次参加我们的节目是2024年1月你刚加入AI2的时候,后来你发布了OLMo。NeurIPS大会上你又和我们一起做了开放模型的环节?哦不对,是Luca主讲,你协助。最近你又来旧金山参加了AIE大会。首先恭喜你获得最佳演讲者奖。
Your first episode with us was January 2024 when you had just joined AI2, then you released OLMo. Then you joined us again at NeurIPS where you did the open models session? Oh, well, Luca did and you supported. And then you were more recently here in SF for AIE. First of all, I wanted to congratulate you on winning best speaker.
啊,是吗?谢谢。
Oh, yeah? Thank you.
推理赛道的最佳演讲者。给你看看。我被Mochi限制住了。
The reasoning track. Here you go. I'm limited by Mochi.
哇,AI生成的啊。我看起来太禅意了,这张AI生成的画像里我显得特别超脱。
Oh, nice — it's AI-generated. I look too Zen. I look so Zen in this AI-generated one.
我们让赛道主持人拍了你演讲时的照片。后来做成了吉卜力风格。但这张你的眼睛是闭着的。
We had our track host, like, take photos of you while you're speaking. Yeah. And it turned into Ghibli photos. But this one, your eyes were closed.
没关系。好的。
It's fine. Okay.
我们本来想让Mochi(那只会推理的Pomsky)加入我们,但我觉得她现在非常焦虑,非常非常不安。
We were trying to have Mochi, the reasoning Pomsky, join us, but I think she's getting very anxious, very, very restless.
那些太疯狂了,Mochi。
Those are too crazy, Mochi.
非常不安。好的。当然。好的。所以你最近的工作做得真的很出色。
Very restless. Okay. Sure. Okay. So you've been doing really good work.
说实话,我觉得我们想确立的一个点就是Tulu和RLVR。从这里开始合适吗?当然。
And honestly, I think one of the things that we wanted to establish was Tulu and RLVR, I guess. Is that a good place to start? Sure.
这让我们从最近的历程讲起。我们可以先回顾Tulu 2到3的目标是什么,以及它后来如何融入新的叙事。目标是把行业里复杂的后训练配方压缩成可操作的东西,让你可以自己修改并进行后训练,达到真正的最先进水平。与前沿实验室相比,我们的任务量可能更少。
That starts us in the recent journey. I think we can recap the story of what Tulu 2 to 3 was aiming to be, and then how it got folded into the new narrative. Yeah. The goal is to do the work to compress complicated industry post-training recipes into something somewhat tractable that you can modify on your own, and do post-training at what is actually a state-of-the-art level. I think relative to frontier labs, we probably have a smaller set of tasks.
我认为Tulu的后训练评测套件大概有10到15个任务,而我猜OpenAI等公司的后训练可能有数百个评估。增加更多评估意味着更多数据工作和更多混合工作,得确保这些都到位。但在核心评估上,我们大概8月那套基于当时Llama的模型,能够匹配或超越Meta的表现。
I think our post-training suite for Tulu is probably 10 to 15 tasks, but I would guess that for post-training at OpenAI et al. you have maybe hundreds of evals. And adding more evals is more data work and more mixing work, making sure you have these things. But on the core evals, our suite of models from, I think, August, which was based on Llama at the time, matches or beats Meta.
我认为Meta有不同的优先级,这体现在他们为Llama 3.1(当时一组很棒的模型)所做的事情上。问题在于,我们如何提炼那些非常复杂的后训练说明或图表——比如Llama 3.1报告里那些经过多轮迭代的复杂反馈流程图,以及更早在Anthropic论文里的苗头:早期宪法AI那类工作多年来一直有多个模型变体。当你做大规模指令微调再进入偏好微调时,这会是什么样子,以及你还可能加些什么。在谈强化学习之前,我认为其中很多核心贡献在于:我们展示了如何扩展偏好数据。学术界一直在用同一个数据集——从Hugging Face的Zephyr beta模型开始,UltraFeedback数据集流行起来,一年之后它仍是开放偏好微调的最先进数据集。这显然不该是理所当然的。
I think Meta has different priorities, and their thing was Llama 3.1, which was a great set of models at the time. And it's just, how do we distill what are very complicated post-training explanations or diagrams, like this Llama 3.1 report where they have these complex feedback diagrams with many iterations, and earlier signs of that from, like, Anthropic papers that have these multiple model variants in the early constitutional AI things, for multiple years. And what does that look like when you're doing large-scale instruction tuning into preference tuning, and what else might you add? I think a lot of the core contributions of that, before we talk about this reinforcement learning thing, is that we showed how to scale up preference data. The academic community had been using this one dataset all the way back since the Hugging Face models — like Zephyr beta, which is when this UltraFeedback dataset got popular — and still, a year later, it's the state-of-the-art dataset for open preference tuning. And it's just one of those obvious things that doesn't need to be the case.
所以这在很大程度上是努力让更成熟的配方对大家可用。我提到过这个——我想是在和Jordan的谈话里提到过RLVR的起源:实际上,当你在开放环境中工作时,很多工作是在尝试匹配业界已经做到的事情。而我们走的是不同的路,因为我们的基础设施不同。所以OpenAI现在做的一些在长上下文上效果很好的事情,对OLMo可能不太适用,因为我们的基础模型可能没有足够的算力投入。
So it's a big effort to make more mature recipes available to people. And I mentioned this — I think in a talk with Jordan — the origin of the RLVR thing, which is, realistically, when you work in the open, a lot of it is trying to match what industry has done. And we're on a different path because our infrastructure is different. So some things that OpenAI does now that work really well for long context won't work that well for OLMo, because we might not have enough flops in our base model.
我们可能没有某些法律领域的数据集。但在方向上,很多工作只是在尝试复现这些东西。我长期以来一直想请John Schulman上播客——他先后在OpenAI、Anthropic,如今在Thinking Machines。当时他已经获准和我聊。他说的话印证了我在指令微调、多任务和偏好微调上的很多观点。
We might not have certain datasets for legal things. But directionally, a lot of it is just trying to reproduce things. And I've long tried to get John Schulman — of OpenAI and Anthropic and now Thinking Machines — on the pod. And at the time, he had gotten approval to chat with me. And what he said confirmed a lot of the things I had said on instruction tuning and multitask and preference tuning.
然后他说,哦对,大家现在都在对输出做强化学习。这就是我们想到RLVR并把它扩展成通用方法的由来。当时已有许多合理或非常相似的工作,比如VinePPO和Quiet-STaR,在数学与编程领域做可验证奖励。我认为RLVR的关键在于把它做成通用配方。没错。
And he was like, oh, yeah, everyone just does RL on outputs. And that's how we got the RLVR idea and scaled it into something that is a general method. There were a lot of reasonable or very similar works at the time, like VinePPO and Quiet-STaR, doing these math and coding domains for getting verifiable rewards. I think the RLVR thing was about doing it in general recipes. Yep.
这个命名是后来定下来的——最初我们打算叫"RL from Ground Truths"(基于标准答案的强化学习)。这里要特别归功于Costa Huang,他当时是我们在AI2的强化学习首席工程师,现在在做一个隐秘的创业项目,很快你们就能听到他的消息,我想他是某家公司的创始工程师;还有仍在华盛顿大学读书的学生Hamish Ivison。他们主导了大部分技术工作。
And the naming was something that stuck. Originally it was going to be RL from Ground Truths. Credit especially to Costa Huang, who was our lead RL engineer at AI2 and is doing some stealth startup now — you can hear more from him on that soon; I think he's a founding engineer of something — and Hamish Ivison, who's still a student at UW. They were leading most of the technical work on this.
但后来发现,"可验证奖励"其实是更普适的概念,因为只有数学题才有标准答案,而代码是可验证的,精确指令遵循也是可验证的。所以我觉得这个名称的演变很好,随着你考察更多领域,它就说得通了,这也是它流行起来的原因。当Jensen也开始用这个词时,我就想:好吧,这算是定下来了。虽然这本不是我们的初衷,但这就是……
But then "verifiable rewards" is actually a more general notion, because only math questions have a ground truth, whereas code is verifiable and precise instruction following is verifiable. So I think it's a nice evolution of the name that makes sense as you look at more domains, which is why it catches on with people. Once Jensen started using it, I was like, okay, that's set. That wasn't really our goal, but that's the —
所以是从那时开始流行起来的吗?
thing — so that's where it took off?
不,它在DeepSeek之后就已经开始流行了。但当那样的大人物把这个缩写放上幻灯片时……另外有一点很明确:RLHF是四个字母,我们想延续这一点,用一个类似的四字母缩写。
No, it was already taking off — that was after DeepSeek. But it's when people like that have the acronym on their slides. And it's also very deliberate: RLHF is four letters, and we wanted to evolve that and have a similar four-letter acronym.
这倒没什么玄机,但确实是有意为之。在这些细节上。
It's not that much magic to it, but there's definitely intention on these little things.
RLGT可能效果不会这么好,虽然不知道为什么。
RLGT may not have worked as well. I don't know why. But yeah.
没错。这就是当时大家的共识,他们最终决定改名,效果很不错,也很有趣。
Yeah. And that's what these people were definitely all thinking, and they made that name change, which works, which was fun.
你刚才提到——我们会展示RLVR图表,虽然你刚才主要引用的是Tulu论文。你说现在想调整这个命名,那我们可以先预览一下智能体讨论的部分内容。
You did mention — so we'll show the RLVR chart; you mostly quoted from the Tulu paper there. You did mention that you wanted to change it now, and we'll preview a little of the agents discussion.
对。初次接触RLVR时,其实只有一个函数——检查语言模型输出的字符串是否正确。你有一个相对简单的函数:语言模型给出的这个答案对吗?这里并没有真正的环境,因为你只是在看生成结果。而现在我需要想清楚,该如何描述多跳工具使用在这个框架下是什么样子——这确实是大家正在做的事情。
Yeah. I think when you're introduced to RLVR, there's really just a function that checks the string outputted from the language model. You have a relatively simple function: is this answer from the language model correct? And there's no real environment, because you're just looking at the generation. And now I need to figure out the right way to communicate what multi-hop tool use looks like for this, which is something people are definitely doing.
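这里描述的"检查字符串是否正确的函数"可以用一个最小的Python示意来说明——`\boxed{}`答案格式和函数名是假设的惯例,并非Tulu代码库的实现:
A minimal Python sketch of the kind of verifiable reward function described here — the `\boxed{}` answer convention and function name are illustrative assumptions, not the actual Tulu codebase:

```python
import re

def math_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer in the model's
    output matches the reference answer, else 0.0.
    Assumes answers are written as \\boxed{...}, a common convention."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parseable answer counts as incorrect
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

Note there is no environment here: the function only inspects the finished generation, which is exactly what makes this setting simpler than multi-hop tool use.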
是的。我在思考,什么样的图表能概括O3模型的训练过程——实际上它们会采取多重行动,因为下一步序列依赖于环境反馈,而环境本身就是某种信息存储库。比如当它搜寻某个冷门信息时,你无法预知后续行动,除非获得Bing搜索的反馈(据称他们正是这样做的)。这个步骤确实正在发生。而当人们试图转向端到端强化学习时,环境的概念就变得非常关键——你要从多次迭代中寻找稀疏信号。
Yeah. Thinking about what the right diagram is to encapsulate how o3 is trained — in action, they take multiple actions, because the next sequence depends on the feedback from the environment, which is some sort of information store. So when it's searching for a niche piece of information, you can't know what the next actions are without the feedback from Bing searches, which is what they say they use. That is a step that is very much happening. And then as people try to transition to more end-to-end RL, there's a real, strong notion of environment, which is that you're looking for a sparse signal across multiple generations.
这正是人们想实现的。但当前是否真有人做到还有争议。Deep Research的博客暗示他们做了一堆小规模RL,然后系统就神奇地运转了。我认为更可能的情况是:人们在大量小任务上训练,做一些提示工程,然后发现把这些模块组合起来,或者用同一个模型的几个不同微调版本(比如Deep Research里似乎就有O3的某个微调版本),再配上几个不同的RL领域,系统就能工作了——而不是Deep Research直接在结果上训练。我认为这也说得通,因为对Deep Research做基于结果的RL就等于又回到了RLHF:你得让两个人来评判哪份生成的报告更好。这当然也可以做,OpenAI的谄媚事件就表明他们在后训练中有非常多不同的奖励模型和奖励信号。但那只是其中一环,我认为让它成立的很多进展,来自在大量信息检索、编辑和搜索任务上做RL。
And that's what people want to do. I think it's debatable whether people are actually doing it now. The Deep Research blog post kind of hints that they do a bunch of small-scale RL and then, poof, the system works. I think what's much more likely happening is people train on a bunch of small things, do some prompting, and see that it works when you put these pieces together, or a couple of different fine-tunes of a model — it seems like Deep Research has some fine-tune of o3 in it — and you do that with some different domains of RL. Rather than Deep Research being trained on the outcome, which I think makes sense, because doing outcome-based RL for Deep Research would be RLHF again: you'd have to have two humans asking which generated report is better. I think you can definitely do that, and the whole sycophancy thing at OpenAI showed that they have so many different reward models and reward signals in their post-training. But that's just one of them, and I think a lot of the progress in making it exist is doing RL on a bunch of information retrieval, editing, and search tasks.
我们和Noam Brown讨论过Deep Research与可验证奖励的问题。他指出这属于不可验证的RL案例。你最近的帖子也提到大实验室拥有海量数据,能从长尾问题中寻找RL切入点,组合后就能修复问题。你认为可验证性是否构成重大瓶颈?毕竟当前验证仅停留在小型原子任务层面,难以规模化?
We talked with Noam Brown about this Deep Research and the verifiable rewards thing. He mentioned, obviously, that it's an example of a non-verifiable thing that RL works on. And in one of your recent posts, you also talked about how the big labs have all this data, so they can find long-tail things to RL on, and then when you put them all together, that fixes it. Do you feel like what we're able to verify is a big bottleneck — that verification is only done on these smaller atomic things, so we can't really scale it?
我那篇帖子主要探讨智能体进展与模型进展的关系。过去三年我们习惯了模型进步的叙事。重点不在于对小任务做RL,而是通过后训练修正异常行为。RL在获得正确信号时数据效率很高,但也可以直接针对非可验证问题——比如创建数百条指令加入后训练,让模型正确执行信息抽取这类软性操作。
I think my comment in this post was reflecting mostly on the question of what agent progress will look like relative to modeling progress. We've had almost three years of modeling progress, and we're pretty used to the messaging on that. And it wasn't just about doing RL on small things, but doing any post-training to fix a weird behavior. RL is a very data-efficient way if you can get the right signal, but you could also just say: it does this weird non-verifiable thing, so let's create a hundred or a thousand instructions to include in post-training so that the model does this type of information extraction — or soft extraction — correctly.
这个领域我想用更多任务案例来充实。只要看看Claude Code在后台做什么——大量读取文件,甚至只是压缩上下文。我不认为那真的算可验证的能力,但如果搞砸了,对长上下文操作和更长的任务来说,"压缩得好"是一项极其关键的技能。而这需要一些训练上的创新:你可以有效地改造训练数据,不保留全部多轮上下文,而是插入摘要,同时希望性能保持不变。更短的上下文也能节省成本。类似的新领域还有很多。
It's a space that I want to flesh out with more examples of tasks. If you watch Claude Code going — what is it doing in the background? It's a lot of reading files, and even just compressing context. I don't think that's really a verifiable thing, but compressing well is a super crucial skill for long-context actions and longer tasks, and having it messed up really hurts. And that's going to take some training novelty: you can effectively modify your training data so that instead of having all the multi-turn context, you just insert the summary, and you want the performance to stay the same. It's also a cost saving to have shorter context. There's just a lot of new domains like that.
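这里说的"用摘要替换多轮上下文"的数据改造,可以用下面的示意来说明——其中`summarize`只是某个(假设的)摘要器的占位:
The training-data modification described here — replacing earlier turns with a summary while keeping the final turn — can be sketched as follows, where `summarize` is a stand-in for whatever (hypothetical) summarizer you use:

```python
def compress_history(messages, summarize):
    """Replace all turns except the last one with a single summary message,
    so a model can be trained to behave the same with far less context.
    `messages` is a list of {"role": ..., "content": ...} dicts;
    `summarize` is a placeholder for an external summarization step."""
    if len(messages) <= 1:
        return list(messages)  # nothing worth compressing
    earlier, last = messages[:-1], messages[-1]
    summary = summarize(earlier)
    return [
        {"role": "system", "content": f"Summary of prior turns: {summary}"},
        last,
    ]
```

Training on pairs of (full history, compressed history) with matched targets is one way to make performance invariant to this substitution, which is also where the cost saving comes from.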
但你觉得能在发布前发现这些问题吗?还是说实验室因拥有大量用户数据,在推理阶段就能监测这些,从而占据优势?
But do you feel like you can figure out what these things are before you release? Or do you think the labs have a big advantage because they have so much user data that they can inspect this at inference?
现阶段主要依赖真实数据。虽有明确基准测试可用,但行业正围绕不同形式的数据进行整合,这才是人们真正需要关注的重点。
I think it's mostly looking at real-world data at this point. To the extent that there are clear benchmarks, you can use them in the open. But I mean, we see the industry consolidate around data in different forms, and I think that's a really important touch point for people.
我好奇现在还有谁在收集可靠的公开数据源供大家使用。
I'm curious who's, like, still collecting reliable sources of open data that everyone uses.
这个领域有很多动作,但难有实质进展。是的。长期以来,大家都明白,拥有大规模的偏好数据仓库会非常有益。如果你想要,可以来催我发布——比如给Tulu出一个最终数据集——我们手上有来自更多模型的补全和评分。我一直在和一位学生讨论怎么把这些整理出来,因为我们有太多补全和"LM当裁判"的AI反馈数据,多到不知如何清洗。
There's a lot of action in the space, but it's hard to get traction. Yeah. So I think for a long time, preference data has been something where people understand it'd be very good to have large repositories of it. If you want that, you can annoy me to try to release it — like, for Tulu, have a final dataset — but we have completions and ratings from more models. I've been talking to a student about figuring out how to write this down, because we just have so much completion and LM-as-a-judge AI feedback data that we don't know how to clean.
这是一方面。问题在于,我认为其中很多是任务和模型特定的。所谓"on-policy"(借用强化学习的词来说偏好数据和偏好建模),就是你希望用来训练奖励模型的那些生成序列,看起来像你即将微调的那个模型的输出。
That's one thing. The problem is I think a lot of it is task- and model-specific. So there's this notion of on-policy — to adopt an RL word for just this preference data and preference modeling — which is that you want the sequences you're training this reward model on, the sequences of generations, to look like the model that you're starting to fine-tune.
是的。
Yeah.
这使得拿来即用变得困难。例如,我提到的UltraFeedback数据集里就包含了很多模型的输出,所以对大多数人要微调的模型来说,里面都有可供改进的信号。我不知道这种情况能持续多久,而且我们仍然没有"人类反馈与AI反馈孰轻孰重"这个问题的答案。每次我和前沿实验室的人交流,他们都说:是的,我们仍然在用人类偏好数据。
That is something that has made it hard to grab off the shelf. And, for example, this UltraFeedback dataset that I mentioned just has a lot of models in it, so for most models that people are fine-tuning, there's some signal for it to improve on. And I don't know how long that lasts, and we still don't have the answer to the question of how important human feedback is versus AI feedback. Every time I check in with people at frontier labs, they're like, yeah, we still use human preference data.
我就想,好吧。我接触不到那些数据,也不知道如何衡量它实际带来的价值。可能大部分收益在于——该用什么形容词描述Chatbot Arena呢?人们现在对竞技场评价不高,但人类数据可能在提升留存时间和总体偏好上帮助很大,而大多数学者都在做MT-Bench和AlpacaEval这类评估,在大家都在注意力经济里搏杀的当下,这没那么关键。
And I'm like, okay. I don't have access to that, and I don't know how to measure how much it really gives you. It might be that most of the benefit is on the — what's the right adjective to describe Chatbot Arena? People are down on Chatbot Arena, but it might be that the human data helps boost retention time and general preference a lot, where most academics were doing MT-Bench and AlpacaEval type things, which just aren't as crucial when everybody's fighting in the attention economy.
注意力经济。你反应真快。既然说到这儿——你提到过谄媚问题,提到过LMArena,那篇发在Interconnects上的文章我很喜欢。它们凉了吗?竞技场还有未来吗?
The attention economy. You're quick. I mean, since we're there — you mentioned sycophancy, you mentioned LMArena; that was one of your posts on Interconnects that I really enjoyed. Are they cooked? Is there a future for arenas?
接下来会怎样发展?他们现在有1亿美元资金,打算怎么用?
Like, how does this play out? You know, they got a $100,000,000 now, like, what are you gonna do?
我不知道这笔钱对他们意味着什么,但我认为评估仍然有价值。尤其是在前沿领域,人们非常怀疑,但在'用最廉价模型实现不错效果'的压缩竞赛中,这对很多人仍然非常有用。
I don't know what the money does for them, but I think the eval is still valuable. Especially at the frontier people are very cynical, but in the compression race — what is the cheapest model you can have that does pretty well at this? — it's still so useful to a lot of people.
聊天功能才是王道。
Like, chat is king.
是的。我从来不用这些东西闲聊。这也是为什么我用GPT-4.5——它在Chatbot Arena上表现不算好,但我记得它在Yupp(这个领域的新竞争者)上排名更高。他们有个"氛围"类别……抱歉。
Yeah. I've never run chats with these things. It's why I use GPT-4.5, which isn't as good on Chatbot Arena. I think it's higher on, like, Yupp, which is a new competitor in this space. They have, like, a vibes category. Sorry.
Yupp?对。有个叫"yupp.ai"的网站,你可以查查看。是个竞争对手,另一家初创公司。
Yupp? Yeah. There's, like, yupp.ai. You can look it up. It's a competitor, another startup.
他们有只猫标志...这些公司都有分类标签,其中一个分类是‘氛围感’,GPT-4.5排在榜首,我就想,好吧,这种排名确实有迹可循。
They have, like, a cat — all these companies have categories, and one of their categories is vibes, and GPT-4.5 is on top, and I'm like, okay, this tracks.
但这是个前沿模型。
But it's a frontier model.
没错。而且那些无形的东西很棒。排行榜已经建立起来了,人们还是该用它。这有点像连接产学研不同领域的社区聚焦功能。
Yeah. And that stuff, intangibly, is very nice. The leaderboard is established. People still should use it. It's kind of a focusing function for the community, from industry to academia.
对。我不会帮他们解决盈利问题,但确立清晰的规范,以及一些可以被永远爬坡优化的指标,是非常好的。比如这种给模型做Elo排名的想法——
Yeah. I'm not going to try to solve their monetization problems for them, but having clear norms and things that can be hill-climbed forever is very good — like having this idea of an Elo ranking of models.
但你们无法达到饱和状态。
But you cannot saturate.
是啊。这挺酷的。这是个绝佳的问题——但你可以钻空子。
Yeah. It's kinda cool. It's a great problem. But you can game it.
所以我认为这就是问题所在。
So I think that's the issue.
没错。但大家都会用多重标准评估...
Yeah. But everyone evaluates on multiple things. Well —
Sarah Hooker最近公开发声了——我从没见过她这样公开表达不满。Artificial Analysis也有一个竞技场,我觉得挺酷。另外与此相关的一点是,很多数据其实是单轮测试,不是多轮对话。
Sarah Hooker came out — I've never seen her go public like that about her gripes. Yeah. Artificial Analysis also has one, which I think is kind of cool. The other thing I think is relevant to this discussion is that a lot of the data is actually a single test, a single round. It's not multi-turn.
我在想如何建立真正的多轮竞技场,因为中途要切换模型。这正是LMArena的整个前提。
And I wonder how to create proper multi-turn arenas, because you have to switch the models. That's the whole premise of LMArena.
这取决于用户数据的价值。如果用户数据持续保持同等或高于推理的价值,就会有平台不断推动其向更高成本领域发展。是的,他们很可能会建立一个深度研究竞技场——因为如果我是OpenAI开展深度研究,这正是我渴望获得的数据。而LMSYS作为拥有市场平台的实体,正是为此而设立的竞争对手。
It depends on how valuable the user data is. If the user data keeps being equally or more valuable than the inference, there's going to be a platform to keep pushing this into more and more expensive things. Yeah. So they're probably setting up a deep research arena, because that's the data that — I mean, if I were OpenAI working on deep research, that's the data I'd want. And there are competitors, and LMSYS is the entity with the marketplace meant to set it up.
没错。这就像我对Scale的看法——他们不断攀登AI数据处理的前沿,凭借品牌效应持续提升增量评估能力,许多这样的企业都具有持久生命力。
Right. I mean, it's almost how I see Scale. Scale kept climbing the edge of what AI data processing is, and because they're the name brand, they keep climbing the incremental evaluation game, and a lot of those businesses have longevity.
是啊。确实。
Yeah. Yeah.
某种程度上这是网络效应。你提到Scale这个同样...
It's a network effect in some ways. You mentioned Scale, which is another —
热门话题,不过我们可以把所有热点讨论留到最后。我想先聚焦技术层面——你还在写那本RLHF的书吗?
hot topic, but we'll put all the hot takes at the end. I do want to focus — try to be technical up front. You're still writing the RLHF book?
现在该叫RLVR书了吗?我可以说说我的理由:归根结底,RLVR既不够成熟,作为一本书也没那么有趣,所以我不打算改名。这是两个原因,另外还有些个人职业策略上的考量。
Is it the RLVR book now? I can give my spiel on it. Ultimately, RLVR is not mature enough, nor is it as interesting a book. Okay — so those are the two fronts of why I don't want to rebrand, and there's also some personal career strategy.
但这应该与书籍客观质量无关。因为RLVR在未来18个月将经历巨变——新算法层出不穷,但更重要的是如何进行正确的预训练、数据获取方式、工具使用能力的涌现,这些才是RLVR的核心议题。
But that should be independent of what is objectively a good book. Because RLVR is going to be changing so much in the next eighteen months — we've already seen it. There are all these new algorithms, but I think there's a lot more under the hood on how you do the right pre-training for it, what the data is, and how tool use emerges. All of this is core to what RLVR will be seen as.
我正在观察o3模型会是小众方案还是成为主流路径——特别是在搜索工具使用方面展现的独特风格。OpenAI的实现方式仍是谜团,这些正是RLVR书籍目前缺失的核心内容。而RLHF更具跨学科性,就像聊天机器人竞技场永远无法饱和那样。
I'm watching to see if o3 is a niche model or becomes the path that everybody needs to follow, with its different style of tool use that you see particularly with search. Okay. And we don't know how OpenAI did this. These are the things that I think are core to an RLVR book that we don't have. Whereas RLHF is more interdisciplinary, in the same way that Chatbot Arena can never be saturated.
RLHF永远无法被彻底解决。我们已了解对齐、过度优化等问题本质,以及人们获取数据的管道。虽然我可以往书里添加更多RL算法(这对我的研究有好处),但这并未改变奖励建模的本质——无论是价值函数还是奖励模型等实现方式。RLHF的广度很有价值,我认为RLHF问题将具有基础性意义,其研究节奏会更稳定;而RLVR虽处爆发期,但可能被彻底解决后学术关注度归零。
RLHF can never be solved. And we kind of know these problems of alignment and over-optimization, and what the pipelines people use to get data look like. And yes, I can add more RL algorithms to the book, which is nice for me to study, but that's not really changing what reward modeling is and the different ways people implement these today, whether it's a value function or a reward model and things like this. So I think the breadth of RLHF is nice, and I would tell a lot of academics that I think RLHF problems are going to be foundational, with a much steadier study rate, where we're on this massive spike of RLVR, but it might just get solved. And then it goes back to zero academically.
这并非夸大其词,但可能存在一种最佳实践,能让你在任何问题上获得100%的准确率。然后争议就会转向关于偏好的讨论,这种争论将永无止境。
It's an embellishment, but there could just be a best practice for getting 100% accuracy on any problem that you want. Then it's solved, whereas the debate on what a preference is will go on forever.
是的。因为这是可验证的,确实存在正确答案。
Yeah. Because it's verifiable, there there is a right answer.
没错。
Yeah.
抱歉。你提到'未来18个月内会有很多变化'具体是指什么?你预见到哪些情况?其实我们先回顾下近期已经发生的事情吧。
Sorry. What do you mean by "over the next eighteen months, there'll be a lot of changes"? What do you foresee? Actually, let's just catch up first — what's already happened in recent history?
好的。我们掌握的信息分为两类:模型在做什么,以及研究人员在做什么。我认为模型在揭示实际前沿方面提供了很多启发,比如O3、Gemini 2.5和Claude这些案例。
Yeah. So there are two categories of information that we have: what the models are doing and what the researchers are doing. Yeah. I think the models provide a lot of inspiration in terms of what the actual frontier is. And that's things like o3, Gemini 2.5, Claude.
这些模型各有特点:O3我认为是扩展性最强的强化学习方案,而Claude和Gemini 2.5非常相似,都采用可开关的混合推理模型。它们的部署方式不同——Gemini最初没有混合推理功能但后续加入,Claude则首发就具备。最关键的问题是:O3的纯推理模型和混合推理模型哪个更有价值?
These are a mix — o3, I think, is the most scaled-up RL approach. And then Claude and Gemini 2.5 are very similar, with hybrid reasoning models that you can turn on and off. They've rolled it out in different ways: Gemini didn't have hybrid reasoning at launch but has since brought it in, and Claude had it at launch. One of the most important questions has got to be: is the o3 path of just a reasoning model more useful, or are hybrid reasoning models?
它们的训练方法会分道扬镳吗?Nvidia的Llama-Nemotron推理论文可能是关于混合推理最详细的文献,而DeepSeek R1仍是纯推理模型的标准配方。这两种方法差异很大,尚不清楚哪种会胜出。此外,数据侧和强化学习方法上也有大量工作。
Do they diverge in their methods for training them? I think the Nvidia Llama-Nemotron reasoning paper is probably the most detailed paper on a hybrid reasoning thing, and DeepSeek R1 is still the canonical recipe for a reasoning-only model. Those are very different approaches, and I don't know if one will win out. And then there's just a lot of work on the data side and on RL methods.
我认为外界有一整串对GRPO的质疑,其中某些地方的数学推导确实说不通。
I think there's a whole list of GRPO complaints out there, where the math doesn't make sense for certain things.
在我看来,每篇新论文似乎都在给GRPO打补丁。大家做各种变体改良挺有意思,但我也不确定DeepSeek会不会推出R2,用下一代的东西把所有人甩在身后。
To me, every paper I see come out always has some fix to GRPO. It's kind of cool that people are taking variations on it, but I also don't know if DeepSeek is going to come out with R2 and just blow everyone away with whatever is next.
确实。算法从来不是最关键的因素。就像我在AI工程师世界博览会的演讲中略带讽刺地说过的:如何训练推理模型?就是获取初始数据集,逐步优化数据,直到时间耗尽或性能停止提升,然后测试所有论文提到的调整开关,通过网格搜索找出有效方案。
Yeah. I definitely don't think the algorithm tends to be the most important thing. I had this in my AI Engineer World's Fair talk, which was kind of snarky: how do you train a reasoning model? You get a starting dataset, you incrementally improve the dataset, you do that until you run out of time or your performance stops going up, and then you try all of the switches from all the papers — you do a whole bunch of binary tests of all these various algorithmic changes, and you do a grid search and see what works.
坦白说,这就是为什么GRPO刚推出时我没当回事,因为它被宣传成效率工具。是啊。我当时就想,好吧。但你知道,我...我被训练得不在乎效率,因为那只是资源问题。
Candidly, that's why I dismissed GRPO when it first came out — because it was sold as an efficiency thing. Yeah. And I was like, okay, fine. But, you know, I've been trained not to care about efficiency, because that's just a matter of resources.
没错。GRPO的优势估计非常适合可验证奖励。对。但另一点比较抽象,更像是"在特定基础设施上更好用"的论证。而且它是随DeepSeekMath一起发布的,远在RLVR阶段之前。
Yeah. The GRPO advantage estimate is very well suited to verifiable rewards. Right. But the other thing is kind of an intangible, works-better-on-this-infrastructure type argument. And it came out with DeepSeekMath, which was well before the RLVR phase.
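这里提到的"GRPO优势估计"可以用一个最小的示意来说明(按DeepSeekMath论文的描述):对同一提示采样多条补全,用可验证奖励打分,然后在组内做归一化;其中的epsilon项是为数值安全而加的假设:
A minimal sketch of the group-normalized advantage mentioned here (as described in the DeepSeekMath paper): sample several completions per prompt, score each with the verifiable reward, and normalize within the group. The epsilon term is an assumption added for numerical safety:

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: each completion's reward is normalized
    by the mean and std of its own group, so no learned value function
    (critic) is needed. With binary verifiable rewards, this cleanly
    separates correct from incorrect completions."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary rewards like `[1, 0, 0, 1]`, correct completions get positive advantage and incorrect ones negative; if every reward in a group is equal, all advantages are zero and the group contributes no gradient — one reason the estimator pairs well with pass/fail verifiers.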
当时确实是那么营销的。
So it was really marketed as that.
谈到混合模型时,你如何协调OpenAI声称要淘汰模型选择器转向统一接口的说法?你觉得他们是否感受到压力——比如'看,我们有这么多类别,得打包成正确形态'?还是有其他原因?
When you talk about hybrid models, how do you reconcile that with OpenAI saying they want to move away from the model selector to just having a unified interface? Do you feel like they feel pressure — like, hey, look, we have all these different classes, we want to wrap them into the right thing — or do you think there's something else?
我认为OpenAI想要的是一个知道问题有多难的模型。这应该是大多数做推理的人的北极星——模型会自己在问题上花恰当数量的token。你也可以从算力层面看这场讨论,理解推理时扩展意味着什么。嗯。我觉得在很多方面,混合推理器可能终将被淘汰,只剩某些细分场景,因为质量远比少用100倍的推理token重要。
I would think that OpenAI wants to have a model that knows how hard the question is. I think that has to be the north star for most people working on reasoning: the model will just spend the right amount of tokens on it. And you can look at this as a compute-level discussion and see what inference-time scaling means. Mhmm. I think in plenty of ways hybrid reasoners might just be aged out, except for niche applications, because quality is so much more important than having 100x fewer inference tokens.
就像……你直接为算力买单,而它会越来越好。黄仁勋最近也说过类似的话——我记得是Stratechery重点报道或采访了他。他说一切终将是推理模型,因为成本会变得很低而且效果更好。所以混合推理这件事反而显得有点怪。我自己总是开着推理,除非是很蠢的查询,比如"这是什么东西?"
It's like, you just pay for it in compute, and that will get better. I think that was something Jensen said in his most recent — I think Stratechery highlighted it or had the interview with him. And it was like, yeah, everything's going to be a reasoning model, because it's going to get so cheap and they're better. And that's why the hybrid reasoning thing is a little bit weird. I always just turn reasoning on unless it's a really silly query, like, oh, what is this thing?
所以两年后这个趋势会很明显——虽然O3现在还在烧钱给我们用。比如我查论文时它检索80个网站,那得多少token啊。但方向上看,如果这方法有效,就会成为默认方案。
So, okay, in two years that kind of tracks — though I think o3 is also just burning money on us. I mean, it searches 80 websites for me asking what paper something is. That's a lot of tokens. But directionally, if that's the thing that works, that'll be the default. Yeah.
至少在我们接触的这些高端领域——无论是编程还是高端信息经济——大家都认为价值确实存在。
At least in all of these high-end things that the people we talk to are in — whether it's coding or the very high-end information economy — they think the value is there.
我想深入探讨一个你反复提及的观点。你似乎断言O3通过频繁使用搜索功能实现了与众不同的效果,其使用频率远超其他大多数方案。那么现在所有模型都内置搜索引擎了吗?这已成为标配功能?
I wanted to double-click on something that you seem to be coming back to a lot. You seem to assert that o3 does something very different by using search a lot — much more than basically everyone else. Yeah. Do all models come with a search engine now? Is that a must-have?
这取决于具体使用场景。如果是进行通用信息检索或理解任务的话...是的。我们可以尝试找找相关旧论文的链接...
It depends on your use case. Yeah. If you're doing general information retrieval or understanding, yeah. There are old papers that we can try to find the links to.
不确定Sam Mullen是否讨论过,但DeepMind的那篇RETRO论文和其他一些架构最近又被拉回讨论——用一个很小的模型,配超长上下文窗口和巨大的检索库。我不是那种会押注Transformer架构搞不定长上下文之类问题的人,但这些被重新提起的想法说明:搜索更好。看看各个推理模型的评估,一个趋势是SimpleQA分数普遍下降:从DeepSeek R1到新版R1在降,从Qwen 2.5到Qwen 3,SimpleQA也在降。
I don't know if Sam Mullen was talking about it, but there's this RETRO paper from DeepMind, and other architectures that people have been pulling into the discussion again: you have a very small model with a very big context length and a very big retrieval store. I'm not one to bet against the transformer architecture just figuring out long context and things like this, but those are ideas people are bringing back — search is better. You look at all the evals from reasoning models, and one of the trends is that the SimpleQA numbers all drop. DeepSeek R1 to the new R1, it goes down. Qwen 2.5 to Qwen 3, SimpleQA goes down.
至少在无工具辅助的评估环境下如此。简单QA本被认为是衡量长尾知识掌握度的优质指标,而所有原始模型在这方面的表现都在退步。或许对于长尾信息而言,搜索行为模式确实更具合理性。
At least when you're evaluating these without tools. And SimpleQA is what's considered to be a very nice, fairly numerically robust long-tail knowledge evaluation. And all of these raw models are going down on it. But for long-tail information, just having this search behavior may make a lot more sense.
我理解这个观点,但反过来说——我自己也经历过这种思考:为什么不打造一个仅依赖搜索的'零知识'模型?理论上可以实时检索所需信息。但问题在于:你必须先知道该搜索什么关键词。
Okay. The counterargument for this (I have been through this journey too) is like, oh, why don't you make a model that doesn't know anything but search? Right? You can search for anything you wanna learn just in time. But the problem is you'd need to know what the search terms are.
整个机制运作需要基础智能作为前提。
You need some baseline intelligence to make all this work.
确实,这个表述很精准。
Yeah. That makes sense. That's a that's a good way to put it.
我认为这很关键,因为现在有种观点认为语言模型将彻底转型为'永久在线型'。虽然Perplexity最早将搜索引擎与LM结合的做法引起了我的注意,但尚未形成主流趋势。不过现在越来越多服务将其作为默认功能提供,比如Gemini也推出了搜索锚定功能。
I think it's important because there's this thesis of like LMs becoming just online LMs, like, permanently. And it hasn't been super pursued, like, Perplexity was one of the first to put it on my radar as, like, they were, like, we'll attach the search engine to the LM and that's what you get now. And I think, like, more and more people are starting to offer it as part of their default services. Like, Gemini has, like, a a search grounding thing as well.
这也是人们诟病Anthropic的主要局限——它使用Brave搜索引擎,返回结果中充斥着大量SEO垃圾内容...
I mean, it's what people say a big limitation of Anthropic is because it uses Brave Search which returns a bunch more like SEO slop than
有实证依据吗?我不太确定...还以为他们用的是自建索引。
Is that proven? Because I I I don't know. I I thought they had their own index.
好吧。我没有详细查证过,所以我只是在处理一些传言。但我认为他们最终都会建立自己的索引,这应该是谷歌能再次占据优势的领域之一。没错。
Okay. I haven't done a detailed look, so I'm dealing with rumors. But I think they'll all end up doing their own index, and it's one of those things where Google should have an advantage again. Yeah.
但谁知道他们会不会这么做?我在帖子中也暗示过,就像Hamish曾尝试让RLVR的同一个学生用RL模型做搜索实验。要让模型使用工具很容易——只要给出提示就行,但要让RL模型真正理解工具的用途却很难。这就是为什么需要经历80次工具使用失败后它才突然掌握,或者在第81次时才成功。
But who knows if they do? I also hinted at this in my post, but Hamish had tried to set this up, the same student from RLVR, playing with search in an RL model. And it's very easy to get the model to use tools if you prompt it to, but it's very hard to get the RL model to learn that the tool is useful. That's why it has to go through these runs where it's 80 failed tool uses, and either it stops or it gets it on the eighty-first.
明白了。
Okay.
这种RL行为感觉像是通过精心设计的环境自然涌现的——他们让模型学会了使用工具,这不是通过监督微调能做到的。感觉他们搭建了完美的环境,并融入了之前深厚的科研成果,将问题分解为多个子RL任务,这才让模型实现了突破。虽然不想总提OpenAI,但我经常推荐大家试试o3,因为它确实很特别。
It's just the RL behavior that feels emergent from having a very nice way of getting the model to learn to use the tool, and it's not like you can SFT this model into doing it. It really feels like they set up the environment right, and it plugs into this deep research line of work that they did, and they broke down the problem into these sub-RL tasks. So then it kind of lets it do this thing. Interesting. I don't wanna be an OpenAI show all the time, but I tell people to play with o3 all the time because it's weird.
确实出色。不过你似乎高估了深度研究团队的工作量——据我所知只有三个人:Isa和她的两位合作者。从OpenAI的信息来看,Deep Research基本上只是o3的轻量级封装。
It's excellent. I would question the amount of work you're imputing to the deep research team when, as far as I know, it's three people. It was Isa and the two other collaborators that she had. I don't know if they did that much on top of o3. Every indication I've had from OpenAI is that Deep Research is more or less a thin wrapper over, well, just o3.
对,可能只是添加了一两个小改动,比如‘我们只需要在训练数据里加这点东西就能让深度研究运作起来’。
Yeah. It's probably one or two small things where they're like, oh, we can make deep research work by adding this small amount of data to the training mix, and then it just works.
我也是这么认为的。
That would be how I'd describe it.
前几天匿名用户Gwern在我关于Q*的推文下回复,问为什么这些全错了。显然,简单方案难以扩展;当时AI领域有很多其他令人兴奋的进展,而OpenAI又发布了许多让人困惑的内容。
I mean, what is it, Gwern, the anonymous person? He replied to my Q* post on Twitter the other day, and he was like, why was this all wrong? Yeah. And it's obviously, like, simple things don't scale. There was a lot of complexity because there were a lot of other exciting things in the AI field at the time, and OpenAI kind of sends out a lot of things that confuse people.
但深度研究符合这个模式——它只是对现有o3 RL路径的微小调整。他们可能早发现搜索会更好,于是重新包装了这个简单却关键的改动。很多突破都是这样,一旦有了基础,后续改进就显得顺理成章。
But this would fit into that, which is: Deep Research is a minor change from an existing RL trajectory, which was o3. They had probably already figured out that search was gonna be better, and then were like, okay, we can repackage this. And it's a simple thing that makes a big difference. Yeah. And most things are like that once you have traction. Yeah.
我认为最难的是在S型曲线初期实现突破(比如Q*)。但进入中期后,很多事会变得显而易见——就像我们为OLMo做的工作,多数只是有动力去推动那些看似明显但实操困难的事。比如要让一套完整的推理配方落地就极具挑战,因为现有评估套件存在巨大惯性:是分支现有配方,还是直接拿类似Open Reasoner Zero的方案从头开始?这完全是另一层麻烦。
I think trying to get the initial takeoff on the sigmoid is the hard Q* thing. But once it's like this, a lot of things in the middle feel obvious, which is why I describe one of the things we work on for OLMo: a lot of it is just having motivation to do things that feel somewhat obvious but are still hard. It's hard to get different recipes, or to get a full reasoning recipe off the ground. It's a huge change because you have all this inertia on this eval suite, and then you have to figure out: do you branch your recipe, or do you just take something like Open Reasoner Zero and start from scratch? Which is a whole other headache of things.
要让这些涉及5到10人以上、存在惯性的项目动起来确实很难。但一旦开始攻坚,事情就会显得非常明朗。
It's just hard to move these projects that are anywhere above five to 10 people with inertia to get stuff done. But then once you're hill climbing, things can seem really obvious.
是啊。好的。你刚才讲了很多内容。
Yeah. Okay. You covered a lot there.
在问下一个问题前,关于Brave的话题收个尾——我们的朋友Simon Willison发帖指出,Anthropic已将BraveSearch列为他们产品的子处理器之一。
Before my next question, just to close the Brave thing: our friend Simon Willison wrote a post noting that Anthropic added Brave Search as one of the sub-processors in their product.
是的。
Yes.
所以消息源头就在这里。至于实际使用程度,我们不得而知。
So that's where the thing came from. Now, to what extent it gets used, we don't know.
确实不清楚。我想就他说的几点简单评论下,然后再回到你的问题。对了,有篇回顾Q*的精彩文章,就是你写的那篇,我想推荐大家去看——主题是"o1是一场psyop"。
We don't know. I would just comment on a couple of things that he said, and then we'll go on to your question. Yep. There's a very good post, a retrospective on Q*, a very good post that you wrote that I wanna send people to, which was: o1 was a psyop.
对吧?这自然引出一个问题:如果o1是psyop,那现在还有哪些可能是psyop呢?
Right? That does imply the question of, like, if o1 was a psyop, what else could be psyops now?
没错。肯定还存在其他psyop。整个推理时间缩放图就是个典型psyop。为什么?你把这两样东西并排放置,配上一条X轴,看起来就像很容易控制似的。
Yeah. There are definitely psyops out there. I mean, the whole inference-time scaling plot is such a psyop. Why? You put these two things next to each other with an x axis, and it just looks like it's easy to control.
只要看到X轴,人们就觉得容易掌控。而我们左边那个是训练过程。对,训练逻辑很清晰。如果你翻看早期强化学习论文就会发现,RL学习曲线通常用非线性X轴表示,它们呈现这样的形态...
Whenever you see an x axis, you think it's easy to control. Whereas the left one was training. Yes. And training makes a lot of sense. If you go to really old RL papers, RL learning curves are usually on a non-log x axis, and they look like this.
就是这种类似对数或指数上升的曲线。如果改用对数X轴,它就成了直线。所以这部分就像'哦,我们在强化学习中见过'。但推理时间缩放之所以让人产生'有个调节旋钮'的错觉,正是因为它用了X轴表示。
They look like these, whatever, logarithmic or saturating rises. And then if you take one of these and put it on a log x axis, it's a straight line. So that side is like, oh, okay, we've seen this before with RL. But with inference-time scaling, it being on an x axis is why people are like, oh, there's a knob.
我能搜索到大量信息。是的。这就像滋生各种奇怪想法的温床。那篇文章的核心观点是,他们从训练过程中选取数据点,或是利用自然存在的方差,然后将它们排列起来。一旦排列妥当,就能得到这种漂亮的推理时间缩放行为——现在很多人已经复现了这种推理时间缩放现象,如今情况要明朗得多。
I can turn search up a lot. Yeah. And it's just what breeds all these weird ideas. The core of that article is just that they're taking points from within training, or there's natural variance, and then you line them up. And if you line them up, then you get this nice inference-time scaling behavior. And now a lot of people have reproduced this plot on inference-time scaling, and it's much clearer now.
但当时我就想,我明白为什么我认为这是个调节旋钮。就像——看啊,他们称之为推理时间缩放。这是你可以控制的。
But at the time, it's like, I I see why I thought it was a knob. It's like, oh, look. It's a they called it inference time scaling. You control it.
我觉得最有趣的——好吧,你博客里有很多有趣内容。但特别引人注意的是关于强化学习与工具使用的部分。你说在强化学习实验中,让模型尝试搜索很容易。但如果使用工具没得到结果,它很快就会停止使用工具。我们能改变这点吗?
I think the most interesting, well, you have a lot of interesting things in your blogs. But one that stood out was about RL and tool use. You said that in an RL experiment it's easy to tell the model to try searching, but if it doesn't get results with the tool, it's gonna stop using the tool very rapidly. Can we impact that?
那么是否存在模型不会使用的好工具,导致它失败后放弃使用?是否存在应该先改进而非放弃的坏工具?人们该如何考虑工具设计、模型改进,以及干预点的选择?
So can there be a good tool that the model doesn't know how to use and then it kind of fails and then it stops using it? Can there be a bad tool that should be improved before giving up on it? How should people think about designing the tool, improving the model, and kinda like where to intervene?
这绝对是我近期想深入研究或已开始探索的新方向。特别是在2026年,开源领域的基础设施模型将迎来重大突破,我想在深度搜索类应用或需要多次推理调用的场景深入探索。回答你的问题——确实存在坏工具,也完全可能是模型使用方式错误。我期望模型具备的不是创造力,而是某种开放性:它并不完全清楚能从工具获得什么,这种不确定性会促使它尝试多种方案,这几乎像是经典强化学习行为。
This is definitely on the newer side of the things I wanna work on or have worked on. Particularly in 2026, especially on the open side, the infrastructure and models will catch up a lot, and I want to go deeper on this in terms of deeper-search-style things or very inference-heavy, multiple-call setups. And to answer your question: there definitely can be bad tools, and it definitely can be the model just using them wrong. Something I would want to see in a model is not necessarily creativity, but an openness: it doesn't know exactly what it'll get out of all of its tools, and that uncertainty pushes it to try a few different things. Which almost seems like classical RL behavior.
但想想语言模型的运作方式——它们总是高度...不能说自信,但回答时存在明确路径和方向。而推理标记带来的重大改变,就是引入了回溯等概念,这种对工具未知可能性的开放性非常重要。模型应该具备'要不试试这个?会得到什么?'的思维,尤其在开源模型领域——如果人们想用开源模型配合工具,必然是因为涉及私有数据存储等问题。
But if you think about what a language model does, they're not necessarily confident, but they have a path and a direction in their answer. The big change with these reasoning tokens is having the notion of backtracking and things like that, which is some sort of openness to the tools having unknowns, and it seems like a really nice thing for the model to have: oh, what if I tried this? What does it get? Yeah. Especially on the open-model side: if this is gonna work where people want to use open models with tools, it's gonna be because people have private data stores and stuff.
假设你要训练一个像GPT-3那样优秀的开源推理模型,但基于永远不会上传到云的私有记录——它必须思考'可以尝试多种方法摸索',而不是直接放弃。目前工具使用更像代码执行,或是实现某个既定步骤:先有计划,某步失败可能有备用方案,但缺乏'需要通过环境交互来制定计划'的迭代思维。
So if you were to train an open model that's gonna be a good reasoner like o3, but on private records of some sort that'll never get sent to the cloud, it needs to be thinking, I can try some things with this to get a sense for it before saying I have to give up. And tool use right now looks much more similar to code execution, or it's just a part of a sequential path you need to get through: have a plan, and if it fails at a certain step, I might have a backup. But it's not this iterative "I need to fiddle with the environment in order to come up with my plan."
这可能需要通过专门训练来实现。比如告诉模型:'你不知道这里面有什么,但答案可能就在其中'——这种提示很奇怪,但或许有效。
It's just that I it's something that people probably are gonna have to train into these models, which is like, you might just tell it. You're like, you don't know what is in this, but your answer might be in it, which is like a very odd prompt, but maybe it'll help.
没错。之前Anthropic的Erik Schluntz(在Claude Code之前负责Claude智能体)来节目时提到,他们大部分时间都花在工具设计上,之后模型自然就学会了使用。不知道你在实践中做过多少这类工作——你是在RL过程中逐个引入工具吗?你觉得这样有帮助?还是一次性提供所有工具让模型自行探索更好?
Yeah. When we had Erik Schluntz from Anthropic, who worked on the Claude agent before Claude Code, he mentioned they spent basically the majority of the time on the tool design, and then the model just kinda learns how to use it. I don't know how much you've worked on this stuff in practice, but are you putting the tools in one by one in the RL process? Do you think that helps? Or is it better to give all the tools and let the model explore?
我不太确定。我们还没完全走通这条路。可能取决于模型和起点——如果模型已擅长工具使用,泛化能力应该会更强。
I don't really know. Like, we haven't gotten this to work. I would say it would probably depend on the model and your starting point. If your starting point is already good at tools, it can probably generalize more. Yeah.
但如果你在进行这种奇怪的基座模型强化学习,就必须设置这种渐进式课程——当强化学习规模扩大到一定程度时,必然需要难度递增的课程体系。这相当显而易见。因此在这种情况下,可能是当系统难以解决某些问题时就会新增工具,这听起来很直观,但实际操作中却极难管理。因为你如何在训练过程中自动化判断何时该触发这种机制呢?
But if you're doing this weird base-model RL, you have to have this kind of curriculum. If you scale RL long enough, you're gonna need a curriculum of things getting harder, and that's pretty obvious. So in that case, it might be that tools get added when things become too hard for the model to solve certain questions, which sounds very intuitive but is really hard to manage in practice. Because what is your automated signal, on your training run, that it's time to do that?
这就是电子游戏设计精妙之处,它们会随着进度逐步解锁内容。但像搜索功能这类东西,比如只给你访问小型数据库还是开放整个互联网知识库,情况就不同了。
That's why video games are so good: they're designed to unlock things as you progress. But with things like search, you know, it matters whether you're given access to a small data store or to all the knowledge on the Internet.
没错。这对ARC-AGI团队开发v3基准测试是个很好的反馈——应该设置让语言模型在达到某个阈值后,需要学会使用世界中新执行器的任务。
Yeah. It's good feedback for the ARC-AGI people for the v3 benchmark: have things where the language model needs to learn to use new actuators in the world after a certain threshold. Oh.
那应该就是ARC-AGI 4了。
That would be ARC-AGI 4 then.
嗯,很可能。
Yeah. Probably.
我不...
I don't
知道。他们正在批量产出这些版本。
know. They're they're cranking them out.
确实在快速迭代。他们几周后还要办发布会。其实玩ARC-AGI挺有意思的,不知道你试过没有?
They're cranking them out. They're actually doing a launch party, I think, in a couple weeks. It's actually fun to play ARC-AGI. I don't know if you've tried.
哦,我还没试过。
Oh, I haven't.
挺有意思的。就像智商测试——我以前觉得这种测试没什么意义。但现在随着语言模型在这些测试上的分数显著提升,反而让人更感兴趣把自己的智力水平与之比较了。
It's pretty fun. They're like IQ tests. I used to be like, oh, they weren't that relevant. But actually, now that we have a gradient where LMs are significantly climbing them, it's really more interesting to compare your own intelligence.
我同意诺姆的观点:不用束缚装置。不用束缚装置?
I'm with Noam on no harnesses. No harnesses?
是的。
Yeah.
对。我是说,束缚装置很酷,但它们会极大地改变学习动态,成为一种限制。所以作为演示不错,但我认为核心重点必须是无束缚。
Yeah. I mean, harnesses are cool, but they're a handicap that's changing the learning dynamic substantially. So they make good demos, but I feel like the core thrust has to be no harnesses.
我是说,这总是像...说这些只是归纳偏差有错吗?对吧?它们不在模型里。当然。但任何你只是查看结果的行为都会污染...
I mean, it's always like, is it wrong to say that these are just inductive biases? Right? They're not in the model. Sure. But anything where you're just looking at the results contaminates
这完全是不同的任务。我想我做过...或者说我在Arc AGI和Greg讨论过这个,我告诉他,用束缚和不用束缚都要做。
This is just a different task. I think I talked with Greg about this at ARC-AGI, and I told him: do harness and no harness.
你们只是有不同的分类。
You just have different categories.
所以就像,你想保持透明并为前沿实验室建立目标?那就两者都做。我不觉得会稀释太多。无束缚显然更难,但你的基准测试会获得更大收益。
So it's like, you're trying to be transparent and build targets for Frontier Labs? Just do both. Like, I don't think it dilutes that much. The no harness is gonna obviously be harder, and then you just get more bang for your buck on your benchmark.
嗯。是的。是同样的数据集。既然说到工具,你之前对多工具RL近期工作(包括LOOP、ReTool、ToRL等)的总结很棒。我觉得这个领域现在研究价值极高。
Mhmm. Yeah. It's the same dataset. Staying on the topic of tools while we're at it, you had a really good summary of recent work in multi-tool RL, which had LOOP and ReTool and ToRL and all these other things. And I think this is an area that's super rich for research right now.
我想给你空间来强调:你最喜欢哪些?你认为人们应该探索什么方向?
I just wanted to give you the space to, like, highlight what are your favorites? What do you think that people should explore?
我可以分享一些我中等野心的趣味研究项目构想。你需要创造某种竞争动态或评估基准,而且范围必须比工业界做的窄得多。午餐时我提过:深度研究,但仅限arXiv论文。这样不需要完整索引,领域也有限。
I could share what my moderate-ambition, fun research project ideas would be. You want to create some sort of competitive dynamic or an eval, and it has to be so much narrower than what industry is doing. I told you this at lunch: deep research, but only over arXiv papers. So you don't have to build a full index; you have a limited domain.
你需要想办法衡量它。我认为学术界人士研究学术工具很有价值,因为他们具备极高的领域专长,已经了解现状。关键是把它做成对用户真正有用的东西,或是可供攀登的指标。这是即兴头脑风暴:把论文中的相关工作部分抽出来,仅保留文本并切断所有链接,然后做一个评估任务——用arXiv链接补全数百条相关工作。这是个有趣的深度研究式构想。
You have to figure out how to measure it. I think it's good for academics to work on academic tools, because they have very high domain expertise; they already know what's going on. Then figure out how to make that something that is either very useful to users, if it's gonna be good enough for that, or something you get to climb on. This is brainstorming on the fly, but: take the related-works sections out of papers, keep just the text and break all the links, and make an eval which is filling in hundreds of related works with arXiv links. That's a fun deep-research-style idea.
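这里提出的"补全相关工作链接"评估可以用一个极简的示意性框架勾勒出来(非节目原文,示例片段和arXiv编号均为虚构占位符):从文本中抹去arXiv编号作为题目,按槽位精确匹配来打分。
The "fill in the related-works links" eval proposed here can be sketched with a minimal illustrative harness (not from the episode; the snippet and arXiv IDs below are invented placeholders): mask arXiv IDs out of the text to form the task, then score predictions by exact match per citation slot.

```python
import re

# Invented placeholder snippet; real tasks would come from actual papers.
SNIPPET = (
    "Retrieval-augmented generation [arXiv:0000.00001] improves factuality, "
    "while tool-use RL [arXiv:0000.00002] teaches models when to search."
)

ARXIV_ID = re.compile(r"arXiv:\d{4}\.\d{5}")

def make_task(snippet: str):
    """Return (masked_text, gold_ids): the eval input and its answer key."""
    gold = ARXIV_ID.findall(snippet)
    masked = ARXIV_ID.sub("arXiv:????.?????", snippet)
    return masked, gold

def score(predicted_ids, gold_ids) -> float:
    """Exact-match accuracy over citation slots, order-sensitive."""
    hits = sum(p == g for p, g in zip(predicted_ids, gold_ids))
    return hits / len(gold_ids)

masked, gold = make_task(SNIPPET)
# A model would read `masked` and predict the missing IDs; here we fake one
# correct and one wrong prediction just to exercise the scorer.
print(score(["arXiv:0000.00001", "arXiv:9999.99999"], gold))  # 0.5
```

Because the answer key is generated mechanically from the papers themselves, this kind of eval scales without human labeling, which is the appeal of the idea.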
试试能否用开放模型配合工具在固定数据存储上实现。Ai2内部对此有过大量讨论:当前想在AI领域产生影响,学者必须从论文升级到实体成果,即模型、数据集和评估基准;数据集和评估基准更容易让人做出影响。接下来要考虑的是:人们在Ai2实际使用的是什么,特别是正在研发各类信息代理的Semantic Scholar团队。还有些方向我离得较远,无法列举全部名称。
See if you could do it with open models on a set data store with tools. Ai2 has gone through a lot of discussions on this: if you're trying to have impact in AI right now as an academic, you have to level up out of papers to artifacts, which is models, datasets, evals. Datasets and evals are easier for people to have impact with. And then the next thing is, what do people actually use at Ai2, especially on this Semantic Scholar team that's now working on information agents of different types. There's another area that I'm more distant from, so I don't have all the names.
但核心是:能否让开放模型在这方面做得更好?关键在于创造人们真正在乎的东西。若能拥有实际用户,影响力将跃升至全新层级。这对学者和小型机构确实颇具挑战。嗯。
But it's: can we make open models do that side of things better? Can you make something that people actually care about? Having actual users is a whole level of impact that's much higher. It's hard for academics and small institutions to do that. Mhmm.
但如果你在做智能体研究,自产自用(dogfooding)是可行的:比如给自己做一个好用的Slack摘要机器人?让这些智能体真正易于落地。这是一个方向。另一个方向是借助工具在"人类最后的考试"上不断爬坡。
But if you're working on agents, dogfooding is viable. Can we make ourselves a good Slack summary bot that we like, or something? Just making these agents really tractable. That's one direction. Another direction is just hill-climbing on Humanity's Last Exam with tools.
我认为学术界很难在技术指标上取胜,因为大公司即将投入数百万token处理单次查询。这需要海量算力,在等效浮点运算上超越他们极其困难。关于非结构化思考,我目前态度是稍后再议——在建模和我所称的'技能层级'(即如何通过推理实现推理时扩展并获得高评估分数)上仍有诸多问题待解。掌握这项能力后,便可将其迁移至更专业领域。
I just think it's kind of unlikely that we're gonna win on a state-of-the-art number as academics, because they're gonna start spending millions of tokens per query. It's just a lot of compute, and beating that on flop equivalents is gonna be so hard. Unstructured thoughts are something where I'm mostly like, okay, I'll get to this. I have more things to figure out on the modeling and what I call the skills level, which is just: how do you do reasoning to induce inference-time scaling and get high eval numbers. And once you know you can do that, you can take your knowledge with you to do it in more specific domains.
技能获取本身也是种技能,对吧?这就像ARC-AGI对AGI的定义。
There's skill in your skill acquisition, right? It's like the ARC-AGI definition of AGI.
我引用过这个定义。具体是什么?应该是效率——对,技能获取效率。因为我曾用三个词概括它。
I quoted it. What is it? It's, like, efficient, yeah, skill-acquisition efficiency. I used to describe it as three words.
没错。你在近期演讲中反复强调技能概念,是否愿意向听众重申这个理论框架?
Right. Yeah. Given your emphasis on skills in your recent talks, do you wanna reiterate that thesis for people to pick up on?
好的。我主要思考的是如何预判OpenAI等机构尚未部署的研发方向。在所有智能体研究中,规划能力似乎至关重要。因此需要建立分类体系,明确哪些推理模型训练会成为瓶颈。此前我提出过基础层级——技能,这已通过o1和r1项目验证:通过大量强化学习实现推理时扩展,从而获得极高基准分数。
Yeah. So I've been thinking mostly about trying to get ahead of what OpenAI, etcetera, are probably doing now, if it's not in their models already. And with all the agents, it seems that planning is a very critical task. So it's: how do you come up with a taxonomy of the different types of things you need to train into reasoning models, for when they'll be a bottleneck. The foundational one I came up with before was skills, which is what I would say we have already done with o1 and R1: you do a lot of RL, you show the inference-time scaling works, and you get really high benchmark numbers.
后续三个层级关乎进阶发展,大多围绕规划展开。在我的清单上,第三、四位是抽象与策略——刻意避开'规划'这个被过度使用的术语。策略决定模型行动方向,即技术层面的步骤规划;而抽象关乎如何将其分解为可执行方案。
And then the next three are kind of what comes next, and most of them are around planning. Three and four on my list were abstraction and strategy; I'm trying not to use "planning" because it's a word people already use a lot. Strategy would be the direction the model should go in, technically, what the steps of its plan are. And abstraction is how it breaks the task down into things it can actually solve.
最后第四点是校准问题,即不浪费算力并知道何时放弃转而询问用户,因为过度思考显然是个问题。通过增加推理时间很容易让评估分数持续提高,但最终用户并不需要这样的模型。他们想要的是更智能的训练机制,使模型能真正通过训练成比例地提升性能——关于过度思考等话题已有大量论文,我认为OpenAI关注这点是因为他们要承担GPU成本。如果GPT-3为大量用户无限循环运行,那显然不理想。
And then the fourth and last thing is calibration, which is just not wasting compute and knowing when to give up and ask the user things, because overthinking is obviously a problem. It's easy to keep getting your eval scores to go higher by using more inference-time scaling, but eventually that's not what people want in their models. They want a smarter training regime where the model is actually getting proportionately better for its training. There are a lot of papers on overthinking and stuff like this, which I think OpenAI cares about because they have to foot the GPU bill. If o3 just infinite-loops itself for a bunch of people, that's not good.
实际上会不会这样我不确定,但有可能。这些推理方法确实会让模型变得不稳定。不过这也引出了GPT-5的构想:如何让模型将问题路由到正确的处理路径?不一定是物理路由,而是让模型知道何时需要制定计划,何时可以直接回答。
Does it actually? I don't know, but it might. Okay. I mean, these reasoning methods definitely can make the models kind of unstable. But it's also the GPT-5 idea: how do you get a model that just routes the question to the right place? Maybe not necessarily a router, but it just knows if it needs to make a plan or if it can just answer.
如果你向DeepSeekR1提出复杂数学问题,它不会先说'这是我的解决计划',而是直接开始解答。理想的模型应该能判断何时需要声明'这是我的解决计划',何时需要建立记忆存储,或针对特定查询采用Claude式的代码处理方式。
If you look at DeepSeek R1 and you ask it a hard math question, it's not like, here's my plan of attack. It just starts. And you want a model that knows when to say, okay, here's my plan of attack: I might need to make myself a memory store; I might need to take a Claude Code approach for this query.
它可以建立记忆存储并启动并行搜索器后再返回。理论上这些都可以通过训练实现,因为搜索器或并行模型此时相当于工具。简单来说就是设置思考令牌和回答令牌,模型应能在思考或使用工具前选择性加入计划令牌,比如'这是基本要求,我需要完成这些任务',不同任务难度各有差异。
I'm gonna build a memory store and spin up some parallel searchers and then come back. Conceivably, this is all something you can train into a model, because the searchers or the parallel models could be tools in that case. The simple way to describe it is: we have something like thinking tokens and answer tokens, and the model should be able to optionally have plan tokens before thinking or before using tools. It's like, okay, here are the table stakes, I need to do these things, and these sorts of tasks will be harder versus easier.
这比某些激进的AI构想更可行。语言模型本就擅长制定计划,只需要被要求这样做——我打赌Claude和深度求索团队正在这么做:收到用户提示后,模型首先启动计划工具(Claude中就有这个功能)。
It seems more tractable than some far-out ideas for AI. A language model can write a good plan; it just needs to be asked to do so, and I would bet that Claude Code and Deep Research are doing this. You get a user prompt, and first the model, yeah, there's a plan tool in Claude Code.
他们会先分解问题,这是通过训练实现的。虽然DeepSeek可能没有内置该功能,但理论上可以做到。关键在于模型是否需要完全自主处理整个任务流程——它能否胜任这类工作?
And first they break it down; that is something they've trained into the models. I don't think DeepSeek has it built in, but it probably could do it. And just thinking about that interface: if the model needs to be able to do the task end to end on its own, can it do that sort of thing?
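上文描述的"可选计划步骤"可以用一个极简的控制器勾勒出来(示意性草图,非节目原文;`call_model`、`looks_hard`等名称均为虚构):对看起来困难的任务先生成计划再执行,简单任务直接作答,以避免过度思考。
The "optional plan step" described above can be sketched as a thin controller (an illustrative sketch, not from the episode; `call_model`, `looks_hard`, and all names here are invented): hard-looking tasks get a plan generated first and then executed, while easy queries go straight to an answer, avoiding overthinking.

```python
# Hypothetical stand-in for any LLM API call; a real system would query a model.
def call_model(prompt: str) -> str:
    return f"<response to: {prompt[:40]}>"

def looks_hard(task: str) -> bool:
    # Crude heuristic standing in for a learned routing decision
    # (in a trained model this would be the model's own calibration).
    return len(task.split()) > 12 or "implement" in task.lower()

def answer(task: str) -> str:
    if looks_hard(task):
        # Plan first, then execute with the plan in context: the "plan tokens" pattern.
        plan = call_model(f"Break this task into numbered steps: {task}")
        return call_model(f"Task: {task}\nPlan: {plan}\nNow carry out the plan.")
    # Easy queries skip planning entirely.
    return call_model(task)

print(answer("What year was Python released?"))
```

In a model trained end to end, the same routing would live inside the token stream rather than in an external controller, which is the fork in the road discussed next.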
我认为这种方案与'无约束'理念的协调难点在于:工程师们通常希望将计划和记忆建模为工具,没有专用计划令牌或记忆令牌,只有上下文环境。特别是计划功能,这样就能分发给其他智能体并行处理工具调用等任务。但这是否意味着我们面临道路分叉?是否必须做出选择:将功能外包给工具,还是保留在模型令牌体系内?
I think my challenge with reconciling this whole approach with the no-harnesses thing is that a lot of the way people, especially engineers, want to model it is that the plans and the memories are tools. There are no special plan tokens, there are no special memory tokens; it's just context, or just, you know, whatever. Specifically for planning, because then you can fan out to other agents for tool calls and stuff, so it doesn't have to be sequential. But I'm just like, is this a fork in the road? Do we have to make a real choice here as to whether we outsource things to tools?
或是将其保留在模型的原生令牌体系中?
Or do we keep it native within the model's tokens?
我不认为这是主观差异。提出计划概念主要是为了说明没有免费午餐——计划改进可能很平凡,比如我们发现Claude生成的计划存在缺陷,就提供更详细的任务分解数据来优化。
I don't think it's a subjective difference. I think mostly the planning idea is to make the point that people don't get things for free. And the planning improvements might be kind of mundane: we were prompting Claude and its plans were bad in this way, so let's give it some data where its plans are more detailed or break things down into more steps, so that it's easier for it. Yeah.
因为这本质上是个黑箱系统。如果缺乏针对性训练,性能表现就难以预测。对于开源模型,可能需要为不同环节训练专用模型——比如专门训练擅长规划的模型,这就需要相应数据支持。
Because it's effectively a black box. So if it hasn't been targeted, it's unclear what the performance will be. Or on the open-model side, it might just be the idea of having different models for different parts of it, where you're really training a model to just be good at planning. And that's data you need to come up with.
我是说,你好像只在那部分使用那个模型。
I mean, you're like, you only use that model for that one part of it.
方案是否更具可重用性,或许不该每次都重新生成?特别是在为某类任务编码时,你会想要相似类型的方案。所以也许不该每次都要求模型重新生成方案,而应该像有方案蓝图作为工具,由模型填充细节。你觉得平衡点应该在哪里?
Does it feel like plans are much more reusable and should maybe not be generated every time? Especially in coding, for certain sets of tasks, you wanna have similar types of plans. So maybe it's not right to ask the model to regenerate a plan every time. There should almost be plan blueprints as tools, and then the model fills them in. Where do you think the balance should be?
我认为
I think
它们是合理的。方案显然是个中间目标。只是这类规划层面似乎容易出现失败。就像流行的评分标准那样——你给个提示,语言模型就生成对应的评分标准(即需要做对的几个具体事项),这与为每个任务制定方案在概念上非常相似。
they're reasonable. A plan is obviously an intermediate goal. It just seems likely that there are failures at this kind of planning level. The same thing goes for the rubrics that are popular: a lot of the popular technique for so-called rubric things is that you have a prompt and you have a language model generate a rubric for that prompt, which is a few specific things it needs to get right. And that's conceptually very similar to making a plan for every task.
我认为评分与执行需要不同类型的抽象。但人们发现相对效果而言,直接生成更经济。方案不会太长,token数也不多,所以就像'好,我们这么做'。把它归入我的分类法可能高估了——它只需是个提示,你只需确保模型在提示阶段别太离谱。
I think with grading you're gonna have a different type of abstraction than with executing. But what people are seeing is that it's cheap, relative to the effectiveness, to just generate it. Plans are not super long, they're not that many tokens, so it's probably just kind of like, okay, we do this. Putting it in my taxonomy might be overselling it, where it just needs to be a prompt, and you just need to make sure your model's not too weird at that prompting stage.
顺便说下,你的分类法超级有用:技能、校准、策略、抽象。感觉抽象可能是最被低估或最难解决的。你口头解释的和博客写的不太一样,你说它本质是不要过度思考。
I think your taxonomy is super useful, by the way: skills, calibration, strategy, abstraction. I feel like abstraction might be the most underrated one, or the hardest to solve. The way you introduced it here was different from how you wrote it in your blog post; you said it was basically about not overthinking.
那是校准。抽象是关于拆解事物。
That's calibration. Yeah. Abstraction is about breaking things down.
对。策略和抽象在最难任务上最合理——那些我们不确定模型能否完成的任务。如果给模型分配不确定能否实现的任务,策略就很重要,因为它必须非常具体和狭窄。而如果是写普通代码或深度研究,方案反而不那么重要。
Yeah. I think both of these, strategy and abstraction, make the most sense on the hardest tasks, the ones we don't know if the model can do. Right. If you're assigning a task to a model and you don't know if it can implement it, the strategy is very important, because it needs to be very specific and narrow. Whereas if it's doing mundane code or deep research, the plan is actually not that interesting a thing.
没错。但当你处于能力边界时——比如搞GPU实现,你可以相信OpenAI和Anthropic的说法:'帮我在复杂分布式GPU系统里实现这个研究想法'。天,这对人类都难。AI要制定正确方案来调试和完成,路径非常狭窄。
Yeah. But when you're at the frontier of whether it can, like, I don't know, some GPU-implementation thing, you could buy into the OpenAI and Anthropic narrative: help me implement this research idea in our complex, more distributed GPU setup. Oh my god, this is a task that's hard for a human. And for an AI to come up with the right plan to debug and do this, it's a very narrow path.
因此策略很重要:是否从某些测试开始?如何构建出复杂性?显然我需要找更好的例子,但越往深处推,越会发现真正能解决问题的方案寥寥无几。而抽象的重要性随任务规模增大而凸显——几乎像提示工程。
So therefore, the strategy is pretty important: does it start with certain tests, and how does it actually build this out to complexity? It's obvious that I need to come up with better examples for this, but I think as you push it, it's more natural to see that there are only a few plans that actually get it done. And then abstraction becomes important as your task gets so big. It's almost a prompt-engineering thing. Yeah.
这就好比,你只能生成10万个token。你需要确保模型能将其分解。所以它不会在自身下生成大量无限进程,我确实认为抽象是个有趣的概念,特别是当你开始思考这些模型可以调用其他模型为其执行任务,或者某些部分可以通过多轮搜索或更多计算来并行处理时。我觉得这某种程度上属于抽象范畴,就像你如何解决某个具体问题块?我必须承认,我在这方面没有实际构建经验。
And it's like, you only have a 100k tokens you can generate. Like, you need to make sure the model breaks it down, so it's not just spawning a ton of infinite processes under itself. Which I do agree that abstraction is an interesting one, especially when you start to think about these models that could call in other models to do some task for it, or parts that can be parallelized with, like, multiple searches or just more compute. I think that kind of folds into abstraction, which is just, like, how do you approach a certain nugget of the problem? And I definitely say, like, I don't have experience building this.
感觉如果要可视化AI执行最困难的软件或其他任务,这些都是人类非常擅长的。比如,如何在十周内制定研究计划?如何优先安排哪些实验?这里面涉及很多归纳偏置,而目前语言模型并不擅长这些。
It just feels like if you're gonna visualize AI doing the hardest software or other tasks, it's something that humans are very good about. So it's like, how do you come up with a research plan in ten weeks? Like, there's a lot of how do you prioritize which experiments to do? It's like there's a lot of inductive biases that go into that that I don't like, a language model would not do well at that right now.
或许记忆会有帮助。就像我们现实生活中的做法——通过积累经验来跳过某些步骤。
Probably memory would be helpful there. So you can just skip like, the way we do this in real life is we could accumulate experience.
是啊。
Yeah.
有件事我想深入探讨的是并行处理本身。以O1和Q Star理论为例,之前有个案例某种程度上被过度炒作。但现在随着O1 Pro和DeepThink又卷土重来。这个理论至少——如果我错了请纠正——本质上他们运行O1八次,然后用奖励模型评分,最后给你八次中最好的结果。
One thing I did want to dive in on was just parallelism in general. There's one case where, with o1 and the sort of Q* ideas, it was sort of overhyped in some sense. But now it's coming back with o1 pro and Deep Think. The theory is, at least, correct me if I'm wrong: basically, they run o1 eight times, then they have a reward model rate the outputs, and then they give you the best of the eight.
对吧?
Yeah?
差不多是这样。
Something like that.
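The best-of-N scheme being described can be sketched in a few lines; `generate` and `reward_model` here are hypothetical stand-ins for a sampler and a learned scorer, not anyone's actual pipeline:

```python
import random

def generate(prompt: str) -> str:
    # Hypothetical stand-in for one sampled model completion.
    return f"answer-{random.randint(0, 9)} to {prompt!r}"

def reward_model(prompt: str, completion: str) -> float:
    # Hypothetical stand-in for a learned scorer over (prompt, completion).
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n completions (in parallel in a real system) and keep the
    one the reward model scores highest."""
    completions = [generate(prompt) for _ in range(n)]
    return max(completions, key=lambda c: reward_model(prompt, c))

print(best_of_n("What is 2 + 2?", n=8))
```

The samples are independent, so the n generations can run concurrently; only the argmax over reward scores is sequential.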
差不多吧。Deep Think也类似,更多细节我们不得而知。我觉得至少在推理服务提供方那边,有很多人在探索如何并行化搜索和规划。不过我担心大家对此过度兴奋。
Something like that. Deep Think also the same. We don't know any details beyond that. I think there's a lot of people exploring that, at least on the inference provider side, of, you know, how do we parallelize search and planning and all that. And I'm worried about getting too hyped about it.
从逻辑上看这很合理,就像当初MCTS也看似非常合理,结果我们被误导了。当然我不...
I think it makes a lot of logical sense and this is one of those things where MCTS also made a lot of logical sense and we were fooled. Well, I don't
认为我们使用并行计算是为了搜索低概率token。我们用它来获得稳健性。比如O1 Pro的妙处就在于它即使在冷门话题上也能保持可预测的深度,而有些模型在这些地方会突然失效。
think we're using parallel compute in a way to search over, like, low-probability tokens. We're using it to get robustness. Like, o1 pro was so nice because it just had a very predictable depth to it, even on niche topics where, like, sometimes models just fail.
对。你之前提到一些数字,比如从10%左右飙升到了95%之类的。
Yeah. You you had some numbers that it went to, like, from, like, 10 to, like, 95% or something.
具体数字我记不清了,但感觉就是这样。启用o three pro并不会让找到冷门信息的概率提升十倍。可能概率会高一些,但我们并没有获得那种能拓展搜索广度或深度的效果。所以我认为它的价值在于——我们可以把这种并行计算用在最关键的内容生成上,比如明确某部分特别重要时,就多投入些资源让这些内容更优质。
I don't remember the exact numbers, but that's what it feels like. It doesn't feel like you turn on o3 pro to make it 10 times more likely to find some niche piece of information. Like, maybe it'll be a bit more likely, but we're not getting that type of, like, searchy notion of getting more breadth or depth into our tree. So I think there's value to it, where we wanna use this parallelism on what are either, like, the most important tokens that we're generating, or, like, okay, I know this part is crucial, let's just spend a bit more so that those tokens are better.
但这并非颠覆性改变。真正具有变革潜力的可能是验证器的改进——验证器能改变推理时间的斜率。推理时消耗的token越多,验证器效果就越好。如果采用并行计算,就能捕捉到罕见情况。
But it's not a, like, transformative thing. The part that's potentially interesting on the transformative side is, like, if you can get much better verifiers. So I think of verifiers of changing the slope of inference time scaling. You spend more tokens at inference, the better verifier you have. If you're doing parallel, it can extract a rare occurrence.
比如现阶段,如果验证器仅擅长人类偏好判断,那我们就不需要大幅提升。但如果我们要生成高度多样化的内容,更好的验证器就能发挥更大作用。可以想象奖励模型和预言机的区别:预言机搜索得越多效果越好,斜率理想;而奖励模型的信号存在上限。
So like right now, if our verifiers are only good at, like, human preference, it's like, okay, we don't need to crank that up very much. But if we are doing really diverse generations and your verifier is better, it'll do better. I think you could look at the extreme between a reward model and an oracle, where the oracle is: the more you search, eventually it works. So the slope is good. But a reward model is like, there's really a capped signal
嗯。
Mhmm.
至少在偏好判断领域是这样。所以斜率很平缓,存在收益递减。因此我认为如果能开发更有趣的验证器,并行计算还有更大潜力可挖。但就目前来看,它还不具备变革性。
Out of it. At least if you're doing this preference type of thing. So the slope is pretty minor and it kinda has diminishing returns. So I do think that, like, if you could fill that with more interesting verifiers, there's potentially more to get out of parallel compute. But I I don't think it is, like, as transformative right now on my Outlook.
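The oracle-versus-reward-model contrast has a simple toy form: with a perfect verifier, N parallel samples succeed if any single sample is correct, while an imperfect reward model caps how much of that you can actually harvest. A minimal sketch with made-up numbers (per-sample success 0.1, scorer picks the correct sample 60% of the time):

```python
def oracle_success(p: float, n: int) -> float:
    """With a perfect verifier, you succeed if ANY of n samples is correct."""
    return 1 - (1 - p) ** n

def reward_model_success(p: float, n: int, pick_acc: float) -> float:
    """With an imperfect reward model, you only succeed when a correct
    sample exists AND the scorer actually selects it."""
    return oracle_success(p, n) * pick_acc

# The oracle curve keeps climbing with n; the reward-model curve is
# capped at pick_acc, so its returns diminish much earlier.
for n in (1, 4, 16, 64):
    print(n, round(oracle_success(0.1, n), 3),
          round(reward_model_success(0.1, n, 0.6), 3))
```

This is what "changing the slope of inference-time scaling" looks like in the crudest possible model: a better verifier raises `pick_acc` and with it the ceiling on parallel compute.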
并行智能体的意义更在于——如果能很好地分解抽象任务,它更适合作为吞吐引擎来处理耗时任务,而非追求峰值性能。
It's more like, parallel agents make more sense if you can break the abstraction down nicely, like, as a throughput engine if our tasks are taking a long time, rather than as a peak-performance engine.
明白。
Okay.
对。这也契合智能体与模型的区别:智能体更注重完成任务本身,强调鲁棒性和速度;而模型侧重单次生成——能否给出正确答案?
Yeah. Which kind of fits with the whole agent-versus-model thing, where agents are much more about, like, getting it done at all, being robust and being fast, whereas the model is one generation: can you get the answer right?
没错。我会再花点时间研究这个,之后就可以继续了。我的反驳意见是:这是种提前验证未来假设模型的方法,之后还能进行知识蒸馏。这确实不错。
Yep. I will spend a little bit more time on this and I'm happy to move on. My pushback or counter to this is that it's a way to pull forward a hypothetical future model that you can then distill from. Yeah. Which is nice.
嗯,我敢打赌人们肯定会把这些用于合成数据。合成数据的边际收益总是非常高。或者就像阿曼达·阿斯凯尔(Amanda Askell)说的,更好的提示工程能让你仿佛拥有下一代模型。大多数人其实没在提示词上下功夫。
Well, I bet people... I mean, they surely will use these for synthetic data. It's just like, the marginal gain on synthetic data is always very high. Or, like, Amanda Askell will say better prompting will effectively make it seem like you have the next-generation model. Like, most people don't put effort into their prompts.
哦,天啊。好吧。
Oh, god. Okay.
或者她在某次访谈中说过类似的话。所以关键在于,如果你能真正摸索出如何让模型进入某些特定状态。
Or she had said something along those lines in one of her Anthropic interviews. So, just like, if you can really figure out how to get into the certain states of the model.
对对。总之,这就是我认为值得做的理由。我还有个关于量子模型的科幻故事想写——在那个世界里,我们可以低成本探索多重宇宙,然后提取出可行的那个。虽然听起来太科幻,但如果真能掌控量子计算来实现这个并规模化,应该会挺酷的。
Yeah yeah. Well, anyway, that that that's my pitch for, like, why this is worth doing at all and, like, you know, I have a science fiction story that I wanna write about quantum models. In a world where, like, we you could explore cheaply multiple universes then, like, you know, sort of pull forward the right one, that would work. This sounds too science fiction y, but I feel like in a world where we could control quantum computing well enough to explore this and scale it up enough, it could be kinda cool.
并行计算也可能催生有趣的创新类型。比如...我不确定,但并行计算对一次性生成所有token的扩散语言模型意味着什么?这会实质性改变某些应用场景吗?我觉得如果扩散语言模型能实现应该会很有趣。
It also could be that parallel compute is grounds for interesting types of innovation. Like, I'd like I don't know, like, what does it mean to have parallel compute with diffusion language models that generate all their tokens at once? Like, does that meaningfully change some sort of application? I don't really know. I think it would be like, the diffusion language model would be fun if it works.
这样你在推理时间缩放上就有更大掌控力。Gemini就有这个功能,虽然具体改变还难以评估。但有了这些调节手段后,我希望它能推动些有趣的创新,毕竟并行计算是新领域,架构可以不断演进。我们拭目以待。
So you kind of have much more control over inference time scaling. I mean, like, Gemini has one. It's like hard to suss out what it changes. But once we have all these knobs, I'm hopeful that it helps build some interesting types of, innovation because like the parallel stuff is new, architectures can change. We'll see.
我一直在用Codex的best-of-N功能,感觉大部分生成结果彼此间只有5%的差异。
I've been using the Codex best-of-N thing, and I feel like most of the generations are, like, you know, 5% different from each other.
因为你用Ruby。不不不,我...
Because you use Ruby. No. No. No. I have
我用的是JavaScript版本。按理说应该没问题。不知道是不是RL编码的特性——我注意到这些模型遇到缺失变量时总爱用if语句来避免运行失败,我觉得这恰恰暴露了RL的缺陷。
a JavaScript one. I have a JavaScript one, so I should be good there. I don't know if it's, like, just how the RL on code works. One thing that I've noticed: these models always wanna do if statements when there's, like, a missing variable, so that it doesn't fail when it runs. And I feel like, to me, that's just a symptom of the RL.
是啊,这代码太糟糕了。就不该这么写代码——缺失变量时不该静默失败,应该直接报错。
Yeah. The code is terrible. Like, no you should not write code. Like, it shouldn't silently fail if there's missing variable. It should just raise an error.
是啊。但我觉得这个RL似乎在把代码往这个方向推,然后所有生成的内容都有相同的模式。你看,我生成四个东西,它们全用了if语句,只是分散在不同位置。
Yeah. But I feel like the RL is, like, pushing the code in this direction, and then all the generations have the same pattern. You know, I generate four things, and all of them use the if statement, just in different pieces.
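The anti-pattern being described, versus the fail-loudly alternative, in a toy sketch (the function and config-key names here are made up for illustration):

```python
# The pattern they're complaining about: guard every possibly-missing
# variable so the script never crashes, and silently do nothing instead.
def render_greeting_guarded(config: dict) -> str:
    if "username" in config:  # silent skip: the bug stays hidden
        return f"Hello, {config['username']}!"
    return ""

# What a human reviewer would usually prefer: fail loudly at the point
# where the assumption is violated, so the missing variable gets noticed.
def render_greeting_strict(config: dict) -> str:
    if "username" not in config:
        raise KeyError("config is missing required key 'username'")
    return f"Hello, {config['username']}!"

print(render_greeting_guarded({}))  # -> "" (failure hidden)
# render_greeting_strict({}) would raise KeyError immediately.
```

The guarded version maximizes "runs without error", which is roughly what a pass/fail training signal rewards; the strict version maximizes debuggability, which that signal never sees.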
对。这个问题我肯定会克服的。就像实验室在性能大幅提升和可用性轻微受损之间权衡取舍。问题是——你要发布那个模型吗?是的。
Yeah. That's something they will definitely get over. It's just, like, the labs are trading off massive gains in performance for small detriments in usability. And it's: do you ship that model? Yeah.
比如可以先发布再处理后续问题,但我相信他们肯定能...我确定这是个可修复的问题。
Like, you can ship it and deal with it later, but I'm sure that's a fixable thing.
对我来说核心问题是:你说某些部分有改进,但整体流程未必。你觉得这些是典型案例吗?还是说如果我们延长开发周期——不仅要写这段代码,还要考虑后期维护和运行方式——问题就能解决?我有点难以把握。
I think, to me, that's the question: you talk about how you have gains in, like, pieces of the thing, but not in the full trajectory sometimes. Do you feel like these are examples of that? Or do you feel like, as we get better, if we did a longer trajectory, where instead of just writing this piece of code you have to think about how you're gonna maintain it later and how it's gonna run, that's gonna fix it? It's hard for me to grasp.
确实。软件问题很棘手,因为可维护性几乎像是人类主观偏好问题
Yeah. The software stuff is not easy because it's almost like maintainability almost feels like a human preference type
没错。
Right.
又来了。别人可能看着说'这不够好',但添加启发式规则和权衡会非常混乱。所以或许...或许确实如此。
Issue again. Where somebody could look at it and be like, yeah, that's not as good. But adding the heuristic and trading that off seems very messy. Yeah. So maybe it is.
我...我不确定。这里有很多值得深挖的。Anthropic说他们在专攻代码领域,但这具体意味着什么?很大程度是设计上的权衡——模型自主权与长期训练副作用之间的平衡,而我们尚不知如何消除这些副作用。
I don't know. There's a lot more to dig into there. I mean, this is what Anthropic says they're doing, and just, what are the actual frontiers in making... like, they said they're working on code only, and what does that actually mean? Mhmm. A bunch of it is gonna be design trade-offs, like how much autonomy the model has versus these potential side effects from training longer that we don't know how to get rid of.
这种表现确实可能存在。我觉得这是相对容易修正的问题——可能模型只是执着于某种会在后续出问题的代码格式。就像大家见过的仅通过测试用例的情况。实际可能更复杂,但也可能非常简单。
I mean, that definitely could be... a behavior like that is what I would say is a simple thing to remove, where it might just be obsessed with some code format that fails when you revisit it or something. Everyone has seen it with just bypassing test cases. I think it'll be a bit more nuanced than that, but it could probably be super simple.
这个话题让我联想到过度优化的语义内涵,就是你之前写过的那种情况。
This topic has similar semantic content for me as over-optimization, which is something that you've written about.
这是一种带有不同奖励函数的过度优化。
It is over-optimization with a different reward function.
我我知道。我好吧。嗯,我我建立了那个链接。我想确认我们是在同一波长上讨论。我只是想再次回顾一下你花时间思考过的具体话题。
I know. Okay. Well, I made that link. I wanna verify that we're picking on the same wavelength. I just wanted to go over, again, specific topics on things that you've spent some time thinking about.
你写道有三种过度优化类型。第一种是用于控制的强化学习,第二种是RLHF(人类反馈强化学习),第三种是RLVR。它们总是会发生。显然,强化学习对奖励黑客并不陌生。但也许你想详细说明一下,我们作为一个行业是如何在这方面不断学习演进的?
You write that there are three types of over-optimization. The first was RL for control, the second was RLHF, and the third is RLVR. They always happen. Obviously, RL is no stranger to reward hacking. But, like, maybe, do you wanna elaborate on how things are evolving in terms of how we're learning as an industry?
是的。那三种分类是为了让人们理解历史上发生的事情。所有这些过度优化都只是因为模型优化器足够强大,可以操纵代理与环境的关系,或以对其目标信号有用的方式操纵环境。另外,就上下文而言,我认为我们在语言模型和强化学习中所做的是,如果有东西能提高其奖励信号,它就会选择最容易、最直接的途径来提升那个信号。
Yeah. So that three-part breakdown is for people to put the pieces together for what has happened historically. All of these over-optimizations are just: the model optimizer is strong enough that it can manipulate the agent with respect to the environment, or manipulate the environment in a way that's useful to its target signal. Also, like, for context, I think with what we're doing with language models in RL in general, if there's something that can move its reward signal up, it'll take the easiest, most direct path to move that signal up.
这就是我之前提到的奉承现象的一部分,即用户反馈的奖励模型可能太明显了,人类就是喜欢那些,比如人们会按下大拇指
So that's part of the story that I said on sycophancy, which is: this reward model for user feedback was probably so obvious, humans just like to like stuff, like, people press that thumbs-up
按钮当他们看到填充的要点时。
button when they're Filled bullet points.
是的。而且,这些对模型来说都非常容易提取。所以,一旦添加了这些,模型变化很大,分数上升很多,强化学习很容易找到这一点。在控制领域,最古老的强化学习中,环境通常是固定的模拟器,没有反馈。
Yeah. And, like, all those things have just been really easy for the model to extract. So, like, once they added it, the model changed a lot, and the score went up a lot, and it was easy for the RL to find that. In Control, the oldest RL, the environment is normally a simulator that is fixed. There's no feedback.
所以过度优化表现为不符合物理规律的荒谬行为。比如摩托艇绕圈的例子。还有一个我作为中间作者参与的项目,实际上是过度优化了Half-Cheetah(MuJoCo里的环境)。它没有跑步,而是翻着跟头冲向夕阳,刷出了无限高的分数。这显然不是预期的目的。
So the over-optimization looks like unphysical and nonsensical behaviors. There's the motorboat example going in circles. There's an example from a project that I was a middle author on, which was effectively over-optimizing, like, Half-Cheetah, which is this MuJoCo thing. Instead of running, it cartwheeled off into the sunset and got, like, infinite numbers. It's, like, obviously not the intended purpose.
看起来像个故障。所以它只是在操纵代理与环境的接口。RLHF是一个经典案例,模型会因为奖励模型不完善而崩溃。所以在RLHF案例中,环境非常不完善
It looks like a glitch. So it's just kind of manipulating the the agent interface with the environment. RLHF is kind of a classic case where the model will just break down because the reward model is imperfect. So like the environment is really imperfect in the RLHF case where
它太稀疏了。非常人为。
It's so sparse. It's like very artificial.
是的。这是一个非常人为的环境。因此,这些由生成的标记产生的行为会逐渐简化为重复同一个标记,这是有道理的。就像我们早期在Hugging Face玩这个时的一个例子,模型只会不断输出“JavaScript”。它会变成JavaScript JavaScript JavaScript。
Yeah. It's a very artificial environment. So it makes sense that these actions which are generated tokens will do things like reduce into just repeating one token over again. It'll be like I think one of the early examples we had playing with this at Hugging Face was the model would just say JavaScript. It would be JavaScript JavaScript JavaScript.
当时就像,哦,某个玩具数据集。当你看到时非常明显。但当你处于顶层,在进行大量RLHF时决定何时停止训练,可能更难察觉。但这是人们经历过的阶段,现在我们进入了RLVR阶段,即当模型做“正确”的事情时给予奖励。对于数学来说,过度优化可能更难,除非你有工具且模型学会了搜索和作弊而不是学习数学——我确信现实中有人遇到过这种情况。
It was like, oh, some toy dataset. And it's very obvious when you see it. It's probably harder to see when you're at the top and making design decisions on when to stop training if you're doing a lot of RLHF. But that was kind of the phase that people have gone through and now we're in the RLVR phase, which is we're giving the model reward when it does something quote unquote right. For math, it's a bit harder to over optimize, I think unless you have tools and the model learns to search and cheat instead of learning math, which I'm sure somebody could see that out in the world.
就像,模型会说‘哦,你正在用斯坦福CS某课程的习题集训练我,这题我见过上千次了。所以我直接去翻答案手册吧’——我肯定有人能找到真实发生的例子。但在代码和信息检索上更容易糊弄。比如通过单元测试的最简单方法就是写个pass——模型学会这种操作并不奇怪。
But it's just, like: the model's like, oh, you're training me on Stanford's problem set for CS-whatever, which it's seen a thousand times, so I'll just go get the solution manual. I'm sure somebody can find an example where that has truly happened. But on code, and maybe information retrieval, it's easier to fudge. So the code thing is: the easiest way to get a unit test to pass is to just put a `pass` in it. Like, that is not too surprising that a model can learn how to do that.
因此在代码领域需要更精细的奖励设计。我认为一个重要的学术工作应该是:如何设计代码奖励来平衡这种对测试用例的过度优化或避免失败等问题。虽然模型很复杂难以完全控制,但应该能在某些方面复现这种现象。
And therefore, for code, you need more reward design. What I think would be nice as a substantial academic work is: what does reward design in code look like, for balancing or understanding this over-optimization of test cases, or avoiding failures, or something like this. It's not necessarily gonna be a controlled environment, because these models are complicated, but I would guess you can reproduce that in some ways.
再确认下,奖励设计是指比如对部分正确的工作给予部分分数?
Just to double click, reward design means, like, for example, giving credit partial credit for partially correct work.
对。或者当检测到模型在单元测试中作弊时给予轻微惩罚。是的,针对作弊行为。
Yes. Or, like, giving the model a slight penalty for doing the unit test thing if you can detect it. Yeah. For cheating. Yeah.
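A minimal sketch of that kind of reward design: partial credit per test passed, minus a penalty when a crude cheat heuristic fires. The detector here is deliberately simplistic and purely illustrative, not a real training reward:

```python
def code_reward(tests_passed: int, tests_total: int, solution: str) -> float:
    """Toy reward for a generated solution: partial credit for passing
    tests, minus a penalty if a cheap cheat heuristic fires."""
    partial = tests_passed / tests_total  # fraction of tests passed
    # Crude cheat detector: a function body that is literally just `pass`
    # trivially "passes" an empty test harness without doing the work.
    cheated = solution.strip().endswith("pass")
    penalty = 0.5 if cheated else 0.0
    return max(0.0, partial - penalty)

print(code_reward(3, 4, "def solve(x):\n    return x * 2"))  # 0.75
print(code_reward(4, 4, "def solve(x):\n    pass"))          # 0.5
```

Even this toy shows the design tension: the penalty has to be large enough that cheating scores below honest partial credit, or the optimizer will still prefer the cheat.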
这使得训练这些模型比数学复杂得多,数学只需判断答案对错。比如看GRPO就知道,部分得分在其中会变得很怪,因为它是按批次组做归一化的。我不确定自己对此有没有一套完整的说法。当混合不同领域时会更复杂:代码的部分得分是否优于数学的部分得分?奖励设计变得极其复杂,而这决定了你激励模型去做什么。
Which adds a lot of complexity to training these models compared to math, which is just: is the answer right? I mean, you can look at GRPO, and partial credit is weird in that because it's kind of normalized per batch. I don't know if I have a whole spiel ready on that. But it also just becomes very complicated if you're mixing domains, and it's like, is partial credit in code better than partial credit in math, and all of these things? Reward design becomes very complicated, and that's what incentivizes the models to do different things.
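The per-batch normalization point can be seen in a toy version of GRPO's group-normalized advantage: shifting every partial-credit score in a group by the same amount leaves the advantages unchanged, so only the relative ranking within the group matters. A sketch under that simplified reading:

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: normalize rewards within the group of
    completions sampled for one prompt (subtract mean, divide by std)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Partial-credit groups [0.2, 0.4] and [0.6, 0.8] yield the SAME
# advantages: the absolute scale of partial credit washes out.
print(grpo_advantages([0.2, 0.4]))
print(grpo_advantages([0.6, 0.8]))
```

This is why comparing "partial credit in code" against "partial credit in math" is tricky: after group normalization, the raw reward scales you designed per domain are not what the gradient actually sees.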
明白。是否有关于混合这些方法的文献或假设?比如单独有效的代码验证器、数学验证器等,它们组合时会冲突吗?
Yep. Is there any literature or hypotheses about mixing these things? So let's say you have the the one for code, you have the one for math, you have whatever other verifiers you can come up with, and individually they work? Do they conflict?
RLVR的直觉在于模型能识别提示领域——这就是为什么仅训练数学或精确指令遵循时,模型在知识基准上不会变差。模型会形成对不同提示空间的直觉,梯度更新因批次而异,所以人们建议用大批次来激活更多模型参数并获得更清晰的RL信号。另一个直觉是模型能自行处理这些。关于训练顺序也有趣:是否先进行大规模数学和代码RL训练序列长度,再加入通用内容?
I think part of the intuition of RLVR is that the model is good at knowing which prompt area it is in, which is why the models, like, don't get worse on knowledge benchmarks if you're training on just math or precise instruction following. So the model just kind of develops an intuition for where the different prompts are in space. The gradient updates will be different depending on your batches, which is partially why people just say do big batches, so, like, a lot of the model is activated and you have a less noisy signal with RL. But a lot of the intuition is that the model just kind of handles that. And there's interesting questions on sequencing, like, do you do large-scale math and code RL to get the sequence length, and then add in more general stuff?
DeepSeek报告提到这点:从数学和代码过渡到更通用的RL。问题在于何时引入工具(如代码执行和搜索),是在交织阶段还是后续阶段。懂了。
Yeah. Which DeepSeek mentioned; that's one thing to go to the DeepSeek report for: it's, like, math and code, then more general RL. There's a question of where you do tools, if you're gonna do, like, code execution and search within this. So I don't know if that's interleaved or if it's a second stage. Got it.
是啊。
Yeah.
我我我对此没有评论。只是觉得,有这么多未知的东西需要大量计算来消融实验,这挺让人惊讶的。
I I I don't have comments there. It's just like, it's surprising how much is not known and you just need a lot of compute for ablations.
长推理生成的推理过程简直会摧毁所有基础设施。生成的令牌数量太多了,更容易出现内存溢出或其他问题。
The high-inference-length generations definitely just, like, kinda break all infrastructure. There's just so many tokens. It's more opportunity for out-of-memory or other things to go
出错。
wrong.
所以默认情况下,所有训练任务都需要更多GPU来应对推理内存需求。当然,或者单纯训练本身也是如此。这只会让事情更麻烦。
So it's like just on a default, all of your training jobs need way more GPUs for the memory of inference. Sure. And or just like training. It's just it's just makes it more of a pain.
没错。这是个成本问题。你知道,Noam那期播客(你也听了)有个可能引发争议的结论:从环境中获取反馈本身就需要实打实的时钟时间,特别是涉及现实世界场景时。我当时就觉得,确实存在某个临界点,训练周期总不能比人类寿命还长吧,这就是我说的那道墙。他不同意这点,但这就是我想表达的
Yeah. That's a cost thing. You know, one of the maybe controversial takeaways from the Noam pod, which you listened to, was that there's also just the wall-clock time of getting feedback from the environment, whatever that is, especially if it's, like, a real-world thing. And I'm just like, yeah, there's some point at which your training runs cannot take longer than, like, a human life. So to me, that was the wall. He disagreed with that, but that was what I meant by
这个。
it.
就像长推理最终总得在合理时间内终止,这是用户的基本需求。我们必须在训练过程中找到比宇宙时间流逝更快的加速方法。
Like, at some point with long inference, you do want it to terminate within some reasonable amount of time regardless, just as a user. Yeah. We have to find a way to accelerate internally, within the training time, faster than the passage of time in the actual universe.
对。我倒不担心这个问题,但原则上同意你的观点。
Yeah. We're not I'm not worried about that problem, but I agree with you in principle. Right.
所以我...我扯太远了。明白了,明白了。
So I'm stretching this out too far. I get it. I get it.
当我们接近尾声时,还有哪些值得人们探索的有趣想法?比如在你的AI演讲中提到的,关于URL扩展的思考,你们拥有大型多领域数据集、筛选困难、运行时间长等问题。是否有某些具体方向,无论是研究者还是创业者,那些你不想亲自做但希望其他人去探索的有趣领域?
As we kind of start wrapping up, what are other interesting ideas that people should pursue? Like in your AI talk, you said, what I'm thinking about for scaling URL, you had big multi domain datasets, difficulty filtering, long run times. Is there anything specific that if there's people out there that are either doing research or they wanna do a company or whatever, these are like interesting things that you don't wanna do, that you want other people to explore?
其中大部分我认为不在推理领域。但就像他们演讲聚焦推理那样——我长期在谈论性格训练是个被低估的方向,最近还在指导一个学生研究这个。性格层面的?就是人格训练。
Most of them, I think, are not in the reasoning space. But, like, the talks have been about reasoning. So I've been long saying that character training is something that I think is under-indexed on, and I've been advising a student that's... Character level? Like, personality training. Okay.
以及如何通过提示、激活干预、微调或数据工程等不同方式改变模型个性。就像Joanne Jang在OpenAI做的工作。这类方法到底有多重要?
And, like, different ways of changing the personality of the model, from prompting to activations to fine-tuning. Okay. Or, like, data engineering. So, stuff like what Joanne Jang does for OpenAI. Like, how much does that matter?
哪些是基础性研究?希望我能分享更多这方面的内容,而不仅是给学生建议。这个观点我已经坚持很久了。
What are the fundamental research things? Hopefully, I can share more of that beyond advising a student on it. So I've been saying that for a while. Do you
顺便问一句,你喜欢她做的模型规范相关工作吗?好的。
just as a side side note, do you like the model spec stuff that she's doing? Yeah. Okay. Yeah.
那个研究方向。是的,我早期就是其拥护者。实际上正是因为她注意到我是最初发布时唯一报道此事的人——大概一年多前吧。
That trajectory. Yeah. So I've been an early fan of that. I mean, that's how she noticed me: I was, like, the only person that covered it when they first released it. I think it was, like, over a year ago.
我当时就很欣赏这个方向。
I was I liked it.
确实...不过当时认同的人不多。
I Yeah. Well, not many people did.
明白。好吧好吧,你是第一个发现的。
Okay. Alright. Alright. You were first.
不确定...但她确实是这么跟我说的。我们之前有过一个模型...
I don't know. I don't know. But, like, that's what she said to me. Well, we had a, you know, we had a model
规格讨论作为整个会议的压轴环节,对吧?就像这是我发出的信号,让大家注意这个部分。
spec talk close the whole conference. Right? Like, that was my sign of, like, pay attention to this, guys.
但它的真实性在于向开发者传递的信息——比如模型的发展方向具有开发价值。同时从监管角度看,我认为明确区分故意行为与训练错误至关重要。所以对模型透明度而言,这确实很棒。
But it's real because of what it signals to developers; it has a developer benefit of, like, where your model's going. And then also just, like, regulatory. I think it is very important to distinguish what is, like, an intentional behavior versus just, like, a training error. Okay. So I think for model transparency, it's really fantastic.
我说过模型规格比宪法实用得多。宪法更像是训练过程中给算法的中间产物,用来塑造目标模型。我们不会用宪法形式写下模型目标。
And I've said that, like, the model spec is much more useful than a constitution. A constitution is like an intermediate training artifact that you give to the training algorithm in order to get the model that you want. It's not necessarily, like... we don't write down our goals for the model in a constitution form.
顺便问下,你看过那份宪法吗?不太高明。他们讨论时加入了苹果的设计指南,还有联合国宣言之类的内容。
By the way, have you looked at the constitution? Not very smart. They talked about it. They put in, like, Apple's design guidelines. Yeah. But then also, like, the UN Declaration of Human Rights.
这个层面我见过,甚至不确定他们是否更新过。很怪异。我希望Anthropic能写个模型规格,不过不太乐观,但他们是下一个该跟进的。
So at this level, I've seen it. I don't even know if they've updated it. That's very odd. I hope that Anthropic will write a model spec. I'm not too optimistic, but they're the next domino to fall.
其实我的看法是:我提议得太晚了,OpenAI那边已经批准了演讲内容。我本想请他们比较OpenAI模型规范和Claude 4系统提示,后者是Anthropic最接近模型规范的东西。
Well, so my take on that, actually: I pushed for this too late, because OpenAI already approved the talk and all that. But I was gonna ask them to compare the OpenAI model spec to the Claude 4 system prompt, which is their closest thing to a model spec.
系统提示不完整,因为OpenAI模型规格里有些功能当前模型做不到,尤其是初期。比如刚发布时写着希望模型能讨论敏感话题甚至NSFW内容,这属于目标信号。他们坦言实现难度大,但理想模型就该如此。正如我说,这对多方利益相关者都有益。主要问题是缺乏相关基础研究论文。
The system prompt is incomplete, because OpenAI has things in the model spec that their model doesn't currently do, especially when they started. When they first released it, it was, like: we want the model to be able to engage on sensitive subjects, and maybe even NSFW is in their model spec. They were just signaling what they wanted to do, and they said this is very hard to implement because there are all these obvious risks, but in an ideal model where we can solve every problem, this is what we'd do. Which I think is good for, as I said, many different stakeholders. Mostly, my thing is that there hasn't been a good, like, foundational research paper on that.
要做的事很多。还涉及个性化问题——如果开源模型要胜出,关键在于每个人都能定制自己想要的模样。现在提供GPT-4.5这类产品,你可以用提示词调整。但如果微调比提示更有效,人人都能获得专属模型。
That's just a lot to do. It also runs into personalization and personality, which is, like: if open models are to win, part of it could be that everybody can have exactly the model they want. We're serving GPT-4.5-type things; you can prompt it. But if fine-tuning is more effective than prompting, everybody can have the model that they want.
这更像学术问题或开放生态问题,人们在更可能获胜的领域竞争,这是好事。
So it's a good it's it's like a an academic problem or an open ecosystem problem where people are fighting on the turf that feels more likely to win Yep. Which is good.
作为Ai2代表Olmo发言时,你是想赢下这个方向,还是只是在指导一名研究生做这个?
Is this somewhere where, speaking as Ai2 with Olmo, you want to win? Or are you just advising a grad student on it?
我认为目前这还不是一个差异化因素,但我非常愿意在这方面进行探索
I don't think it's a differentiating factor as yet, but I'm very open to working on
因为我觉得开源模型在角色扮演、个性化等应用场景有很强的潜力。毕竟人们找到自己的'二次元老婆'后就想一直拥有她——虽然这个说法有点戏谑。
it. Because I think like open models have a strong, you know, role play use case and, you know, like, character, personalization, all that stuff. Right? Especially because people, like, they find their waifu, they wanna keep their waifu. And, like, that's the derogatory term for it.
不过
But, like
我们确实讨论过这个问题。我认为Olmo的一部分定位应该是:一个容易被引导向用户所需方向的基础模型。我们在个性设定上会持略偏保守的立场。我之前复盘过OpenAI的模型规范,其中大多数条目我们都认同,比如在拟人化上保持保守。具体哪里有分歧?我不记得了。
I would say that we've definitely discussed it, and part of Olmo should be that it is a base model that's easy to take in the directions that you want. And we will have an opinion that is probably slightly conservative on personality. I mean, I've gone through the OpenAI model spec, and most of it we agree with, like being conservative on anthropomorphization. What did we disagree with? I don't remember.
几个月前做的评估。核心在于开放透明:如果我们训练开源权重的人格模型,绝不会刻意保留任何信息。我们的价值排序不同,更注重信息交换而非表面友善。
I did it a couple months ago. But a lot of it is like openness or transparency, which is like, if we're training an open weight model personality, like, not gonna withhold anything. Yeah. And we have a different hierarchy. So most of them are on, like, that type of information exchange rather than be kind.
比如OpenAI的模型准则就很温和,通篇都是'尊重用户'这类原则
Like, OpenAI's model spec is pretty agreeable, and if you read through it, it's, like, treat the user with respect and all these, like
像教育孩子一样。直接看规范就行
Raising kids that way. Just read the spec.
听起来有点蠢。但对研究者来说,最酷的是可以玩模型路由——从Hugging Face调用多个模型组合。开源工具在这方面的灵活性远超OpenAI产品,毕竟后者只能用自己的模型。
Yeah. It sounds kinda stupid. But then the last thing, for people doing research, is, like, wacky model-routing things, where you figure out a bunch of different models off Hugging Face to route to. Because an open-model tool could use way more models more easily than any OpenAI product, because OpenAI is restricted to OpenAI's models. Mhmm.
就像OpenRouter做的产品化尝试:他们的聊天窗口能根据使用数据自动选择最佳模型响应查询
Where, like, maybe, I don't know, OpenRouter is, like, I'm gonna make a product out of this, which is a router. OpenRouter actually does it, and they're like: our chat window knows the best model for your query based on all this usage that we have. Yeah.
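A toy version of the routing idea: classify the query, then dispatch to whichever model a (possibly usage-learned) table says is best for that category. The model names and the keyword classifier here are purely illustrative:

```python
# Toy query router: pick a model per task category, using a table that
# could in practice be learned from usage data. Names are illustrative.
ROUTES = {
    "code": "qwen-coder",
    "math": "deepseek-r1",
    "chat": "llama-instruct",
}

def classify(query: str) -> str:
    """Crude keyword-based task classifier (stand-in for a learned one)."""
    q = query.lower()
    if any(kw in q for kw in ("def ", "function", "bug", "compile")):
        return "code"
    if any(kw in q for kw in ("integral", "prove", "solve for")):
        return "math"
    return "chat"

def route(query: str) -> str:
    return ROUTES[classify(query)]

print(route("Fix this bug in my function"))   # qwen-coder
print(route("Prove the integral converges"))  # deepseek-r1
```

In a real product the classifier and the routing table would both be trained from logged outcomes, which is exactly the data advantage a hosted router accumulates.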
也有反向创业的,比如Martian(不是Diamonds),具体还有谁他应该清楚。
There's people that started other way, like, Martian, not Diamonds, I I don't know who else is. He would know.
有很多很多。
There's there's a bunch.
有很多。
There's a bunch.
是啊。所以我不确定这能否行得通。Hugging Face应该研究这个。这就像是个登月计划般的想法。你无法预知它何时会实现
Yeah. So I don't know if that would work. Hugging Face should work on it. It's, like, a moonshot idea. You don't know when it'll
鉴于你在Hugging Face的背景,Hugging Face究竟靠什么盈利?这是个被频繁调侃的经典问题。
Given your Hugging Face background, how does Hugging Face make money? This is a very common meme question.
我认为主要是企业级合作。他们是这么说的。没错。他们专注于自己的领域,提供支持。
I think mostly like enterprise deals. That's what they say. Yes. Like, they're doing their thing. They're supporting.
他们很刻薄。
They're mean.
是的。听着。他们很优秀。规模庞大。而且盈利。
Yeah. Look. They're they're great. They're big. They're profitable.
只是对大多数人来说不太明显。我
It's just not that obvious to most people. I
喜欢媒体模型的路由器概念。感觉存在一个长长的长尾,就像
like the router idea for media models. I feel like there's, like, a long tail of, like,
一个生成式的。像是,一个
a The generative. Like, a
风格供应商。这实际上很难找到。在技术方面,我觉得除非你有延迟或价格限制,否则就应该用最好的模型。即使我们在做缩略图时,我也是这样想的。比如我要移除某人的背景。
style supplier. Like, that is actually hard to find. On the tech side, I feel like, unless you're under some, like, latency or price constraint, you should just use the best model. Even when we're doing thumbnails, I'm like, okay, I'm trying to remove the background of somebody.
然后我去Replicate平台,发现有55种背景移除工具。
And it's like, I go on Replicate, and there's, like, 55 background removers. Yeah.
我直接用Adobe,因为它是个网站。
I just use Adobe because it's a website.
但那不行。Photoshop的模型在某些方面很差。但话说回来,比如我想生成一个图表来模仿某个东西,这时候哪个模型更擅长图表呢?
Well, that doesn't work. Like, the Photoshop model is bad on some things. But but again, it's like or I wanna generate a diagram to, like, mimic something. And it's like, well, which model is better diagrams? Yeah.
要知道这些信息很难找,因为基准测试里都没有。
You know, it's like, those are not easy to find because none of the benchmarks are
部分论点在于,如果蒸馏法效果很好,我们可以不断缩小蒸馏目标——最终得到非常专精的小模型。比如用1亿参数的模型实现Markdown转LaTeX的表格重排版。如果模型足够小,经济上就完全可行,因为成本近乎为零。
Part of the argument is that if distillation works really well, we could just keep making the target for distillation smaller and smaller, which is you have models that are very narrow. Right. And they're mimicking these huge models on something that's, like I don't know, like reformatting tables. It's like, can you do a table reformatter from markdown to LaTeX in a 100-million-parameter model? Like, if you get it small enough, that is really economically feasible because it's effectively free.
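To give a sense of how narrow that task is, here is a deterministic sketch of a markdown-to-LaTeX table reformatter hand-written rules, no model at all, and it only handles clean GitHub-style tables. The distilled small model in the argument above would be for the messy real-world cases this can't parse.

```python
def md_table_to_latex(md: str) -> str:
    """Convert a simple GitHub-style markdown table to a LaTeX tabular."""
    rows = [line.strip().strip("|").split("|")
            for line in md.strip().splitlines()]
    # Drop the |---|---| separator row and trim cell whitespace.
    rows = [[c.strip() for c in r] for r in rows
            if not all(set(c.strip()) <= {"-", ":"} for c in r)]
    ncols = len(rows[0])
    body = " \\\\\n".join(" & ".join(r) for r in rows)
    return ("\\begin{tabular}{" + "l" * ncols + "}\n"
            + body + " \\\\\n\\end{tabular}")

md = """
| name | size |
|------|------|
| OLMo | 32B  |
"""
print(md_table_to_latex(md))
```

Ragged rows, escaped pipes, nested formatting, and alignment markers are exactly where the rules break down and where a tiny distilled model would earn its keep.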
它能即时烙印执行。
It imprints and instantaneous.
我的反驳意见是:如果是图像编辑,就应该让模型全包。
My pushback on this is just, if you're doing image editing, 4o should do all of it.
话是这么说,但我觉得它其实...
Well, yeah. But I think it does, like It's
只是技术还没成熟。给五年时间,肯定能做到。所以何必现在费劲开发...
just we're just not there yet. Like, give it five years, it'll do it. Right? So why work on a
又何必开发路由器呢?你直接扩展4o就行。我想是的,没错。
router at all? You just scale up 4o. I guess it's yeah.
对吧?比如,这里的逻辑在哪。这就像是个临时方案。
Right? Like, show me where the logic is here. Like, this is like a temporary thing.
在设备上?在设备上。本地建模社区的规模其实比人们想象的小得多,因为开放模型的主要用途还是在API上。就像DeepSeek的API,很方便。
On device? On device. Like, the local modeling community, I think, is much smaller than people give it credit for because most of the use for open models is still in APIs. It's like the DeepSeek API. It's convenient.
这很方便。而且如果模型不多,总会有人以比大多数人自建更便宜的价格托管。这很现实。但确实存在需要本地运行的小众群体。
It's convenient. And it's like, if there aren't that many models, somebody's gonna host it for cheaper than most people's doing it themselves. That's pretty realistic. But there there is a small community that need local.
是啊。
Yeah.
最理想的情况是开放模型不仅能在长尾领域竞争,但这需要最大程度的转型。
The best outcome is if open models can compete on not just long tail things, but that takes the most transformation.
插一句,所以我当初抵制自己买GPU搭建集群就是因为这个。我觉得API能解决大部分需求。那些给我提供模型服务的人都在亏钱,我干嘛要自己搞?不过4090的价格过去一年翻倍了,所以搞本地模型反而赚钱了。
Side note, so I resisted building my own, like, buying my own GPUs, building my own cluster for this reason. I'm like, APIs will solve most of it. Like, people are losing money to serve me models. Why am I, you know, doing that? Except for the fact that 4090 prices have doubled in the last year, so you actually made money doing local models.
这怎么让你赚钱的?因为你的投资增值了?
How does that make you money? Because your investment goes up?
你卖显卡就能赚。
You're selling card and you do it.
所以你一边用着4090,它还在升值。有意思。
So as you use your 4090, it goes up. Interesting.
真该买张4090。
Should've bought a 4090.
我买了张4070。
I bought a 4070.
该死。
Damn it.
这是什么情况?好吧,那这样我就只能等着...比如,要是5090真的,你知道,大规模上市的话,我是不是该买一张。
What is this? Well, then it puts me until, like, should I buy, you know, a 5090 if it ever, you know, is widely available.
GTC那时候他们在搞限量发售。
GTC, they were doing the drops.
是啊。我
Yeah. I
知道。当时太疯狂了。人们都,像疯了一样跑向...我知道。跑去露营车那儿抢购。
know. It was crazy. People are, like running to I know. To the camper to buy it.
在我提出最后一个问题前还有其他话题吗?比如你的工作、RLVR之类的今日议题。我觉得
Any other topics before I give a closing question? Just generally your work, RLVR, like, topics of the day. I think
企业应该继续考虑发布开源模型,主要是为了公关和用户引流。既然OpenAI都在这么做,看来这是大势所趋。
companies should keep considering releasing open models, mostly for PR and onboarding. It seems like that's the way it's going if OpenAI is releasing one.
你对这事儿兴奋吗?你觉得这像是一种心理战术吗?
Are you excited about that? Do you feel like it's like a psyops like
OpenAI的模型会很出色。
The OpenAI model will be good.
我本来期待不是的,老兄。他们相当认真。它会成为
I expected No, dude. They're pretty serious. It'll be
某个规模类别和任务子集中的顶尖水平。OpenAI就擅长这类事情。你得给予他们应得的尊重。是的,超乎寻常。
best in class for some size category and some subset of tasks. That's like OpenAI only does things like that. You have to give them the respect they deserve. Yeah. Super capable.
而且,这是个巨大的,当更多人参与时OpenAI就会赢。所以这算是他们的胜利。是啊。不过,我
And like, that is a big, like like, open wins when more people are doing it. So like, that's that's a win. And Yeah. Well, I
希望他们能真正公开技术细节,而不仅仅是权重参数。
mean, hopefully, they are actually open about the techniques and not just the weights.
你们认为开源模型的规模能透露他们将要构建的硬件信息吗?
Do we think the size of the open model tells us anything about the hardware that they're gonna build?
不,怎么会?
No. What?
不会。他们对此极其保密。这就是为什么他们没发布GPT-3.5之类的东西——因为会泄露内部信息或计划。
No. They're so secretive about this. That's why they haven't released GPT-3.5 or anything because it's too revealing about internal stuff or plans.
哦好吧。不,所以我...你是指星际之门项目?还是哪种硬件?
Oh, okay. No. So are you talking about Stargate? Or what kind of hardware?
乔尼·艾夫。
Jony Ive.
不。对。问题是那是个不同的设备形态。对。就是这样。
The no. Yeah. The thing is that's a different form factor. Yeah. That's it.
对。对。
Yeah. Yeah.
我觉得那东西会在云端运行。反正不认为它能本地运行。
I think that thing will run on the cloud. Don't think that'll run local anyways.
好吧。我们得谈谈这个。好像每次播客我们都在讨论这个。显然今天的新闻——你应该也看到了——是说他们因为某个耳戴设备被起诉了。但我认为那种耳戴形态其实挺不错的。
Well, okay. We have to talk about it. Like, seems like every podcast we talk about it. So apparently, the news from today, which I think you were looking at, was that it was, like, an ear device that they got sued over or whatever. But, like, I think that ear form factor is pretty good.
说真的,我和B讨论过这个问题:这技术最终会发展成什么样?你希望AI能听见你听见的声音,而你在哪里接收这些声音?耳朵上。基本就是这样。不知道你们对可穿戴设备的发展方向有什么看法。
Like, I actually did get there with b in terms of, like, where does this ultimately go? Like, you want the AI to hear what you hear, and where do you hear what you hear? On the ear. Like, that's pretty much it. I don't know if you guys have, like, thoughts on wearables and where that goes.
我尽量...我觉得它知道得太多了。这真的是我的...
I try to be. I think it just knows too much. That's really my
但你需要给它提供上下文。
But you want to give it context.
是啊。我对隐私抱有虚假的希望。像大多数人一样——这才是关键。人们其实并不真正关心隐私。
Yeah. I have false privacy hopes. I think, like, a lot of people I mean, that's the whole thing. It's like people don't actually care about privacy.
这只是做笔记,你知道的。就是非常好的记忆辅助。
It's just note taking, you know. It's just really good memory.
我觉得Meta雷朋的形态不错。但不算大众市场。如果能做成AirPod那种尺寸,市场会大得多——原因显而易见。不过太阳镜形态确实是可行的方案。好吧。
I think the Meta Ray-Ban form factor is good. I don't think it's as mass market. It's like, if you get it in the AirPod-size form factor, it's a way bigger market for obvious reasons. But the sunglasses form factor is the thing that works, I think. Okay.
我不用它们来做AI,但它们可以适配AI来运作。
I don't use them for AI, but they can fit the AI into them to work.
就像,确实。从经验来看,显然它是有效的。
Like, yeah. Empirically, yeah, it it obviously works.
没错。
Yeah.
酷。我留到最后的这个问题是关于Meta的整体动向。记得你在四月份发过一篇挺有意思的帖子,说Llama 4发布时Meta按下了恐慌按钮。
Cool. Well, the last question I was saving up was this whole, what is Meta doing? You know? You actually had a pretty interesting post back in when was this? In April, you said with Llama 4, Meta just pushed the panic button.
我觉得当时他们其实没真按,但现在是真的按下了恐慌按钮。
I feel like back then, it didn't actually push the panic button, but now they really pushed the panic button.
这说法公允。当时的恐慌按钮是指他们发布的并非LMSYS模型这件事,加上发布日选择之类的古怪操作。但号称开放却不发布主打模型,这纯粹是执行失误。严重的执行失误。
That's fair. I think the panic button at the time was the whole LMSYS model not being the model that they released thing, along with a bunch of weirdness, like the day of the week they released. But to be a lab that claims to be open and then not release your leading model is just, like that is bad execution. Bad execution. Yeah.
最近的情况基本可以归结为:人才成本远低于GPU。到最后他们会发现,与其砸钱买硬件,不如投资顶尖人才——反正总得试试。
Yeah. Which is fine. And then the recent stuff, I think, mostly can be boiled down to: talent is cheaper than GPUs by a dramatic margin. And at the end of the day, it's like, okay, if we're spending this much they go into a room, they stare in the mirror, and they're like, wait, it might not actually be that ridiculous to spend this money on the top people. It's like, might as well try.
他们早把钱砸在VR上了。
They already spent it on VR.
迟早有人会这么做。如果是苹果或其他公司决定入场,他们也会采取和Meta完全相同的策略。
Somebody was bound to do this eventually. And it makes sense that it's like, if Apple somehow decided, like, we're gonna do this, they're gonna come in and do exactly what Meta is doing.
他们需要创始人模式的CEO,能拍板说'管他呢,亏就亏'。我在想Meta与其投资VR,不如搞RLVR...不过问题在于,多数研究者会拿着高薪欣然跳槽。
They need a founder mode CEO who's like, screw it, you know, we'll take the L. The thought that occurred to me is, you know, instead of spending on VR, Meta should spend on RLVR. But I think the question is, a lot of some researchers, like most people will take a payday and happily move.
每个人都有一个贿赂数字。对吧。只不过这个数字真的很大。
Everybody has a bribe number. Right. It's just the number is really big.
是啊。但我觉得有些研究者对‘伟大人物研究理论’这个概念感到不适,比如你必须支付这么多钱才能获得这种级别的人才,而人才分布显然是不均衡的。
Yeah. But, like, I think some researchers are uncomfortable with the idea that this is sort of the great man theory of research that, like, you have to pay this much to get this level of talent and The talent is definitely distributed. Yeah.
没错。那些值得高薪聘请的人往往有自信重构事物或重复某些工作,无论你称之为感知AGI还是建造事物的驱动力——这与硅谷历史上许多传说并无本质不同。关键在于有愿景并愿意执行的人,他们能预见未来。这类人能带来巨大改变,前提是你要清除官僚主义。
Right. A lot of the people that they would be paying this much have the confidence to redo things or to just do some of the same things. And, like, whether you call it feeling the AGI or just drive to build things feeling the AGI is not that different from a lot of things that have existed in Silicon Valley lore in the past. It's just people with the vision that are willing to execute on it, and they see something coming. And, like, those people make a big difference. I think you have those people and you remove bureaucracy.
Meta其实拥有或能够获取大量技术型研究人才。但这就像不断回收利用,对个人和组织士气损耗很大。不过这就是他们的策略。
Getting technically talented researchers is actually something that Meta has a lot of or has the ability to get a lot of. So it's, like, a lot of recycling, which is very hard on individuals and the morale of an organization. But, like, I understand the approach.
确实。明白了。我就说这么多。
Yeah. For sure. Cool. That's all I have.
最后谈谈你打算如何打造美国版深度求索?那条推特挺有意思的。
Any parting thoughts on how you're gonna build the American DeepSeek? That was a nice tweet.
如果要问我十年目标,其实我只有两到五年的规划。随着模型向智能体转变,我认为扩展速度正在放缓。要实现类似美国深度求索的目标,固定成本和路径已经明确。其实完全开源就不必限定美国——只要可自由修改,虽然还有些环节需要突破。
Yeah. Mostly if you were asking me, like, what my ten-year goal is it's like, I only have a two-to-five-year goal, where I think, as models are shifting more towards agents, scaling is slowing. It's like there's a sense of a fixed cost and a fixed path to getting towards something like an American DeepSeek. Or mostly, I would say it doesn't have to be American if it's fully open. If you have everything and you can modify it which is, like, there's a few things that need to fall.
本质上需要更多资源,但粗略看OLMo 32B已接近原始GPT-4水平,而且完全开源。不过还要经历几个阶段:这个密集模型需要变成稀疏MoE,需要更多GPU扩展规模,最后实现大规模推理。
A lot of it is just more resources, but it's like OLMo 32B is, if you squint, like, original GPT-4 level and fully open. And there's a few levels that you need to go through. Like, that's obviously a dense model. It needs to be taken to a sparse MoE, and you need to scale it. You need to have a lot more GPUs, and then you need to do, like, large-scale reasoning.
这就是我的目标方向。但涉及如何与AI协作的复杂性——AI需要达成什么条件才能实现目标?这非常困难。
It's like, that's the goal that I want to do. There's a lot of complexity in navigating, like, how to work with AI like, what does AI need to do to get there? It's very hard.
嗯。作为非营利机构确实难以获取资源,构建模型需要协调多方力量。深度求索的成功在于他们拥有杰出人才,而OpenAI也长期留住了许多优秀研究者。
Mhmm. I think that I mean, it's a nonprofit. It's hard to get the resources, and building a model is a lot of aligning a lot of different people. That's the DeepSeek story is they have great people. OpenAI has kept a lot of really good people for a long time.
Anthropic现在聚集了很多优秀人才,这就像是要解决一系列渐进式的艰难技术问题,需要层层突破。这就是我想在未来几年内实现并推进的事情,但达到目标并不容易。所以关键在于,AI2的最佳情况是AI2会涉足其他领域。你不能仅仅运营一个非营利组织或公司,声称我们的目标是三年内打造一个美国版的深度求索。没人会持续为这种空洞承诺买单,因为你需要讲一个更有说服力的故事。
Anthropic has gotten a lot of good people right now, and it's, like, a lot of incremental hard technical problems that you need to stack up. Like, that's what I would like to do and make work in the next couple years, but it's not easy to get there. So that's the pitch it's like, AI2's best-case scenario is AI2's gonna do other things. Like, you can't just run a nonprofit or a company that says our goal is, in three years, to have an American DeepSeek. Like, no one's gonna keep paying the bills on that because you have to tell a better story.
但这正是我想努力的方向。我相信AI2在此过程中还会实现更多有趣的事情,比如产品层面的创新——虽然不确定是否一定是产品,但更像是探索AI领域的前沿,为某些需求设计全新架构。
But that's, like, what I would like to do in that. And I'm sure AI2 will do many more interesting things along the way. Like product stuff. I don't know if it's necessarily product, but, like, what are cutting-edge things in AI where we can make a new architecture for certain things? Okay.
或是展示开源模型如何更好地运作,无论涉及私有数据还是天马行空的创意,这些都可能让我们突破Transformer的既定轨道。我认为要引领AI发展,持续进行这类探索是必要的。
Or, like, what are demos of open models working better, whether you have, like, private data or something, or just far-out ideas that could take you off the transformer trajectory. I think that, like, you still need to be doing these to kind of lead in AI.
感谢你们为真正开源的AI如此努力。是的,这过程很有趣。
Thank you for working so hard on truly open source AI. Yeah. It's fun.
我觉得这能让价值观与实际行动更容易保持一致。如果更多事物保持开放,世界会变得更美好,所以很大程度上需要靠我们主动推动。看到OpenAI的行动或承诺,我视之为即将到来的胜利。
I think it's I mean, it makes it easy to align, like, values with what you're doing. Yeah. Like, it'd be better for the world if more things are open, and therefore a lot of it is just willing it into existence. And I take seeing what OpenAI is doing or saying they're gonna do as, like, hopefully a win coming soon. Yeah.
就像深度求索是最意想不到的突破,它引发了连锁反应。是的,我认为这就是前进的道路,我们需要观察后续发展。
Like, DeepSeek was the most unexpected win that made some other dominoes fall. Oh, yeah. I think that is the path forward and we'll see what it takes.
非常感谢。
Thank you so much.
感谢你的参与。
Thanks for coming on.