Latent Space: The AI Engineer Podcast - 人工分析:独立LLM分析机构——与George Cameron和Micah Hill-Smith对话

人工分析:独立LLM分析机构——与George Cameron和Micah Hill-Smith对话

Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

本集简介

别错过George的AIE演讲:https://www.youtube.com/watch?v=sRpqPgKeXNk

从悉尼地下室启动的副业项目,到成为AI基准测试领域公认的独立黄金标准——开发者、企业及所有主流实验室都依赖Artificial Analysis来应对爆炸式增长的模型、供应商与能力图谱。George Cameron与Micah Hill-Smith用两年时间打造了这个能解答其他平台无法回答问题的平台:哪种模型真正适合你的使用场景?速度与成本的实际权衡点在哪里?所谓"开源"究竟开放到什么程度?

我们探讨:

- 起源故事:2023年Micah开发法律AI助手时作为副业创建,2024年1月公开上线,经Swyx转发后爆红
- 为何自主运行评估:实验室采用不同提示方式,精选思维链示例(Google Gemini 1.0 Ultra用32-shot提示在MMLU击败GPT-4),且自报数据存在虚高
- 神秘客政策:使用非自有域名注册账户,匿名运行智能与性能测试,防止实验室在私有端点提供差异化模型
- 盈利模式:企业基准洞察订阅服务(模型部署标准化报告/无服务器vs托管vs芯片租赁),及AI公司定制化私有基准测试(公开排行榜无需付费)
- 智能指数V3:融合10项评估数据集(MMLU/GPQA/智能体基准/长上下文推理)为单一分数,通过重复测试提供95%置信区间
- 遗漏指数(幻觉率):-100到+100评分(惩罚错误答案,奖励"我不知道"),Claude系列虽非最聪明但幻觉率最低
- GDP Val AA:其自有版本的OpenAI GDPval测试(44项白领任务含电子表格/PDF/PPT),通过Stirrup智能体框架运行(100轮交互/代码执行/网络搜索/文件系统),由Gemini 3 Pro作为LLM裁判评分(经广泛测试,无自偏好偏差)
- 开放指数:从预训练数据透明度/后训练数据/方法论/训练代码/许可协议5个维度进行0-18分评分(AI2 OLMo 2领先,Nous Hermes与NVIDIA Nemotron次之)
- AI成本微笑曲线:GPT-4级智能成本比发布时低100-1000倍(归功于Amazon Nova等小模型),但智能体工作流的前沿推理模型成本达历史新高(稀疏性/长上下文/多轮交互)
- 稀疏性可能跌破5%:GPT-4.5约5%激活参数,Gemini约3%,遗漏指数准确率与总参数量(非激活量)正相关,预示巨型稀疏模型是未来
- token效率vs轮次效率:GPT-5单token成本更高但用更少轮次解决Tau测试(总成本更低),模型正优化token使用时机(5.1 Codex的token分布更集中)
- 智能指数V4即将发布:新增GDP Val AA/Critical Point/幻觉率指标,淘汰部分饱和测试(小模型现可轻松通过类HumanEval的编程测试)

——

Artificial Analysis 官网:https://artificialanalysis.ai
George Cameron的X账号:https://x.com/georgecameron
Micah Hill-Smith的X账号:https://x.com/micahhsmith

章节

00:00:00 开场:闭环时刻与Artificial Analysis起源
00:01:19 商业模式:独立性与收入来源
00:04:33 起源故事:从法律AI到基准测试需求
00:16:22 AI Grant与迁居旧金山
00:19:21 智能指数演进:V1到V3
00:11:47 基准测试挑战:方差/污染/方法论
00:13:52 神秘客政策与保持独立
00:28:01 新基准:检测幻觉的遗漏指数
00:33:36 Critical Point:硬核物理题与研究级推理
00:23:01 GDP Val AA:真实工作任务的智能体基准
00:50:19 Stirrup智能体框架:开源智能体系统
00:52:43 开放指数:超越许可证的模型透明度衡量
00:58:25 微笑曲线:成本下降与支出上升并存
01:02:32 硬件效率:Blackwell进展与稀疏性极限
01:06:23 推理模型与token效率:光谱初现
01:11:00 多模态基准测试:图像/视频/语音竞技场
01:15:05 展望未来:智能指数V4与方向
01:16:50 终场:对智能永不满足的需求

双语字幕

仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。

Speaker 0

从某种意义上说,这对我们来说是一个圆满的循环时刻。

This is kind of a full circle moment for us in a way.

Speaker 0

因为,是的。

Because Yeah.

Speaker 0

第一次在播客中提到Artificial Analysis,就是你和Alessio在《Latent Space》上。

The, like, first time artificial analysis got mentioned on a podcast was you and Alessio on Latent Space.

Speaker 1

太棒了。

Amazing.

Speaker 1

那是在2024年1月。

Which was January 2024.

Speaker 1

我都记不清自己做过那次了,但没错。

I I don't even remember doing that, but yeah.

Speaker 1

那对我影响非常大。

It was it was very influential to me.

Speaker 1

是的。

Yeah.

Speaker 1

我正在查看2024年1月17日或1月16日的AI新闻。

I'm looking at AI News for Jan seventeen or Jan sixteen twenty twenty four.

Speaker 1

我说,这个绝佳的模型与主持人对比网站刚刚上线,然后我放了几张截图。

I said, this gem of a models and host comparison site was just launched, and and then I put in a few screenshots.

Speaker 1

我说,这是一个独立的第三方平台。

And I said, it's an independent third party.

Speaker 1

它清晰地展示了质量与吞吐量之间的权衡。

It clearly outlines the quality versus throughput trade off.

Speaker 0

嗯。

Mhmm.

Speaker 1

它按模型和托管提供商进行了细分。

And it breaks out by model and hosting provider.

Speaker 1

我之前还笑话你们漏掉了Fireworks。

I did give you shit for missing Fireworks.

Speaker 1

没有Fireworks,你怎么能做模型基准测试呢?

And how do you have a model benchmarking thing without Fireworks?

Speaker 1

但你们当时已经收录了Together,也收录了Perplexity,我想我们就是从那时开始聊起来的。

But you had Together, you had Perplexity, and I think we just started chatting there.

Speaker 1

欢迎乔治和米卡来到Latent Space。

Welcome, George and Micah, to Latent Space.

Speaker 1

我一直关注着你们的进展。

You've I've been following your progress.

Speaker 1

恭喜你们取得了辉煌的一年。

Congrats on an amazing year.

Speaker 1

你们真的已经联手成为AI领域公认的新晋"Gartner"。

You guys have really come together to be the presumptive new Gartner of AI.

Speaker 1

好的。

Okay.

Speaker 1

我该怎么付钱给你们?

How do I pay you?

Speaker 1

我们直接进入正题吧。

Let's get right into that.

Speaker 1

你们是怎么赚钱的?

How do you make money?

Speaker 0

很高兴能谈谈这个。

Well, very happy to talk about that.

Speaker 0

过去几年真是一段漫长的旅程。

So it's been a, like, big journey the last couple of years.

Speaker 0

Artificial Analysis 将在2026年1月迎来两周年,时间过得真快。

Artificial analysis is gonna be two years old in January 2026, which is pretty soon now.

Speaker 0

我们最初当然是免费提供网站,免费分享大量数据,帮助开发者和公司了解人工智能,就模型、供应商以及AI堆栈中的各种技术做出决策。

We first run, like, the website for free, obviously, and give away a ton of data to help developers and companies navigate AI and make decisions about models, providers, technologies across the AI stack for building stuff.

Speaker 0

我们非常致力于持续做这件事。

We're very committed to doing that and intend to keep doing that.

Speaker 0

在此过程中,我们建立了一个运转得相当可持续的业务。

We have, along the way, built a business that is working out pretty sustainably.

Speaker 0

我们现在有二十多名员工。

We've got just over 20 people now.

Speaker 0

有两个主要的客户群体。

And two main customer groups.

Speaker 0

我们希望成为企业获取AI数据和洞察的首选来源。

So we wanna be who enterprise look to for data and insights on AI.

Speaker 0

我们希望帮助他们做出关于模型和技术的决策,以构建各种应用。

So we want to help them with their decisions about models and technologies for building stuff.

Speaker 0

另一方面,我们为整个AI栈中构建AI产品的公司提供私有基准测试服务。

And then on the other side, we do private benchmarking for companies throughout the AI stack who build AI stuff.

Speaker 0

没有人付费访问我们的网站。

So no one pays to be on the website.

Speaker 0

从一开始我们就非常明确这一点,因为如果我们做的不是独立的AI基准测试,那就毫无意义。

We've been very clear about that from the very start, because there's no use doing what we do unless it's independent AI benchmarking.

Speaker 0

是的。

Yeah.

Speaker 0

但事实上,我们的许多内容对构建AI产品的公司非常有用。

But turns out a bunch of our stuff can be pretty useful to companies building AI stuff.

Speaker 1

比如,我是一家财富500强企业,我需要关于客观分析的顾问,于是我联系你们,你们为我生成一份定制报告,然后到我办公室给我做一次研讨会。

And is it like, I'm a Fortune 500, I need advisors on objective analysis, and I call you guys and you pull up a custom report for me, you come into my office and give me a workshop.

Speaker 1

这种合作模式具体是怎样的?

What what what kind of engagement is that?

Speaker 2

我们提供一种基准测试洞察订阅服务,提供标准化报告,涵盖企业在理解人工智能并选择各种技术时面临的关键主题或挑战。

So we have a benchmarking insight subscription, which looks like standardized reports that cover key topics or key challenges enterprises face when looking to understand AI and choose between all the technologies.

Speaker 2

例如,其中一份报告是模型部署报告。

And so, for instance, one of the report is a model deployment report.

Speaker 2

如何思考在无服务器推理、托管部署解决方案,以及租赁芯片自行运行推理之间做选择,是大型企业面临的一种典型决策,而这种决策很难理清。

How to think about choosing between serverless inference, managed deployment solutions, or leasing chips and running inference yourself is an example of the kind of decision that big enterprises face, and it's hard to reason through.

Speaker 2

因为人工智能对所有人来说都还很新。

Like, this AI stuff is is really new to to everybody.

Speaker 2

因此,我们通过报告和洞察订阅服务,帮助企业应对这些挑战。

And so we try and help with our reports and insight subscription companies navigate that.

Speaker 2

我们还提供定制化的私有基准测试。

We also do custom private benchmarking.

Speaker 2

这与我们公开的公共基准测试非常不同,而且这方面没有商业模型。

And so that's very different from the public benchmarking that we publicize, and there's no commercial model around that.

Speaker 2

对于私有基准测试,我们有时会根据企业的需求创建并运行基准测试。

For private benchmarking, we'll at times create benchmarks, run benchmarks to specs that enterprises want.

Speaker 2

我们也会为那些已经开发出产品的AI公司这样做,通过我们主要通过支持公众公共基准测试所积累的专业知识,帮助他们理解自己所构建的东西。

And we'll also do that sometimes for AI companies who have built things, and we help them understand what they've built with private benchmarking, you know, through the expertise mainly that we've developed through trying to support everybody publicly with our public benchmarks.

Speaker 1

是的。

Yeah.

Speaker 1

让我们谈谈背后的技术栈。

Let's talk about tech stack behind that.

Speaker 1

但好吧。

But okay.

Speaker 1

我要从头说起,回到你们刚开始这个项目的时候。

I'm gonna rewind all the way to when you guys started this project.

Speaker 1

你们当时全都在悉尼。

You were all all the way in Sydney.

Speaker 0

嗯。

Yeah.

Speaker 0

对我来说,是澳大利亚的悉尼。

Well, Sydney, Australia for me.

Speaker 0

乔治当时在旧金山——他也是澳大利亚人,只是早已搬了过去。

George was in SF, but he's Australian, but he moved here already.

Speaker 1

嗯。

Yeah.

Speaker 1

我记得我跟你有过一次Zoom通话。

And I remember I had that Zoom call with you.

Speaker 1

当初启动人工智能分析的动因是什么?

What was the impetus for starting artificial analysis in the first place?

Speaker 1

你们一开始是从公开基准测试开始的。

You know, you started with public benchmarks.

Speaker 1

那我们就从这里开始,然后再谈私有的部分。

And so let's start there and we'll go to the private stuff.

Speaker 2

嗯。

Yeah.

Speaker 2

我们为什么不回溯一下,想想当初为什么觉得这事儿是必要的呢?

Why don't we even go back a little bit to, like, why we, you know, thought that it was that it was needed?

Speaker 0

嗯。

Yeah.

Speaker 0

这个故事大概要从2022年、2023年说起。

The story kind of begins, like, in 2022, 2023.

Speaker 0

乔治和我都早就对AI领域很感兴趣了。

Like, both George and I have been into AI stuff for quite a while.

Speaker 0

到了2023年,我正试图打造一个法律领域的AI研究助手。

In 2023, specifically, I was trying to build a legal AI research assistant.

Speaker 0

在当时那个阶段,它实际上表现得相当不错。

So it actually worked pretty well for for for its era, I would say.

Speaker 0

但你越是深入用LLM构建东西,就会发现你做的每一件事最终都变成了一个基准测试问题。

But was finding that the more you go into building something using LLMs, the more each bit of what you're doing ends up being a benchmarking problem.

Speaker 0

所以当时有一个多阶段的算法,试图为每个部分找出最简可行模型,不断优化每一个环节。

So had, like, this multistage algorithm thing, trying to figure out what the minimum viable model for each bit was, trying to optimize every bit of it.

Speaker 0

当你逐步构建时,你会考虑准确性、其他各种指标、性能和成本,但几乎没人会独立评估所有模型,更不用说去分析速度和成本之间的权衡了。

As you build that out, right, like, you're trying to think about accuracy, bunch of other metrics, and performance and cost, and mostly just no one was doing anything to independently evaluate all the models and certainly not to look at the trade offs for speed and cost.

Speaker 0

因此,我们基本上就是想打造一个工具,让开发者能查看所有模型和供应商在这些方面独立测量后的权衡结果。

So we basically set out just to build a thing that developers could look at to see the trade offs between all of those things measured independently across all the models and providers.

Speaker 0

老实说,我们刚开始做这个项目时,可能只是把它当成一个副业。

Honestly, it was probably meant to be a side project when we first started doing it.

Speaker 0

我们并没有聚在一起说:嘿,我们要放弃所有其他事情,把这个当成主业。

Like, we didn't, like, get together and say, like, hey.

Speaker 0

我们并没有聚在一起说:嘿,我们要放弃所有其他事情,把这个当成主业。

Like, we're gonna stop working on all that stuff, like, this gonna be our main thing.

Speaker 1

我第一次联系你的时候,我觉得你还没决定要创办公司。

When I first called you, I think you you hadn't decided on starting a company yet.

Speaker 0

这确实是真的。

That's actually true.

Speaker 0

我甚至觉得当时谈不上有那种打算,乔治还没辞掉工作呢。

I don't even think it would've been that. Like, George hadn't quit his job.

Speaker 0

我还没辞掉那个法律AI项目的工作。

I hadn't quit working on legal AI thing.

Speaker 0

这真的是个副项目。

Like, it it was genuinely a side project.

Speaker 2

是的。

Yeah.

Speaker 2

我们构建它是因为作为这个领域里的开发者,我们自己需要它,于是心想,其他人可能也会觉得有用,所以我们就买了个域名,链接到我们已有的Vercel部署上。

We built it because we needed it as people building in the space and thought, oh, other people might find it useful too, so we'll buy a domain and link it to the Vercel deployment that we had.

Speaker 2

是的。

Yeah.

Speaker 2

然后发了条推文宣传它。

And tweeted about it.

Speaker 2

但很快,它就开始引起关注了。

And but very quickly, it started getting attention.

Speaker 2

感谢SWIX,我认为你最初进行了转发并推广了我们发布的这个项目。

Thank you, SWIX, for, I think, doing an initial retweet and spotlighting it there, this project that that we released.

Speaker 2

但很快,它对其他人来说变得有用,而且随着发布的模型数量加速增长,它的价值也迅速提升。

And then very quickly, though, it it was useful to others, but very quickly, it became more useful as the number of models released accelerated.

Speaker 2

我们有了Mixtral 8x7B,它是一个关键的

We had Mixtral 8x7B, and it was a key

Speaker 1

这真是个有趣的模型。

That's a fun one.

Speaker 2

是的。

Yeah.

Speaker 2

这是一个真正改变格局的开源模型,让人们开始关注其他无服务器推理提供商,并思考速度和成本问题。

Like, a a open source model that really changed the landscape and opened up people's eyes to other serverless inference providers and thinking about speed, thinking about cost.

Speaker 2

因此,它很快就变得更加有用。

And so it became more useful quite quickly.

Speaker 1

是的。

Yeah.

Speaker 1

我喜欢和你们这样身处生态系统各处的人交谈,因为我对人们的需求有一些理论,但你们有数据。

What I love talking to to people like you who sit across the ecosystem is, well, I have theories about what people want, but you have data.

Speaker 1

这显然更相关。

And that's obviously more more relevant.

Speaker 1

但我想再多聊聊这个项目的起源。

But I wanna say on the origin story a little bit more.

Speaker 1

你们刚开始的时候,我认为当时的现状是,每篇论文都会发布自己的数据,并与竞争对手的数据进行对比,仅此而已。

When you started out, I would say, I think the the status quo at the time was every paper would come out and they would report their numbers versus competitor numbers, and that's basically it.

Speaker 1

我记得我做了大量的基础工作。

And I remember I did the legwork.

Speaker 1

我想每个人都有一个类似Excel或Google表格的文件,里面只是把每篇论文的数据复制粘贴上去,但有时数据对不上,因为它们是独立运行的。

I think I think everyone has some version of Excel sheet or Google sheet where you just like copy and paste the numbers from every paper and just post it up there, and then sometimes they don't line up because they're independently run.

Speaker 1

所以你们的数据看起来会更好,而你们复现的其他人数据则显得更差,原因可能是你们没有正确地复现他们的模型,或者各种借口。

And so your numbers are gonna look better than the your reproductions of other people's numbers is gonna look worse because you don't hold their models correctly or whatever whatever the the excuse is.

Speaker 1

我认为斯坦福的Helm项目,也就是Percy Liang的项目,也包含了一些这样的数据。

I think then Stanford Helm, Percy Liang's project will also have some some of these numbers.

Speaker 1

我不知道是否还有其他可以引用的来源。

And I don't know if there's any other source that you can cite.

Speaker 1

如果我在你们起步的同时创办Artificial Analysis,我会使用EleutherAI的评估工具(eval harness)。

The way that if I were to start artificial analysis at the same time you guys started, I would have used EleutherAI's eval harness.

Speaker 0

没错。

Yep.

Speaker 0

没错。

Yep.

Speaker 0

那确实很酷。

That was some cool stuff.

Speaker 0

归根结底,进行这些评估就像做简单的问答评估:你只是问一系列问题,然后检查答案是否正确,这本不该那么复杂。

At the end of the day, right, running these evals, it's like if it's a simple q and a eval, all you're doing is asking a list of questions and checking if the answers are right, which shouldn't be that crazy.
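The "ask a list of questions and check if the answers are right" loop described here can be sketched in a few lines. This is a hypothetical illustration, not Artificial Analysis's actual harness; `ask_model` is a placeholder for whatever inference API you call:

```python
# Minimal sketch of a simple Q&A eval: ask each question, compare the
# model's reply to the reference answer, report accuracy.
def run_eval(questions, ask_model):
    correct = 0
    for q in questions:
        answer = ask_model(q["prompt"])
        # Case- and whitespace-insensitive exact match on the reference.
        if answer.strip().lower() == q["answer"].strip().lower():
            correct += 1
    return correct / len(questions)

# Toy run with a hard-coded stand-in "model":
qs = [
    {"prompt": "2 + 2 = ?", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
]
print(run_eval(qs, lambda p: "4" if "2" in p else "paris"))  # 1.0
```

As the conversation goes on to note, the hard part is everything this sketch glosses over: prompting, parsing, and variance.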

Speaker 0

但事实证明,你需要控制的因素多得惊人。

But it turns out there are an enormous number of things that you've got to control for.

Speaker 0

我的意思是,当我们刚开始做这个网站时,其中一个我们意识到必须自己运行评估、而不能直接采用实验室结果的原因,就是他们对模型的提示方式各不相同。

And I mean, back when we started the website, like one of the reasons why we realized that we had to run the evals ourselves and couldn't just take results from the labs was just that they would all prompt the models differently.

Speaker 0

当你在几分之差上竞争时,你就可以

And when you're competing over a few points, then you can

Speaker 1

把答案输入到模型中

You can put the answer into the model

Speaker 0

进入问题本身。

into the problem.

Speaker 0

在极端情况下。

That in the extreme.

Speaker 0

比如,当年谷歌的Gemini 1.0 Ultra需要一个数字来证明它优于GPT-4,于是他们构建了我认为从未公开的32个思维链示例,覆盖MMLU的每个主题,以获取分数。

And, like, you get crazy cases, like back when Google did Gemini 1.0 Ultra and needed a number that would say it was better than GPT-4, and constructed (I think never published) chain-of-thought examples, 32 of them, for every topic in MMLU, to run it and get the score.

Speaker 0

像这样的事情太多了,你

Like, there are so many things that you

Speaker 1

他们从未发布Ultra版本。

They never shipped Ultra.

Speaker 1

对吧?

Right?

Speaker 1

这个,这个,是的。

This one this one Yeah.

Speaker 0

我的意思是,我相信它是存在的。

I mean, I'm sure it existed.

Speaker 0

但没错。

But yeah.

Speaker 0

所以我们非常确定,必须自己运行这些测试,并且以相同的方式在所有模型上运行。

So we were pretty sure that we needed to run them ourselves and just run them in the same way across all the models.

Speaker 0

是的。

Yeah.

Speaker 0

而且从一开始我们就非常确定,不能孤立地看待这些结果。

And we were we were also dead certain from the start that you couldn't look at those in isolation.

Speaker 0

你需要将它们与成本和性能数据结合起来看。

You needed to look at them alongside the cost and performance stuff.

Speaker 1

是的。

Yeah.

Speaker 1

好的。

Okay.

Speaker 1

有几个技术问题。

A couple technical questions.

Speaker 1

我的意思是,我也想过这个问题,但因为成本问题没做。

I mean, so obviously, I also thought about this and I didn't do it because of cost.

Speaker 1

你们有没有担心过成本?

Did you did you not worry about cost?

Speaker 1

你们已经获得资金支持了吗?

Were you funded already?

Speaker 1

显然没有,但你知道的。

Clearly not, but, you know

Speaker 0

没有。

No.

Speaker 0

我们当然在一开始并没有资金。

We we well, we definitely weren't at the start.

Speaker 0

所以,我的意思是,我们个人在初期承担了费用

So, I mean, we're paying for it personally at

Speaker 1

初期。

the start.

Speaker 1

这里涉及很多钱。

There's there's a lot of money here.

Speaker 0

好吧,几年前这些数字可没这么糟糕。

Well, the numbers weren't nearly as bad a couple of years ago.

Speaker 0

所以我们确实产生了一些成本,但我们在进行基准测试时的花费大概只有几百美元的量级。

So we, like, certainly, like, incurred some costs, but we were probably in the order of, like, hundreds of dollars of spend across the benchmarking that we're doing.

Speaker 1

好的。

K.

Speaker 1

所以没什么。

So nothing.

Speaker 1

是的。

Yeah.

Speaker 0

当时,还算可以。

It was, like, kind of fine.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

如今,由于种种我们可以细谈的原因,这笔花费已经大幅增加。

These days, that's gone up an enormous amount for a bunch of reasons that we can talk about.

Speaker 0

但没错,也没那么糟。

But, yeah, it it wasn't that bad.

Speaker 0

因为如果你还记得的话,我们当时处理的模型数量很少,而且我们想用来评估它们的复杂性也低得多。

Because if you can also remember that, like, the number of models that we were dealing with was hardly any, and the complexity of the stuff that we wanted to do to evaluate them was a lot less.

Speaker 0

我们只是问一些问答类型的问题。

Like, we were just asking some q and a type questions.

Speaker 0

最初在很多评估中,我们只是直接给出答案,而不让模型进行思考。

And then one specific thing was for a lot of evals initially, we were just, like, something an answer directly without letting the models think.

Speaker 0

我们最初甚至没有使用思维链的方法。

We weren't even doing chain-of-thought stuff initially.

Speaker 0

而这是最初获得一些结果的最有效方式。

And that was the most useful way to get some results initially.

Speaker 1

是的。

Yeah.

Speaker 1

对于没有做过这项工作的人而言,解析响应本身就是一个大问题。

And so for if for people who haven't done this work, literally parsing the responses is a whole thing.

Speaker 1

对吧?

Right?

Speaker 1

因为有时模型会以任何它们觉得合适的方式作答,有时它们确实有正确答案,但只是返回了错误的格式。

Like, because sometimes the models can answer any way they feel fits, and sometimes they actually do have the right answer, but they just return the wrong format.

Speaker 1

除非你把这种情况考虑进你的解析器中,否则它们会得零分,而这会增加更多工作量。

And they will get a zero for that unless you work it into your parser, and that involves more work.

Speaker 1

是的。

Yeah.

Speaker 1

所以,这里有一个悬而未决的问题:你是否应该因为模型没有遵循格式指令而扣分?

And so there I mean, but there's an open question whether you should give it points for not following your instructions on the format.

Speaker 0

这取决于你关注的是什么。

So It depends what you're looking at.

Speaker 1

对吧?

Right?

Speaker 1

嗯。

Yeah.

Speaker 1

因为你可以

Because you can

Speaker 0

如果你是想看看模型能否解决某种特定的推理问题,而不希望同时测试它对答案格式的处理能力,那么你可能希望使用一种LLM答案提取方法,以确保无论模型如何作答,你都能提取出答案。

if you're trying to see whether or not it can solve a particular type of reasoning problem, and you don't wanna test it on its ability to do answer formatting at the same time, then you might want to use an LLM answer extractor approach to make sure that you get the answer out no matter how it answered.

Speaker 0

但如今,这个问题已经不那么严重了。

But these days, it's mostly less of a problem.

Speaker 0

比如,如果你给模型明确的指令,并提供答案格式的示例,它就能按照你的格式作答,然后你就可以用简单的正则表达式来处理。

Like, if you instruct a model and give it examples of what the answer should look like, it can get the answers in your format, and then you can do, a simple regex.
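A minimal sketch of the regex extraction described here, assuming the model was instructed to reply in an `Answer: <letter>` format. The pattern and format are illustrative, not Artificial Analysis's actual parser:

```python
import re

def extract_answer(response: str):
    # Look for the instructed format, e.g. "Answer: B", anywhere in the
    # reply; take the last match so earlier chain-of-thought mentions of
    # the format don't win.
    matches = re.findall(r"Answer:\s*([A-D])", response, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else None

print(extract_answer("Let me think step by step... Answer: C"))  # C
print(extract_answer("The answer is probably B."))  # None: format not followed
```

The `None` case is exactly the earlier point about parsing: a model can hold the right answer and still score zero if the extractor doesn't account for how it replied.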

Speaker 1

嗯。

Yeah.

Speaker 1

嗯。

Yeah.

Speaker 1

然后还有一些其他问题,比如在多项选择题中,有时会偏向第一个答案,所以你必须对选项进行随机化。

And then there's other questions around I guess sometimes that if you have a multiple choice question, sometimes there's a bias towards the first answer, so you have to randomize the responses.
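The option-randomization fix mentioned here can be sketched as follows. This is a hypothetical illustration: shuffle the choices and track where the correct one lands, so a position bias (e.g. favoring "A") can't inflate or deflate the score:

```python
import random

def shuffle_options(options, answer_idx, rng):
    # Shuffle the answer options and remap the answer key so the
    # correct choice no longer sits at a fixed position.
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)  # where the correct option landed

opts = ["Paris", "London", "Rome", "Berlin"]
shuffled, correct = shuffle_options(opts, 0, random.Random(42))
assert shuffled[correct] == "Paris"  # remapped key still points at the answer
```

Seeding the RNG keeps the shuffle reproducible across repeated runs of the same eval.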

Speaker 1

所有这些细节,当你深入研究基准测试时,你会觉得:我不明白为什么有人会相信这些数据。

All these nuances, you'll like once you dig into benchmarks, you're like, I don't know how anyone believes the numbers on all these things.

Speaker 1

因为这简直就像黑魔法。

Because it's so it's so dark magic.

Speaker 0

你还会遇到不同基准测试之间的差异程度不同。

You've also got like the different degrees of variance and different benchmarks.

Speaker 0

对吧?

Right?

Speaker 0

所以,如果你在一个现代推理模型上运行四道题的多项选择题,使用实验室为其模型建议的温度参数,仅进行一次运行且题目数量较少时,你看到的方差会非常大。

So if if you if you run four question multi choice on a modern reasoning model at the temperature suggested by the labs for their own models, the variance that you can see on a four question multi choice eval is pretty enormous if you only do a single run of it and it has a small number of questions especially.

Speaker 0

所以,我们做的一件事是:在开发新的评估、升级智能指数以纳入新内容时,大量重复运行所有评估,以便确定合适的重复次数,达到我们满意的95%置信区间;这样在汇总时,我们就能确信智能指数至少精确到±1(95%置信水平)。

So, like, one of the things that we do is run an enormous number of all of our evals when we're developing new ones and doing upgrades to our intelligence index, so that we can dial in the right number of repeats and get to the 95% confidence intervals that we're comfortable with. That way, when we pull it all together, we can be confident in Intelligence Index to at least as tight as plus or minus one at 95% confidence.
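The repeat-and-confidence-interval idea can be sketched with a simple normal-approximation interval over repeated runs. The run scores below are made-up numbers, not real benchmark data, and the real methodology may differ:

```python
import statistics

def ci95(scores):
    # Mean accuracy across repeated runs, with a normal-approximation
    # 95% confidence half-width (1.96 standard errors of the mean).
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / len(scores) ** 0.5
    return m, 1.96 * se

# Eight repeats of the same eval on one model (hypothetical accuracies):
runs = [71.2, 69.8, 70.5, 72.1, 70.9, 69.5, 71.4, 70.6]
mean, half_width = ci95(runs)
print(f"{mean:.2f} ± {half_width:.2f}")
```

Since the half-width shrinks roughly with the square root of the number of repeats, "dialing in" repeats means adding runs until the interval is as tight as the target (e.g. ±1).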

Speaker 1

没错。

Yep.

Speaker 1

而且这直接按比例增加了成本。

And again, that just adds a straight multiple to the cost.

Speaker 1

哦,对了。

And Oh, Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

对吧?

Right?

Speaker 0

所以,这也是过去几年成本增长远超线性增长的众多原因之一。

So that's one of many reasons that cost has gone up a lot more than linearly over the last couple of years.

Speaker 2

我们在网站上公布了运行人工智能分析智能指数的成本。

We we report a cost to run the artificial analysis intelligence index on our website.

Speaker 2

而目前,这假设只重复一次。

And currently, that's assuming one repeat.

Speaker 2

是的。

Yeah.

Speaker 2

关于我们的报告方式,因为我们希望稍微说明一下指数的权重问题。

In terms of how we report it because we wanna reflect a bit about the weighting of the index.

Speaker 2

但我们的实际成本远高于我们在这里报告的数字,因为包含了重复计算。

But but our cost is actually a lot higher than what we report there because of the repeats.

Speaker 1

是的。

Yeah.

Speaker 1

是的。

Yeah.

Speaker 1

是的。

Yeah.

Speaker 1

而且这可能是真的,但我只是确认一下,你们和实验室之间没有特殊的优惠协议吧。

And probably this is true, but just checking, they you don't have any special deals with the labs.

Speaker 1

他们并不给予折扣。

They don't they don't discount it.

Speaker 1

你只能自费或使用你的客户资金支付。

You you just pay out of pocket or out of your your sort of customer funds.

Speaker 0

哦,情况比较复杂。

Oh, there there is a mix.

Speaker 0

所以

So

Speaker 1

所以问题在于,有时他们可能会给你一个特殊端点,这是

So the issue is that sometimes they may give you a special endpoint, which is

Speaker 0

100%。

100%.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 0

没错。

Exactly.

Speaker 0

所以我们专注于一切我们所做的工作,确保拥有最优秀的独立指标,并保证任何人都无法以任何方式操纵它们。

So we laser focused, like, on everything we do on having the best independent metrics and making sure that no one can manipulate them in any way.

Speaker 0

在过去几年里,我们开发了相当多的流程,以确保这一点得以实现,比如你提到的这种情况:当我们与某个实验室合作时,如果他们给我们一个私有端点来评估模型,那么这个黑箱背后的内容很可能与他们在公共端点上提供的内容并不相同。

There are quite a lot of processes we've developed over the last couple of years to make that true for, like, the one you bring up, like, right here of the fact that if we're working with a lab, if they're giving us a private endpoint to evaluate a model, that it is totally possible that what's sitting behind that black box is not the same as they serve on a public endpoint.

Speaker 0

我们对此非常清楚。

We're very aware of that.

Speaker 0

我们有一个所谓的‘神秘顾客’政策,并且我们对所有合作的实验室都完全透明:我们会使用非我们自己的域名注册账户,同时进行智能评估和性能基准测试,是的。

We have what we call a mystery shopper policy, and so and we're totally transparent with all the labs we work with about this, that we will register accounts not on our own domain and run both intelligence evals and performance benchmarks Yeah.

Speaker 1

这就是工作内容。

That's the job.

Speaker 0

在他们不知情的情况下进行,但迄今为止没有人对此提出过任何问题。

Without being identified, and no one's ever had a problem with that.

Speaker 0

因为事实上,行业内一个相当好的因素是,他们都希望相信,没有任何竞争对手能够操纵我们的做法。

Because, like, a thing that turns out to actually be quite a good factor in the industry is that they all want to believe that none of their competitors could manipulate what we're doing either.

Speaker 1

确实如此。

That's true.

Speaker 1

我从来没想过这一点。

I never thought about that.

Speaker 1

我以前在数据库行业工作,那里在基准测试方面有很多花招。

I've been in a database data industry prior, and there's a lot of shenanigans around benchmarking.

Speaker 1

对吧?

Right?

Speaker 1

所以我只是在心里一一梳理。

So I'm just kinda going through the mental laundry list.

Speaker 1

在这个花招类别里,我还漏掉了什么吗?

Did I miss anything else in in that in this category of shenanigans?

Speaker 1

入侵。

Intrusion.

Speaker 0

我的意思是,好吧。

I mean, okay.

Speaker 0

最大的一个,我觉得应该单独提出来,其实更多是一个概念性的问题,而不是直接的作弊行为。

The biggest one, like, that I'll break off, is more of a conceptual one actually than, like, direct shenanigans.

Speaker 0

被衡量的东西会成为人们努力瞄准的目标。

It's that the things that get measured become things that get targeted by

Speaker 1

努力做好工作。

the Good hard work.

Speaker 0

他们正在努力构建。

They're trying to build.

Speaker 0

对吧?

Right?

Speaker 0

没错。

Exactly.

Speaker 0

所以这并不意味着我们应该称之为作弊行为。

So that doesn't mean anything that we should really call shenanigans.

Speaker 0

我不是在说在测试集上训练。

Like, I'm not talking about training on test set.

Speaker 0

但如果你知道自己会在某个特定指标上被衡量,作为研究人员,你就可以做很多事来专门提升那一项;这些事最好也能对真实用户使用你产品的更广泛方式有所帮助,但并不一定如此。

But if you know that you're gonna be measured on a particular thing, if you're a researcher, there are a whole bunch of things that you can do to try to get better at that thing, which preferably are going to be helpful for the wider range of ways actual users wanna use the thing that you're building, but won't necessarily be.

Speaker 0

例如,现在的模型在解答竞赛数学题方面非常出色。

So for instance, the models are exceptional now at answering competition maths problems.

Speaker 0

这种类型的推理和工作与我们如何使用现代编码代理等确实有一定关联,但显然不是一一对应的。

There is some relevance of that type of reasoning, that type of work to, like, how we might use modern coding agents and stuff, but it's clearly not one for one.

Speaker 0

因此,我们必须意识到,一旦某个评估指标成为所有人关注的焦点,分数可能会提升,但这并不反映模型整体通用智能的真正进步。

So the thing that we have to be aware of is that once an eval becomes the thing that everyone's looking at, the scores can get better on it without there being a reflection of overall generalized intelligence of these models getting better.

Speaker 0

过去几年一直如此。

That has been true for the last couple of years.

Speaker 0

未来几年也会如此。

It'll be true for the next couple of years.

Speaker 0

要克服这个问题没有万能钥匙,唯一的办法是不断创造新东西,以保持相关性,并衡量对真实用户最重要的能力。

There's no silver bullet to defeat that other than building new stuff to stay relevant and measure the capabilities that matter most to real users.

Speaker 1

是的。

Yeah.

Speaker 1

我们也会聊聊你们正在开发的一些新东西,这很酷。

And we'll we'll cover we'll cover some of the new stuff that you guys are building as well, which is cool.

Speaker 1

你们以前只是运行别人的评估,但现在你们开始推出自己的了。

Like, you you used to just run other people's evals, but now you're coming out with your own.

Speaker 1

我认为,一旦你站在前沿,这确实是必经之路。

And I think, obviously, that is a a necessary path once you're at the frontier.

Speaker 1

现有的评估你们都已经用遍了。

You've exhausted all the existing ones.

Speaker 1

我想接下来我要跟你们谈的下一个关键点是AI Grant,你们决定加入并搬到这里。

I think the next point in history that I have for you is AI Grant, that you guys decided to to join and and move here.

Speaker 1

那感觉怎么样?

What's what was it like?

Speaker 1

我觉得你们应该是第二期吧?

I think you're you were in, like, batch two?

Speaker 0

第四批。

Batch four.

Speaker 0

第四批?

Batch four?

Speaker 0

好的。

Okay.

Speaker 0

我的意思是,很棒。

I mean, was great.

Speaker 0

纳特和丹尼尔显然非常出色,我们与他们一起参与AI资助计划的公司群体也非常出色。

Nat and Daniel are obviously great, and it's a really cool group of companies that we were in AI grant alongside.

Speaker 0

能请到纳特和丹尼尔加入,真的非常好。

It was really great to get Nat and Daniel on board.

Speaker 0

显然,他们在该领域已经做了大量出色的工作,合作过许多领先公司,并且与我们所追求的使命高度一致。

Obviously, they've done a whole lot of great work in the space with a lot of leading companies and were extremely aligned with the mission of what we were trying to do.

Speaker 0

我们并不像他们投资的许多其他AI初创公司那样典型,但他们非常认同我们想要实现的使命。

Like, we're not quite typical of, like, a lot of the other AI startups that they've invested in, and they were very much here for the mission of what we wanna do.

Speaker 1

他们有没有给出任何对你产生重大影响的建议?

Did they say any advice that really affected you in some way?

Speaker 1

或者,有没有哪个活动特别有影响力?

Or, like, were one of the events very impactful?

Speaker 0

这是个有趣的问题。

That's an interesting question.

Speaker 0

我的意思是,我依然清晰地记得那些来AI Grant炉边谈话的演讲者。

I I mean, I remember fondly a bunch of the speakers who came into fireside chats at AI Grant.

Speaker 1

那也是一份相当了不起的名单。

Which is also like a crazy list.

Speaker 0

是的。

Yeah.

Speaker 0

哦,是的。

Oh, yeah.

Speaker 0

是的。

Yeah.

Speaker 2

是的。

Yeah.

Speaker 2

我和纳特、丹尼尔交谈时,感受到一些关于创业过程中所面临挑战的内容,尤其是那些没有明确答案的问题,以及如何系统性地应对这些难题、做出艰难的决定。

I I I there was something about, you know, speaking to Nat and Daniel about the challenges of of of working through a startup and just working through the questions that don't have, like, clear answers and how to work through those kind of methodically and just, like, work through the hard decisions.

Speaker 2

他们在我们构建 Artificial Analysis 的过程中,给予了我们很好的指导。

And they're being great mentors to to us as we build artificial analysis.

Speaker 2

对我们来说,另一个好处是,同批的其他公司以及 AI Grant 中的其他公司,正在推动当前 AI 能力的边界。

Another benefit for us was that other companies in the batch and other companies in AI grant are pushing the capabilities of what AI can do at this time.

Speaker 2

因此,与他们保持联系,确保 Artificial Analysis 对他们有用,这对我们支持自身发展、思考如何进一步完善 Artificial Analysis 以持续服务于那些正在基于 AI 进行创新的人,起到了极大的帮助。

And so being in contact with them, making sure that artificial analysis is useful to them, has been fantastic for supporting us and working out how how should we build out artificial analysis to continue to being useful to those, like, you know, building on AI.

Speaker 1

我认为,某种程度上我对这一点持矛盾态度,因为你的目标受众某种程度上并不是那些显然站在前沿的 AI Grant 参与者。

I think, to some extent, I'm mixed opinion on that one because, to some extent, your target audience is not people in AI grants who are obviously at the frontier.

Speaker 0

是的。

Yeah.

Speaker 0

在某种程度上。

To some to some extent.

Speaker 0

但AI资助公司所做的很多工作,就是将实验室中出现的能力应用到整个应用构建栈中,推动这些能力的极限,这实际上使其中一些公司成为人工分析的典型重度用户。

But then so a lot of what the AI grant companies are doing is, taking capabilities coming out of the labs and trying to push the limits of what they can do across the entire stack for building great applications, which actually makes some of them pretty archetypical power users of artificial analysis.

Speaker 0

他们对我们做得好与不好的地方,以及希望我们下一步推出的功能,有着最强烈的看法。

They're some of the people with the strongest opinions about what we're doing well and what we're not doing well and what they wanna see next from us.

Speaker 0

因为当你现在构建任何一种AI应用时,很可能会使用多种不同的模型。

Because when you're building any kind of AI application now, chances are you're using a whole bunch of different models.

Speaker 0

你可能会相当频繁地为应用的不同部分切换模型,以优化准确性,并获得更好的速度和成本表现。

You're maybe switching reasonably frequently for different models and different parts of your application to optimize what you're able to to do with them at an accuracy level and to get better speed and cost characteristics.

Speaker 0

所以对他们中的许多人来说,不是。

So for many of them no.

Speaker 0

他们并不是我们的商业客户。

They're, like, not commercial customers of ours.

Speaker 0

我们并没有对网站上的所有数据收费,但他们绝对是我们的重度用户。

Like, we don't charge for all that data on the website, but they are absolutely some of our power users.

Speaker 1

那我们来谈谈评估本身吧。

So let's talk about just the the the evals as well.

Speaker 1

对吧?

Right?

Speaker 1

比如,你们是从MMLU和GPQA这类通用测试起步的。

Like, you sort of start out from the general, like, MMLU and GPQA stuff.

Speaker 1

接下来呢?

What's next?

Speaker 1

你们是怎么逐步构建出整体的指标的?V1版本是什么样的,你们又是怎么演进的?

How do you how do you sort of build out to the overall index, what was in v one, and how did you evolve it?

Speaker 0

好的。

Okay.

Speaker 0

首先,简单介绍一下背景:我们说的是人工智能分析指数,这是我们综合了目前10个不同的评估数据集得出的合成指标,用以提供一个我们相当有信心的、衡量模型智能水平的最佳单一数值。

So first, just, like, background, like, we're talking about the artificial analysis intelligence index, which is our synthesis metric that we pull together currently from 10 different eval datasets to give what we're pretty confident is the best single number to look at for how smart the models are.

Speaker 0

当然,它并不能说明全部情况。

Obviously, doesn't tell the whole story.

Speaker 0

这就是为什么我们发布了整个网站上的所有图表,以便深入每一个部分,查看各种权衡,但这个数值仍是目前最好的单一指标。

That's why we publish the whole website of all the charts to dive into every part of it and look at the trade offs, but best single number.
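
As a rough sketch of how a synthesis metric like this can work, the snippet below averages several per-eval scores into one number and derives a 95% confidence interval from repeated runs, as described for Intelligence Index V3. All scores, eval names, and the equal weighting are invented for illustration; Artificial Analysis's actual weighting and methodology are their own.

```python
# Illustrative sketch only: synthesize one index number from several eval
# scores, with a normal-approximation 95% interval across repeated runs.
import statistics

# Hypothetical per-eval scores (0-100) from three repeated runs of one model.
runs = [
    {"mmlu_pro": 78, "gpqa": 61, "agentic": 55, "long_context": 49},
    {"mmlu_pro": 79, "gpqa": 59, "agentic": 57, "long_context": 50},
    {"mmlu_pro": 77, "gpqa": 62, "agentic": 54, "long_context": 51},
]

# Equal-weight average across evals for each run (the real index may weight
# evals differently)...
per_run = [sum(r.values()) / len(r) for r in runs]

# ...then the mean and a 95% confidence interval across the repeated runs.
mean = statistics.mean(per_run)
sem = statistics.stdev(per_run) / len(per_run) ** 0.5
ci = (mean - 1.96 * sem, mean + 1.96 * sem)
print(round(mean, 2), [round(x, 2) for x in ci])
```

Repeating the runs is what turns a single noisy score into a number with an honest error bar, which is why the published index can quote a confidence interval.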

Speaker 0

所以目前,它包含了大量对行业非常重要的问答类数据集,比如你刚才提到的几个。

So right now, it's got a bunch of Q&A type datasets that have been very important to the industry, like a couple that you just mentioned.

Speaker 0

它还包含了一些代理型数据集。

It's also got a couple of agentic datasets.

Speaker 0

它还包括我们自己的长上下文推理数据集以及其他一些以用例为导向的内容。

It's got our own long context reasoning dataset and some other use case focused stuff.

Speaker 0

随着时间推移,我们最关注的、对AI能力越来越重要的方面,将首先围绕代理能力展开。

As time goes on, the things that we're most interested in that are gonna be important to the capabilities that are becoming more important for AI, what developers are caring about, are gonna be first around agentic capabilities.

Speaker 0

所以,不出所料,我们都非常喜爱我们的编码代理,以及模型在这类任务上的表现,同时我们也非常重视在其他类型工作中的类似表现。

So surprise, surprise, we're all loving our coding agents and how the models perform there, and doing similar things for different types of work is really important to us.

Speaker 0

将这些能力与具有经济价值的用例联系起来,对我们来说至关重要。

The linking to use cases, to economically valuable use cases are extremely important to us.

Speaker 0

此外,还有一些模型仍然难以很好处理的方面,比如在长上下文中的表现,这些不会消失,而是我们仍需持续评估的具体能力和用例。

And then we've got some of these things that the model still struggle with, like working really well over long contexts that are not gonna go away as specific capabilities and use cases that we need to keep evaluating.

Speaker 1

嗯。

Mhmm.

Speaker 1

但我认为我之前想说的是,v1 和 v2 之间的区别,以及它们是如何随时间演变的。

And the but I guess one thing I was driving was, like, the v one versus the v two and how that evolves over time.

Speaker 0

比如,我们是如何逐步调整指标,走到今天的。

Like, how like, how we've changed the index to where we are.

Speaker 1

我认为这反映了整个行业的变化。

And I think that reflects on the well, the change in the industry.

Speaker 1

对吧?

Right?

Speaker 1

没错。

Yep.

Speaker 1

这真是讲述这个故事的不错方式。

So that's a nice way to tell that story.

Speaker 0

如今,v1 几乎已经被所有发布的新模型完全超越了,因为像在 HumanEval 中编写 Python 函数这样的任务现在已经非常简单了。

Well, v one would be completely saturated right now by almost every model coming out because doing things like writing the Python functions in human eval is now pretty trivial.

Speaker 0

其实很容易忘记,过去两年我们已经取得了多大的进步。

It's easy to forget actually, I think, how much progress has been made in the last two years.

Speaker 0

我们显然一直在玩一个游戏,就是今天版本和上周版本、再上周版本之间的对比,以及当前前沿模型之间谁在这一周内拥有最好的、小于100亿参数的模型这类细微变化。

Like, we obviously play the game constantly of, like, today's version versus last week's version and the week before, and all of the small changes in the horse race between the current frontier and who has the best, like, smaller-than-10B model right now this week.

Speaker 0

对吧?

Right?

Speaker 0

这对许多开发者,尤其是旧金山这座城市里的人而言非常重要。

And that's very important to a lot of developers and people in especially in this particular city of San Francisco.

Speaker 0

但当你把时间拉回到几年前,回看当时我们评估模型的方法,如今即使是相当小的模型也能100%解决。

But when you zoom out to literally a couple of years ago, what we were doing to evaluate the models then would all be 100% solved by even pretty small models today.

Speaker 0

这正是推动每一层级智能成本下降的关键因素之一。

And that's been one of the key things, by the way, that's driven down the cost of intelligence at every tier of intelligence.

Speaker 0

我们稍后可以再深入聊聊。

We can talk about more in a bit.

Speaker 0

所以,从v1到v2再到v3,我们提高了难度。

So v one, v two, v three, we made things harder.

Speaker 0

我们覆盖了更广泛的应用场景,并试图更贴近开发者真正关心的问题,而不是像MMLU和GPQA那样仅限于问答类任务。

We covered a wider range of use cases, and we tried to get closer to things developers care about as opposed to, like, just the q and a type stuff that MMLU and GPQA represented.

Speaker 1

嗯。

Yeah.

Speaker 1

我不知道你是否有什么要补充的。

I don't know if you have anything to add there.

Speaker 1

或者我们可以直接展示一下这个基准测试,点击浏览并提问关于它的内容。

Or we could just go right into showing people the benchmark and, like, clicking around and ask asking questions about it.

Speaker 1

嗯。

Yeah.

Speaker 2

我们开始吧。

Let's do it.

Speaker 2

好的。

Okay.

Speaker 0

这将是一个很好的方式,来聊聊我们最近推出的一些新功能。

This would be a pretty good way to chat about a few of the new things we've launched recently.

Speaker 2

嗯。

Yeah.

Speaker 2

而且我认为我们需要思考一下我们想要推进的方向,我们希望推动基准测试。

And I think a little bit about the direction that we wanna take it, and we wanna push benchmarking.

Speaker 2

目前,智能指数和评估主要关注原始智能,但我们希望拓宽对智能的理解,我们可以讨论一下,我们新构建并合作开发的一些评估侧重于幻觉等话题。

Currently, the intelligence index and and and evals focus a lot on kinda raw intelligence, but we kinda want to diversify how we think about intelligence, and we can we can talk about it, but kinda new evals that we've kinda built and partnered on focus on topics like hallucination.

Speaker 2

我们还有很多主题是当前评估集没有涵盖但应该包含的。

And we've got a lot of topics that I think are not covered by the current eval set that should be.

Speaker 2

因此,我们希望把这些内容提出来。

And and so we wanna bring that forth.

Speaker 2

但在我们深入之前。

But before we get into that.

Speaker 1

所以对于听众来说,这里有一个时间戳。

And so so for listeners, just there's a time stamp.

Speaker 1

目前,排名第一的是 Gemini 3 Pro High,其次是 Claude Opus,得分为 70。

Right now, number one is Gemini 3 Pro High, then followed by Claude Opus at seventy.

Speaker 1

然后是 5.1 High,你们还没有 5.2,而 Kimi K2 Thinking 依然表现不错,真厉害。

Then 5.1 High, you don't have 5.2 yet, and Kimi K2 Thinking, wow, still hanging in there.

Speaker 1

所以这些是前四名。

So those those are the top four.

Speaker 0

这会让这个播客很快过时。

That will date this podcast quickly.

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.

Speaker 1

我的意思是,我非常喜欢。

I mean, I love it.

Speaker 1

我非常喜欢。

I love it.

Speaker 1

不。

No.

Speaker 1

不。

No.

Speaker 0

我说这是最好的

I'm I'm saying it's the best

Speaker 1

明年这个时候再说,真可爱。

time next year and go, how cute.

Speaker 2

嗯。

Yep.

Speaker 2

完全同意。

Totally.

Speaker 2

快速看一下这个没问题。

A quick view of that is okay.

Speaker 2

有很多。

There's a lot.

Speaker 1

我喜欢这个图表。

I love this chart.

Speaker 0

这个真是太喜欢了。

This is this is such a favorite.

Speaker 0

对吧?

Right?

Speaker 1

是的。

Yeah.

Speaker 0

而且,乔治或我在各种会议上的几乎每一场演讲,我们都会首先展示这张图,来讨论我们当前所处的历史时刻。

And almost every talk that George or I give at conferences and stuff, like, we always put this one up first to just talk about situating where we are in this moment in history.

Speaker 0

我认为,这正是我之前所说的‘拉远视角’并记住我们已经取得的巨大进步的视觉版本。

This, I think, is the the visual version of what I was saying before about the zooming out and remembering how much progress there's been.

Speaker 0

如果我们回溯到一年多前,在o one和Claude Sonnet 3.5出现之前,我们还没有推理模型或编码代理这样的概念,当时的局面完全不同。

If we go back to just over a year ago, before o one, before Claude Sonnet 3.5, we didn't have reasoning models or coding agents as a thing, and the game was very, very different.

Speaker 0

如果我们再往前追溯一点,那时我们正处于一个时代——当你看这张图表时,OpenAI在一年多的时间里都是不可撼动的。

If we go back even a little bit before then, we're in the era where when you look at this chart, like, OpenAI was an untouchable for well over a year.

Speaker 0

而且,你应该还记得那段时期,当时人们对人工智能是否具有竞争力还存在很多疑问。

And, I mean, you would remember that time period well of, like, there being very open questions about whether or not AI was going to be competitive.

Speaker 0

嗯。

Mhmm.

Speaker 0

就像彻底结束一样。

Like, full stop.

Speaker 0

无论是OpenAI会完全领先,还是只有少数几家前沿实验室能做点什么,其他人只能消费它们的API。

Whether or not OpenAI would just run away with it, whether we would have a few frontier labs and no one else would really be able to do anything other than consume their APIs.

Speaker 0

总的来说,我对最终形成的这个世界感到非常满意,这个世界是

I am quite happy overall that the world that we have ended up in is one where

Speaker 1

多模态的。

Multimodal.

Speaker 1

是的。

Yeah.

Speaker 0

当然。

Absolutely.

Speaker 0

而且在过去两年里,每个季度都变得更加竞争激烈。

And strictly more competitive every quarter over the last two years.

Speaker 1

是的。

Yeah.

Speaker 1

今年简直疯狂。

This year has been insane.

Speaker 2

是啊。

Yeah.

Speaker 2

你能看出来。

You can see it.

Speaker 2

这张包含了所有数据的图表目前很难阅读。

This chart with everything added is hard to read currently.

Speaker 2

是啊。

Yeah.

Speaker 2

上面的点太多了,但我认为这稍微反映了我们感受到的情况。

There's so many dots on it, but I think reflects a little bit, you know, what what we felt like.

Speaker 2

有多疯狂。

How crazy it's been.

Speaker 1

为什么默认是14?

Why 14 as the default?

Speaker 1

这是手动选择的吗?

Is that a manual choice?

Speaker 1

因为你在里面加入了ServiceNow,这些名字都比较不传统。

Because you got ServiceNow in there that are, you know, less less traditional names.

Speaker 2

是的。

Yeah.

Speaker 2

默认情况下,我们在图表和智能指数中突出显示的是这些模型,好的。

It's models that we're kinda highlighting by default in our charts, in our intelligence index Okay.

Speaker 2

这是在

Is where the

Speaker 1

你只是手动整理了一份清单。

You just have a manually curated list of stuff.

Speaker 2

是的。

Yeah.

Speaker 2

没错。

That's right.

Speaker 2

但其实并不是每个 Artificial Analysis 用户都知道,你可以自定义我们的图表,选择显示哪些模型

But something that I actually don't think every Artificial Analysis user knows is that you can customize our charts and choose which models

Speaker 1

这非常重要。

It's super important.

Speaker 2

是的。

Yeah.

Speaker 2

是的。

Yeah.

Speaker 2

所以如果我们去掉一些名字,就会变得更容易一些,是的。

And so if we you know, take off a few names, it gets a little easier to Yeah.

Speaker 2

是的。

Yeah.

Speaker 2

更容易阅读。

A little easier to read.

Speaker 1

是的。

Yeah.

Speaker 1

但你能看到01的跃升吗?我超爱这一点。

But you can I love that you can see the 01 jump?

Speaker 1

看看这个。

Look at that.

Speaker 1

这是2024年9月。

This is September 2024.

Speaker 2

还有 DeepSeek 的跃升。

And the DeepSeek jump.

Speaker 2

是的,没错。

That is Yeah.

Speaker 2

它已经接近OpenAI的领先地位了。

Which got close to OpenAI's leadership.

Speaker 1

他们简直太接近了。

They were so close.

Speaker 2

我觉得是的。

I think yeah.

Speaker 2

我们记得那个时刻。

We we remember that moment.

Speaker 2

实际上,就在去年这个时候。

Around this time last year, actually.

Speaker 1

是的。

Yeah.

Speaker 1

是的。

Yeah.

Speaker 1

同意。

Agreed.

Speaker 0

是的。

Yeah.

Speaker 0

嗯,几周前。

Well, a couple of weeks.

Speaker 0

DeepSeek V3 发布那天,正是新西兰的节礼日。

It was Boxing Day in New Zealand when DeepSeek V3 came out.

Speaker 0

我们一直关注DeepSeek以及其他一些不太为人所知的全球玩家,从2024年开始就对早期版本进行过评估。

And I like, we'd been tracking DeepSeek and a bunch of the other global players that were less known over, like, the 2024 and had run evals on the earlier ones and stuff.

Speaker 0

我清楚地记得新西兰的节礼日。

I I very distinctly remember Boxing Day in New Zealand.

Speaker 0

因为当时我和家人一起过圣诞节,同时在运行评估,不断收到关于DeepSeek v3的评估结果。

I because I was with family for Christmas and stuff, running the evals and getting back result by result on DeepSeek v three.

Speaker 0

这是他们v3架构的首个版本,即671B MOE。

So this was like the the first of their v three architecture, the six seven one b MOE.

Speaker 0

我们非常非常印象深刻。

And we were very, very impressed.

Speaker 0

就在那一刻,我们确信DeepSeek不再只是众多突然崛起的玩家之一。

Like, that was the moment where we were sure that DeepSeek was no longer just one of many players that had jumped up to be a thing.

Speaker 0

当他们随后在v3基础上推出RL并成功推出R1时,整个世界才真正注意到他们,那是在几周之后。

The world really noticed when they followed that up with the RL working on top of V3 and R1 succeeding a few weeks later.

Speaker 0

但这一切的基础,正是去年节礼日我们拥有的那个极其强大的基础模型——完全开源权重,当时它是我们拥有的最佳开源模型。

But the groundwork for that absolutely was laid with, like, just extremely strong base model, completely open weights, that we had as the best open weights model on Boxing Day last year.

Speaker 2

是的。

Yep.

Speaker 2

Boxing Day 是圣诞节后的那一天,对那些不熟悉的人而言。

Boxing Day is the day after Christmas, for those not familiar with it.

Speaker 0

不。

to No.

Speaker 1

我是新加坡人。

I mean, I'm I'm from Singapore.

Speaker 1

我们很多人记得 Boxing Day 是因为那场海啸。

A lot of us remember Boxing Day for for a different reason, for the tsunami that happened.

Speaker 0

哦,当然。

Oh, of course.

Speaker 1

是的。

Yeah.

Speaker 1

所以我当时在。

So I was yeah.

Speaker 1

嗯。

Yeah.

Speaker 1

但那是很久以前的事了。

But that was a long time ago.

Speaker 1

所以啊。

So yeah.

Speaker 1

所以这是AQI的大致内容吗?还是AII?

So this is the the rough pitch of AQI or is it AQI or AII?

Speaker 0

AII。

AII.

Speaker 0

好的。

So Okay.

Speaker 0

不过你的记忆力真不错。

Good good memory though.

Speaker 2

所以我知道。

So I know.

Speaker 0

我不太习惯这个说法。

I'm not used to it.

Speaker 0

以前我们确实叫它质量指数。

Once upon a time, we did call it quality index.

Speaker 0

好的。

Okay.

Speaker 0

我们以前会谈论质量、性能和价格,但后来改成了智能。

And we would talk about quality performance and price, but we changed it to intelligence.

Speaker 0

没错。

Yep.

Speaker 2

命名确实改过几次。

There's been a few naming changes.

Speaker 2

我们在网站上增加了硬件基准测试,并在系统层面设置了基准。

We added hardware benchmarking to the site and set benchmarks at a at a kind of system level.

Speaker 2

因此,我们将吞吐量指标改为了现在的输出速度,因为吞吐量在系统层面更有意义。

And so then we changed our throughput metric to what we now call output speed, since throughput makes sense at a system level.

Speaker 1

明白了。

Got it.

Speaker 1

采用了这个名字。

Got took that name.

Speaker 1

给我多讲讲这些图表。

Take me through more charts.

Speaker 1

比如,人们应该知道些什么?

Like, what should what should people know?

Speaker 1

你知道,你看待这个网站的方式,可能和初学者不一样。

You know, obviously, the way you look at the site is probably different than how a beginner might look at it.

Speaker 0

是的。

Yes.

Speaker 0

这很公平。

That that's fair.

Speaker 0

我们有很多有趣的内容可以深入探讨。

There's a lot of fun stuff to dive into.

Speaker 0

也许我们可以跳过所有那些,比如我们有很多很多视频和相关内容。

Maybe we can skip past a lot of it; like, we have lots and lots of videos and stuff.

Speaker 0

今天值得讨论的一些有趣内容,是我们最近的一些成果,我想可能还没多少人熟悉。

The interesting ones to talk about today that'd be great to bring up are, like, a few of our recent things, I think, that probably not many people will be familiar with yet.

Speaker 0

所以第一个是我们的全知指数。

So first one of those is our omniscience index.

Speaker 0

这个指标和我们通常运行的大多数智能评估有些不同。

So this one is a little bit different to most of the intelligence evals that we run.

Speaker 0

我们专门构建它来考察模型中嵌入的知识,并通过测试模型在不知道答案时是选择说‘我不知道’还是给出错误答案,来评估其幻觉情况。

We built it specifically to look at the embedded knowledge in the models and to test hallucination by looking at, when the model doesn't know the answer and so isn't able to get it correct, what its probability is of saying "I don't know" versus giving an incorrect answer.

Speaker 0

我们用于全知指数的评分范围是从-100到+100,因为每给出一个错误答案,就会扣一分。

So the metric that we use for omniscience goes from negative a 100 to positive a 100 because we're simply taking off a point if you give an incorrect answer to the question.

Speaker 0

我们非常确信,在这种情况下,这样做是最合理的,因为对于事实性问题,说‘我不知道’显然比给出错误答案更有帮助。

We're pretty convinced that this is an example of where it makes most sense to do that because it's strictly more helpful to say I don't know instead of giving a wrong answer to factual knowledge question.
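
The scoring rule described here can be sketched in a few lines. Everything below is an illustrative reconstruction, not Artificial Analysis's code: the grading labels and helper names are assumptions, but the incentive structure (+1 for correct, -1 for incorrect, 0 for abstaining, scaled to the -100..+100 range) follows the description above.

```python
# Hypothetical sketch of the Omniscience-style scoring rule.
# Each answer is graded "correct", "incorrect", or "abstain" ("I don't know").

def omniscience_score(gradings):
    """Correct: +1, incorrect: -1, abstain: 0; scaled to [-100, +100]."""
    points = {"correct": 1, "incorrect": -1, "abstain": 0}
    total = sum(points[g] for g in gradings)
    return 100 * total / len(gradings)

def hallucination_rate(gradings):
    """Of the questions the model did NOT answer correctly, what fraction
    did it answer wrongly instead of abstaining?"""
    not_correct = [g for g in gradings if g != "correct"]
    if not not_correct:
        return 0.0
    return sum(g == "incorrect" for g in not_correct) / len(not_correct)

# Guessing wrong loses a point while abstaining loses nothing, so under this
# rule saying "I don't know" strictly beats taking a shot on unknowns.
grades = ["correct"] * 40 + ["incorrect"] * 20 + ["abstain"] * 40
print(omniscience_score(grades))
print(hallucination_rate(grades))
```

With these invented grades, the net score is (40 - 20)/100 scaled to 20.0, and the hallucination rate is 20 wrong out of 60 non-correct, about a third.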

Speaker 0

我们的目标之一是改变评估体系对模型及其开发实验室所形成的激励导向,使其更倾向于获得更高分数。

And one of our goals is to shift the incentive that evals create for models and the labs creating them to get higher scores.

Speaker 0

到目前为止,人工智能领域几乎所有的评估都以正确率百分比作为主要指标,也是最受关注的焦点。

And almost every eval across all of AI up until this point, it's been graded by simple percentage correct as the main metric, the main thing that gets hyped.

Speaker 0

因此,你应该尝试回答每一个问题。

And so you should take a shot at everything.

Speaker 0

没有任何激励促使你说‘我不知道’。

There's no incentive to say, I don't know.

Speaker 0

所以我们在这里为此做了调整。

So we did that for this one here.

Speaker 1

我认为还有一个关于校准的普遍领域,即你对答案的信心与实际正确性之间的关系。

I think there's a general field of calibration as well, like the confidence in your answer versus the Yeah.

Speaker 1

答案的正确性。

Rightness of the answer.

Speaker 0

是的。

Yeah.

Speaker 0

我们完全同意。

We completely agree.

Speaker 2

嗯。

Yep.

Speaker 2

是的。

Yeah.

Speaker 2

关于这一点,我们没有将它纳入这个指标的一个原因是,我们认为正确的方法不是直接问模型它们有多自信。

On that, and one reason that we didn't do that is be or put that into this index is that we think that the the way to do that is not to ask the models how confident they are.

Speaker 1

我不知道。

I don't know.

Speaker 1

也许吧。

Maybe.

Speaker 1

但可能是这样。

It might be though.

Speaker 0

你这么说的话,即使

You put it like Even

Speaker 1

在JSON字段中,比如写上置信度,也许模型会输出一些内容。

a JSON field, say, confidence, and maybe it spits out something.

Speaker 1

嗯。

Yeah.

Speaker 1

你知道,这些年来我们做过几次评估播客。

You know, we have done a few evals podcast over the over the years.

Speaker 1

我们和Hugging Face的克莱门汀做过一期。

And we did one with Clementine of Hugging Face.

Speaker 1

克莱门汀是开源排行榜。

Clementine's the open source leaderboard.

Speaker 1

这是她最希望我们做的内容之一,即某种幻觉或置信度校准问题。

And this was one of her top requests, which is some kind of hallucination slash lack of confidence calibration thing.

Speaker 0

所以,嘿,这就是其中一个。

And so, hey, this is one of them.

Speaker 0

而且,就像我们做的任何事情一样,这并不是一个完美的指标,也不能涵盖你对幻觉的所有思考。

And I mean, like anything that we do, it's not a perfect metric or the whole story of everything that you think about as hallucination.

Speaker 0

嗯。

Mhmm.

Speaker 0

但没错,它非常有用,而且有一些有趣的结果。

But, yeah, it's pretty useful and has some interesting results.

Speaker 0

比如,我们在幻觉率中发现,位于这里最左侧的Anthropic Claude模型在我们用Omniscience评估的所有模型中幻觉率最低。

Like, one of the things that we saw in the hallucination rate is that Anthropic's Claude models, at the very left-hand side here, had the lowest hallucination rates out of the models that we've evaluated Omniscience on.

Speaker 0

这是一个有趣的事实。

That is an interesting fact.

Speaker 0

我认为这很可能与人们之前对某些Claude模型的喜爱有关,而这些喜爱之前并未被真正衡量过。

I think it probably correlates with a lot of the previously not really measured vibe stuff that people like about some of the Claude models.

Speaker 1

这个数据集是公开的吗?

Is the dataset public?

Speaker 1

或者有没有一个保留集?

Or what's is it is there a held out set?

Speaker 0

这个数据集有一个保留集。

There's a held out set for this one.

Speaker 0

所以我们发布了一个公开的测试集,但只发布了其中的10%。

So we we have published a public test set, but we we've only published 10% of it.

Speaker 0

原因是,对于这个特定的测试集,由于它完全是事实性知识问题,非常容易出现数据污染。

The reason is that for this one here specifically, it would be very, very easy to, like, have data contamination because it is just factual knowledge questions.

Speaker 0

我们会随着时间推移更新它,以进一步防止这种情况,但目前我们保留了大部分数据作为保留集,以确保长期的可靠性。

We will update it over time to also prevent that, but we've, yeah, kept most of it held out so that we can keep it reliable for a long time.

Speaker 0

这为我们带来了一系列非常有趣的应用,包括按主题进行非常细致的分析。

It leads us to a bunch of really cool things, including breakdown quite granularly by topic.

Speaker 0

目前我们已经在网站上公开了部分相关内容,未来还将进一步披露更多关于细分具体主题的能力。

And so we've got some of that disclosed on the website publicly right now, and there's lots more coming in terms of our ability to break out very specific topics.

Speaker 1

是的。

Yeah.

Speaker 1

我对此很感兴趣。

I would be interested.

Speaker 1

让我们再深入聊聊这个幻觉问题。

Let's let's dwell a little bit on this hallucination one.

Speaker 1

我注意到,Haiku 的幻觉率低于 Sonnet,也低于 Opus。

I noticed that Haiku hallucinate hallucinate is less than sonnet, hallucinate is less than opus.

Speaker 1

在正常的性能环境中,情况会反过来吗?

And would that be the other way around in a normal capability environment?

Speaker 1

我不知道。

I don't know.

Speaker 1

你怎么看待这一点?

What's what do make of that?

Speaker 2

一个有趣的方面是,我们发现智力与幻觉率之间并没有很强的相关性。

One interesting aspect is that we've found that there's not really a strong correlation between intelligence and hallucination rate.

Speaker 2

也就是说,模型在一般意义上的智能程度,并不与其在不了解某事时能够承认自己不知道的能力相关。

That's to say that the smarter the the models are in a general sense isn't correlated with their ability to, when they don't know something, say that they don't know.

Speaker 2

有趣的是,Gemini 3 Pro 预览版在这方面的表现相比 Gemini 2.5 Flash 和 2.5 Pro 有了巨大飞跃。

It's interesting that Gemini 3 Pro preview was a big leap here over Gemini 2.5 Flash and 2.5 Pro.

Speaker 2

但如果我在这里快速加上 Pro 的话

But and if I add Pro quickly here

Speaker 1

我猜 Pro 会非常出色。

I bet Pro's really good.

Speaker 1

实际上,我的意思是GPT Pro。

Actually, So I meant I meant the GPT Pros.

Speaker 1

哦。

Oh.

Speaker 1

对。

Yeah.

Speaker 1

因为GPT Pro据说...

Because GPT pros are rumored.

Speaker 1

我们并不确定它是不是八个运行,然后在上面加一个LM评判器。

We don't know for a fact that it's like eight runs and then with the LM judge on top.

Speaker 2

对。

Yeah.

Speaker 2

所以我们看到准确率有了很大提升,这就是它们答对的百分比。

So we saw a big jump in this is accuracy, so this is just percent that they get correct.

Speaker 2

Gemini 3 Pro知道的东西比其他模型多得多。

And Gemini three Pro knew a lot more than the other models.

Speaker 2

因此准确率大幅提升,但在Google Gemini模型的不同版本之间,变化相对不大。

And so big jump in accuracy, but relatively no change between the Google Gemini models between releases.

Speaker 0

还有幻觉率。

And the hallucination rate.

Speaker 2

没错。

Exactly.

Speaker 2

因此,这很可能是由于Claude模型之间不同的后训练方法所导致的。

And so it's likely due to just kind of different post training recipe between the Claude models that's driven this.

Speaker 0

是的。

Yeah.

Speaker 0

你部分地将责任归咎于我们对智能的定义——直到现在,我们都没有将幻觉视为智能中的负面因素。

You partially blame us on how we define intelligence, having until now not defined hallucination as negative in the way that we think about intelligence.

Speaker 0

因此,我们正在改变这一点。

And so that's what we're changing.

Speaker 1

我知道很多聪明人,他们自信地犯着错误。

I know many smart people who are confidently incorrect.

Speaker 0

看。

Look.

Speaker 0

看。

Look.

Speaker 0

那,那,那是非常人性化的。

That that that is Very human.

Speaker 0

非常正确。

Very true.

Speaker 2

在某些时候和场合,这是合适的。

And there's times and a place for that.

Speaker 2

我们的观点是,在涉及知识的语境中,幻觉率是有意义的,但在许多情况下,人们希望模型能进行创造性发挥,大胆尝试。

I think our view is that hallucination rate makes sense in this context where it's around knowledge, but in many cases, people want the models to hallucinate, to have a go.

Speaker 2

在编程或试图生成新想法时,通常就是这种情况。

Often, that's the case in coding or when you're trying to generate newer ideas.

Speaker 2

我们添加到人工分析中的一个评估指标是关键点,即非常困难的物理问题。

One eval that we added to artificial analysis is is is critical point, and it's really hard physics problems.

Speaker 1

好的。

Okay.

Speaker 1

这有点像人类评估类型,还是别的类型,或者说是前沿数学类型?

And Is is it sort of like a human eval type or something different or like a frontier math type?

Speaker 2

它与 FrontierMath 不无相似之处。

It's not dissimilar to FrontierMath.

Speaker 2

这些是一些研究问题,物理学领域的学者们能够解答。

So these are the kind of research questions that academics in the physics world would be able to answer.

Speaker 2

但模型很难回答这些问题。

But models really struggle to answer.

Speaker 2

因此,这里的最高得分是9%。

So the top score here is 9%.

Speaker 2

而创建这个基准的人,比如 Minwe,还有实际上是 SWE-bench 幕后推手的 Ofir,

And the people that created this, like, Minwe, and actually Ofir, who was kind of behind SWE-bench

Speaker 1

这个组织是什么?

What organization is this?

Speaker 1

还是这是普林斯顿?

Or is this it's Princeton?

Speaker 2

来自不同学术机构的一系列学者。

Kind of range of academics from from different academic institutions.

Speaker 2

非常聪明的人。

Really smart people.

Speaker 2

他们谈到如何提高模型的温度参数。

They talked about how they turn the models up in terms of the temperature.

Speaker 2

在尝试探索物理学中的新想法时,他们会把温度调到尽可能高,把模型当作思维伙伴,因为他们希望模型能产生幻觉。

As high temperatures as they can when they're trying to explore kinda new ideas in physics as a as a thought partner just because they they want the models to hallucinate.

Speaker 2

嗯。

Mhmm.

Speaker 1

嗯。

Mhmm.

Speaker 1

是的。

Yeah.

Speaker 1

有时他们会展示一些新的东西。

Sometimes they feature something new.

Speaker 1

嗯。

Yeah.

Speaker 1

没错。

Exactly.

Speaker 1

所以

So

Speaker 2

并不是在所有情况下都适用,但我觉得在合适的情境下测试幻觉是有道理的。

not right in every situation, but think it makes sense, you know, to test hallucination in scenarios where it makes sense.

Speaker 1

那么显而易见的问题是,这只是众多问题中的一个:每个实验室都有一个系统卡片,上面显示某种幻觉指标,而你选择不采纳它,而是自己制定了一套。

Well, so the obvious question, and this is one of many: every lab has a system card that shows some kind of hallucination number, and you've chosen not to endorse that and have made your own.

Speaker 1

我认为这是一种选择。

And I think that's a that's a choice.

Speaker 1

完全正确。

Totally.

Speaker 1

在某种意义上,其他的人工智能分析都依赖于公开的基准,其他人可以独立地重新运行。

In in some sense, the rest of artificial analysis is public benchmarks that other people can independently rerun.

Speaker 1

你将其作为一种服务提供。

You provide it as a service.

Speaker 1

在这里,你必须应对这样的质疑:我们凭什么来做这件事?

Here you have to fight the well, who are we to to like do this?

Speaker 1

你的回答是,我们拥有大量客户,你知道的。

And your your answer is that we have a lot of customers in, you know.

Speaker 1

但我想问的是,你们如何让整个行业达成一致,接受一个所有人都认同的指标?

But, like, I guess, how do you converge the industry on one number that everyone actually agrees is the right one?

Speaker 1

对吧?

Right?

Speaker 1

因为你们有你们的数字,他们有他们的数字,两者永远无法统一。

Because you have your numbers, they have their numbers, never the the two shall meet.

Speaker 0

我的意思是,我认为对于幻觉这一特定问题,存在多种你可能合理关注的不同方面,而这些方面需要采用截然不同的测量方式。

I mean, I think I think for hallucinations specifically, there are a bunch of different things that you might care about reasonably and that you'd measure quite differently.

Speaker 0

我们称之为 AA 全知指数(Omniscience)幻觉率,并不是想宣称它就是

Like, we've called this AA Omniscience's hallucination rate, not trying to declare it like it's

Speaker 1

人类最后的幻觉。

Humanity's last hallucination.

Speaker 0

你可以有一些有趣的命名规范之类的东西。

You can can have some interesting naming conventions and all this stuff.

Speaker 0

关于这个问题最宏观的答案,其实我想提一下,正如乔治所解释的,这也是一个关键点:随着我们向前推进,我们正在内部构建评估体系。

The biggest picture answer to that, something that I actually wanted to mention just as George was explaining Critical Point as well, is that as we go forward, we are building evals internally.

Speaker 0

我们正在与学术界和AI公司合作,共同打造优秀的评估体系。

We're partnering with academia and partnering with AI companies to build great evals.

Speaker 0

我们对AI栈的不同部分有着明确的看法,有些方面目前没有被很好地衡量,或者开发者关心的某些内容应该得到更充分、更精准的测量,我们打算去做这些工作。

We have pretty strong views, in various ways, for different parts of the AI stack, on where there are things that are not being measured well, or things that developers care about that should be measured more and better, and we intend to be doing that.

Speaker 0

我们并不执着于所有事情都必须完全由我们自己的团队来完成。

We're not obsessed necessarily with that everything we do, we have to do entirely within our own team.

Speaker 0

关键点是一个很好的例子,我们作为联合发起方与学术界合作推出了它。

Critical Point is a cool example of where we were a launch partner for it working with academia.

Speaker 0

我们即将与几家领先公司建立合作关系。

We've got some partnerships coming up with a couple of leading companies.

Speaker 0

这些合作,显然我们在一些独立事项上需要谨慎,但只要有适当的披露,我们就完全没问题。

Those ones, obviously, we have to be careful with on some of the independence stuff, but with the right disclosure, like, we're completely comfortable with that.

Speaker 0

许多实验室过去发布过一些优秀数据集,我们曾用它们独立取得了成功。

A lot of the labs have released great datasets in the past that we've used independently to great success.

Speaker 0

因此,结合所有这些方法,我们未来将发布更多内容。

And so it's between all those techniques we're gonna be releasing more stuff in the future.

Speaker 1

很棒。

Cool.

Speaker 1

我们来聊聊最后几个话题,然后谈谈你的趋势分析内容好吗?

Let's cover the the last couple, and then we'll why don't we talk about your trends analysis stuff?

Speaker 1

你知道的。

You know?

Speaker 0

当然。

Totally.

Speaker 0

在那之前,我其实想提一个小发现:如果你回到全知指数的准确率,这个准确率指标的一个有趣之处在于,它比我们测量的其他任何指标都更紧密地追踪模型的总参数量。

Before that, actually, I have one little factoid: if you go back up to accuracy on Omniscience, an interesting thing about this accuracy metric is that it tracks the total parameter count of models more closely than anything else that we measure.

Speaker 0

这在直觉上很有道理。

Makes a lot of sense intuitively.

Speaker 0

对吧?

Right?

Speaker 0

因为这是一个知识评估。

Because this is a knowledge eval.

Speaker 0

这是一个纯粹的知识指标。

This is the pure knowledge metric.

Speaker 0

我们并没有关注索引和幻觉率这些我们认为更多反映模型训练方式的指标。

We're not looking at the index and the hallucination rate stuff that we think is much more about how the models are trained.

Speaker 0

这仅仅关乎它们回忆起哪些事实,而确实,它与参数量的关联非常紧密。

This is just what facts do they recall, and, yeah, it tracks parameter count extremely closely.

Speaker 1

好的。

Okay.

Speaker 1

传闻中 Gemini 3 Pro 的规模是多少?

What's the rumored size of Gemini 3 Pro?

Speaker 1

需要明确的是,这并非来自任何官方来源的确认,

And to be clear, not confirmed by any official source,

Speaker 0

只是传闻而已。

just just rumors.

Speaker 1

但传闻总是到处流传。

But rumors do fly around.

Speaker 1

我听到各种各样的数字。

I hear all sorts of numbers.

Speaker 1

我不知道该相信哪个。

I don't know what to trust.

Speaker 0

如果你画出全知指数准确率与总参数量的关系线,我们有所有开源权重模型的数据。

So if you draw the line on Omniscience accuracy versus total parameters, we've got all the open weights models.

Speaker 0

你可以眯着眼看,目前领先的前沿模型很可能比开源权重模型所达到的 1 万亿参数还要大得多,也就是我们这里所讨论的这些模型。

You can squint and see that the leading frontier models right now are likely quite a lot bigger than the 1 trillion parameters that the open weights models we're looking at here cap out at.

Speaker 0

埃隆·马斯克最近透露了一个有趣的额外信息:关于xAI的Grok三和Grok四,参数量为3万亿,而Grok五则为6万亿,但Grok五尚未发布。

There's an interesting extra data point that Elon Musk revealed recently about xAI: 3 trillion parameters for Grok 3 and 4, and 6 trillion for Grok 5, but that's not out yet.

Speaker 0

把这些数据放在一起看。

Take those together.

Speaker 0

仔细看看。

Have a look.

Speaker 0

你可能会合理地认为,Gemini三Pro的参数量很可能更大,可能在五到十万亿的范围内。

You might reasonably form a view that there's a pretty good chance that Gemini three Pro is bigger than that, that it could be in the five to ten trillion parameter range.

Speaker 0

需要明确的是,我完全不知道。

To be clear, I have absolutely no idea.

Speaker 0

但仅根据这张图表,如果你仔细分析,就会得出这样的结论。

But just based on this chart, that's where you would land if you have a squint at it.
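
The "squint at the line" estimate can be made concrete as a toy regression. All data points below are invented; the only idea taken from the discussion is fitting accuracy against log-scale total parameters on open-weights models and inverting the fit to guess a closed model's size.

```python
# Illustrative sketch (all numbers invented): fit knowledge accuracy against
# log10(total parameters) on open-weights models, then invert the fit to
# place a closed model whose accuracy is known but whose size is not.
import math

# Hypothetical (total_params, accuracy) points for open-weights models.
open_models = [
    (7e9, 15.0),
    (70e9, 28.0),
    (400e9, 38.0),
    (1e12, 44.0),
]

# Ordinary least-squares line: accuracy = slope * log10(params) + intercept.
xs = [math.log10(p) for p, _ in open_models]
ys = [a for _, a in open_models]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

def estimated_params(accuracy):
    """Invert the linear fit: accuracy -> estimated total parameter count."""
    return 10 ** ((accuracy - intercept) / slope)

# A closed model scoring well above the open-weights ceiling lands in the
# multi-trillion range under this (very rough) extrapolation.
print(f"{estimated_params(55.0):.2e}")
```

This is only a back-of-the-envelope exercise, of course; sparsity, data mix, and training recipe all move a model off the line, which is exactly why it is a squint rather than a measurement.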

Speaker 1

是的。

Yeah.

Speaker 1

在某种程度上,我其实不太鼓励人们过多猜测,因为这到底有什么意义呢?

And to some extent, I actually kinda discourage people from guessing too much because what does it really matter?

Speaker 1

只要他们能以可持续的成本提供服务,那就够了。

Like, as long as they can serve it as a sustainable cost, that's about it.

Speaker 1

就像

Like

Speaker 0

是的。

Yeah.

Speaker 0

完全正确。

Totally.

Speaker 0

他们还

They've also

Speaker 2

与那些致力于支持他人自行部署的开源模型相比,这些实验室在大规模推理时有着不同的激励机制。

The labs doing inference at scale have got different incentives in play compared to, like, open weights models that are built to support others in self-deployment.

Speaker 2

在考虑推理成本时,我认为很多时候更关键的不是总参数量,而是活跃参数的数量。

It's, I think, less about total parameters in many cases when thinking about inference costs and and more around number of active parameters.

Speaker 2

因此,存在一种倾向于更大、更稀疏模型的激励。

And so there's a bit of an incentive towards larger, sparser models.

Speaker 1

同意。

Agreed.

Speaker 1

明白了。

Understood.

Speaker 0

嗯。

Yeah.

Speaker 0

很好。

Great.

Speaker 0

我的意思是,显然,如果你是使用这些技术的开发者或公司,正如你所说,这些都不重要。

I mean, obviously, if you're a developer or company using these things, none of exactly as you say, it doesn't matter.

Speaker 0

你应该关注我们衡量智能的各种方式。

You should be looking at all the different ways that we measure intelligence.

Speaker 0

你应该关注运行索引的成本,以及基于标价的令牌效率和成本效率的不同考量,因为这才是唯一重要的。

You should be looking at our cost-to-run-index number and the different ways of thinking about token efficiency and cost efficiency based on the list prices, because that's all that matters.

Speaker 1

对于内容创作者的谣言市场来说,这并不那么有利,因为我可以说,哦,GPT-4 是这么小的一个圈子。

It's not as good for the content creator rumor mill where I can say, oh, GPT four is this small circle.

Speaker 1

看看GPT五是一个大圈。

Look at GPT five is this big circle.

Speaker 1

而那个曾经一度是个热门话题。

And that and then that used to be a thing for a while.

Speaker 0

是的。

Yeah.

Speaker 0

我的意思是,但那本身其实是一个非常有趣的问题。

I mean, but that, on its own, is actually a very interesting one.

Speaker 0

对吧?

Right?

Speaker 0

就是这样吗?

Is it?

Speaker 0

好吧,仅仅从概率上讲,过去几年里这些模型的总规模并没有显著扩大。

Well, just purely that chances are the last couple of years haven't seen a dramatic scaling up in the total size of these models.

Speaker 0

所以是的。

And so Yeah.

Speaker 0

碰到

Hit

Speaker 1

瓶颈了。

a wall.

Speaker 0

模型总规模可能还有很大的提升空间,尤其是随着即将到来的硬件代际更新。

A lot of room to go up, probably, in total size of the models, especially with the upcoming hardware generations.

Speaker 1

是的。

Yes.

Speaker 1

所以,让我暂时摘下我那胡说八道的面具。

So, you know, taking off my shitposting face for a minute.

Speaker 1

是的。

Yes.

Speaker 1

是的。

Yes.

Speaker 1

与此同时,我确实觉得,尤其是从欧洲回来后,人们普遍认为伊利亚可能是对的,这种范式再想扩展几个数量级已经很难了,因此我们需要开始探索至少一条不同的路径。

At the same time, I do feel like, you know, especially coming back from Europe, people do feel like Ilya is probably right that the paradigm doesn't have many more orders of magnitude to scale out, and therefore we need to start exploring at least a different path.

Speaker 1

GDP Val,我觉得它发布大概只有一个多月。

GDP Val, I think it's only, like, a month or so old.

Speaker 1

它刚发布时,我也非常看好。

I was also very positive when it first came out.

Speaker 1

我实际上联系过Tejal,她是那项研究的主要负责人。

I actually talked to Tejal who was the the lead researcher on that.

Speaker 0

好的。

Okay.

Speaker 1

你也有自己的版本。

And you have your own version.

Speaker 2

这是一个绝佳的数据集。

It's a fantastic dataset.

Speaker 1

是的。

Yeah.

Speaker 1

也许我该给那些还蒙在鼓里的人做个简要回顾。

Maybe I will recap for people who are still out of the loop.

Speaker 1

它包含44项任务,基于某种GDP门槛,旨在代表不仅仅是编程的广泛白领工作。

It's like 44 tasks based on some kind of GDP cutoff that's like meant to represent broad white collar work that is not just coding.

Speaker 0

没错。

Yep.

Speaker 0

是的。

Yeah.

Speaker 0

每项任务都有大量详细的说明,其中许多任务还附带输入文件。

Each of the tasks have a whole bunch of detailed instructions, some input files for a lot of them.

Speaker 0

在这44项任务中,又细分为大约220个子任务,每项任务大约二到五个,这正是我们通过智能体框架运行的粒度。

Within the 44, it's divided into, like, 220 subtasks, maybe two to five per task, and that's the level at which we run them through the agentic harness.

Speaker 0

而且,它们真的很有意思。

And, yeah, they're really interesting.

Speaker 0

我要说,它并不一定涵盖人们在工作中做的所有事情。

I will say that it doesn't necessarily capture, like, all the stuff that people do at work.

Speaker 0

没有完美的评估基准。

No eval is perfect.

Speaker 0

总会有一些更多的事情需要关注,主要是因为为了使这些任务足够完善以便运行,它们只能包含少量的输入文件和非常具体的指令。

There's always gonna be more things to look at, largely because in order to make the tasks well enough defined that you can run them, they need to only have a handful of input files and very specific instructions for that task.

Speaker 0

因此,我认为最简单的理解方式是,它们就像是面试过程中可能遇到的、难度较高的回家考试任务。

And so I think the easiest way to think about them is that they're, like, quite hard take-home exam tasks that you might do in an interview process.

Speaker 1

是的。

Yeah.

Speaker 1

对于听众来说,这不再是一个很长的提示。

For listeners, it is no longer, like, just a long prompt.

Speaker 1

它更像是一个包含电子表格、PowerPoint演示文稿或PDF文件的压缩包。

It is like, well, here's a zip file with, like, a spreadsheet or a PowerPoint deck or a PDF.

Speaker 1

去尽情发挥,回答这个问题吧。

Go go nuts and answer this question.

Speaker 2

是的。

Yeah.

Speaker 2

OpenAI发布了一个很棒的数据集,并发表了一篇优秀的论文,分析了不同网络聊天机器人在该数据集上的表现。

OpenAI released a great dataset, and they released a good paper which looks at performance across the different, you know, web chatbots on the dataset.

Speaker 2

这是一篇很棒的论文。

It's a great paper.

Speaker 2

鼓励大家去阅读。

Encourage people to read it.

Speaker 2

我们所做的,是将这个数据集转化为一个可以在任何模型上运行的评估任务。

What we've done is taken that dataset and turned it into an eval that can be run on any model.

Speaker 2

因此,我们创建了一个参考性的智能体框架,可以在该数据集上运行模型,然后我们开发了一种AI驱动的评估方法来比较输出。

So we created a reference agentic harness that can run the models on the dataset, and then we developed an evaluation approach to compare outputs that's kinda AI-enabled.

Speaker 2

它使用Gemini 3 Pro预览版来比较结果,我们对此进行了非常全面的测试,以确保其与人类偏好保持一致。

So it uses Gemini three Pro preview to compare results, which we tested pretty comprehensively to ensure that it's aligned to human preferences.

Speaker 2

其中一个数据点是,即使作为评估器,Gemini 3 Pro在GDP val AA上的表现也并不理想。

One data point there is that, even as the evaluator, Gemini three Pro, interestingly, doesn't actually do that well in GDP Val AA.

Speaker 1

是的。

Yeah.

Speaker 1

使用LLM作为评判者时,你需要警惕的是自我偏好,模型通常更偏爱自己的输出。

The thing that you have to watch out for with LLM-as-judge is self-preference: models usually prefer their own output.

Speaker 1

在这种情况下,并没有。

And in this case, it was not.

Speaker 0

完全正确。

Totally.

Speaker 0

我认为我们现在考虑在哪些地方使用大语言模型作为评判者的方式,与几年前早期的LLM作为评判者的方法截然不同。

I think the way that we're thinking about the places where it makes sense to use an LLM-as-judge approach now is quite different to some of the early LLM-as-judge stuff a couple of years ago.

Speaker 0

因为其中一些,比如MT-Bench,就是一个很棒的项目。

Because some of that, and MT-Bench was a great project.

Speaker 0

它几年前是一个很好的例子,当时主要是评判对话和一些风格类的内容。

It was a good example of some of this a while ago; it was about judging conversations and, like, a lot of style-type stuff.

Speaker 0

在这里,评分者和评分模型所执行的任务与参加测试的任务大不相同。

Here, the task that the grader and grading model is doing is quite different to the task of taking the test.

Speaker 0

当你参加测试时,你可以使用所有的智能工具。

When you're taking the test, you've got all of the agentic tools.

Speaker 0

你会使用代码解释器、网络搜索、文件系统,经过多次迭代来尝试生成文档。

You're working with the code interpreter, web search, and the file system, going through many, many turns to try to create the documents.

Speaker 0

另一方面,我们进行评分:通过一个管道提取文件的视觉和文本版本并提供给Gemini,同时提供任务的标准,让它从两个候选结果中选出更符合任务标准的那个。

Then on the other side, we're grading it: we're running it through a pipeline to extract visual and text versions of the files to provide to Gemini, and we're providing the criteria for the task and getting it to pick which of two potential outputs more effectively meets the criteria.

Speaker 0

事实证明,它在判断正确性方面非常出色,很多时候与人类偏好高度一致,因为我认为它具备原始的智能,再加上对输出的正确表征——这些输出是通过与评分模型工作方式截然不同的智能体任务生成的,而我们是根据标准进行比较,而不是仅仅进行零样本提问让模型选择哪个更好。

It turns out it's just very, very good at getting that right, matching human preference a lot of the time, because I think it's got the raw intelligence combined with the correct representation of the outputs, the fact that the outputs were created with an agentic task that is quite different to the way the grading model works, and the fact that we're comparing against criteria, not just zero-shot asking the model to pick which one is better.
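A minimal sketch of that criteria-based pairwise grading might look like the following. Here `call_judge` is a hypothetical stand-in for a call to the judge model (Gemini 3 Pro in the episode), and the prompt wording is an illustrative assumption, not Artificial Analysis's actual pipeline; randomizing presentation order is one common guard against position bias.

```python
# Hedged sketch of criteria-based pairwise grading with an LLM judge.
# `call_judge` is a hypothetical stand-in for the judge-model API call.
import random

def judge_pair(criteria: str, output_a: str, output_b: str, call_judge) -> str:
    """Return 'A' or 'B': which output better meets the task criteria."""
    # Randomize presentation order to reduce position bias in the judge.
    flipped = random.random() < 0.5
    first, second = (output_b, output_a) if flipped else (output_a, output_b)
    prompt = (
        "Task criteria:\n" + criteria + "\n\n"
        "Output 1:\n" + first + "\n\n"
        "Output 2:\n" + second + "\n\n"
        "Which output better meets the criteria? Answer '1' or '2'."
    )
    verdict = call_judge(prompt).strip()
    if verdict == "1":
        return "B" if flipped else "A"
    return "A" if flipped else "B"
```

Because the judge sees only rendered representations of the outputs and the task criteria, the grading task stays quite different from the test-taking task, which is part of why self-preference is less of a concern here.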

Speaker 1

明白了。

Got it.

Speaker 1

为什么使用ELO评分,而不是像GDP Val那样的百分比?

Why is this an ELO and not a percentage like GDP Val?

Speaker 2

所以输出看起来像文档,有些任务还会生成视频或音频输出。

So the outputs look like documents, and there's video outputs or audio outputs from some of the tasks.

Speaker 1

所以它必须生成一个视频?

And So it has to make a video?

Speaker 0

是的。

Yeah.

Speaker 0

对于某些任务来说。

For some of the tasks.

Speaker 0

有些

Some of

Speaker 2

任务。

the tasks.

Speaker 1

是哪个任务?

What task is that?

Speaker 1

我的意思是,它

I mean, it's

Speaker 2

它在数据集里。

It's in the dataset.

Speaker 1

可能是YouTuber,或者是个营销视频。

It might be a YouTuber or It's a marketing video.

Speaker 0

哦,什么?

Oh, what?

Speaker 0

比如,模型需要去互联网上寻找片段并尝试拼接起来。

Like, model has to go find clips on the Internet and try to put it together.

Speaker 0

目前,这些模型在做这件事上还不够好,这一点要明确。

The models are not that good at doing that one for now, to be clear.

Speaker 0

用代码解释器来做这件事相当困难,计算机操作本身也还不够好,等等。

It's pretty hard to do that with a code interpreter, and computer use itself doesn't work quite well enough, and so on and so on.

Speaker 0

但是

But

Speaker 2

是的。

yeah.

Speaker 2

因此,没有明确的基准可以用来比较,以计算正确率。

And so there's no kind of ground truth necessarily to compare against, to work out percentage correct.

Speaker 2

很难界定什么是正确或错误的。

It's hard to come up with correct or incorrect there.

Speaker 2

所以这是相对的,我们使用ELO评分来比较各个模型在任务之间的输出。

So it's on a relative basis, and so we use an ELO approach to compare outputs from each of the models on each task.
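As a concrete illustration of that relative scoring, an ELO system folds pairwise verdicts into per-model ratings with a standard update rule. This is a minimal sketch; the K-factor, initial rating, and data shape are illustrative assumptions, not Artificial Analysis's actual parameters.

```python
# Minimal sketch of ELO ratings computed from pairwise judge verdicts.

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the standard ELO model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(comparisons, k: float = 32.0, initial: float = 1000.0) -> dict:
    """Fold a list of (model_a, model_b, winner) verdicts into ratings."""
    ratings: dict = {}
    for a, b, winner in comparisons:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        score_a = 1.0 if winner == a else 0.0
        ea = expected(ra, rb)
        # Winner gains, loser loses, proportional to how surprising the result was.
        ratings[a] = ra + k * (score_a - ea)
        ratings[b] = rb + k * ((1.0 - score_a) - (1.0 - ea))
    return ratings
```

One property that makes this useful for a leaderboard: new models can be added later by comparing them against already-rated ones, without re-running every old pair.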

Speaker 1

你知道你应该做什么吗?

You know what you should do?

Speaker 1

你应该雇一个承包商真人来做同样的任务,然后给它一个ELO评分。

You should pay a contractor, a human, to do the same task and then give it an ELO.

Speaker 1

这样你就有了人类的基准。

And then you have a human baseline.

Speaker 1

关于GDP值,OpenAI的那个方法的有用之处在于,50%被设定为普通人类的水平。

I think what's helpful about GDP Val, the OpenAI one, is that 50% is meant to be a normal human.

Speaker 0

是的。

Yes.

Speaker 0

而且

And

Speaker 1

而领域专家的水平可能高于这个标准。

and and maybe domain expert is higher than that.

Speaker 1

但50%是衡量标准,比如,只要你超过了50%,你就算是超人类了。

But 50% was the bar: like, well, if you've crossed 50, you are superhuman.

Speaker 0

没错。

Yeah.

Speaker 0

所以我们实际上并没有将这个分数完全建立在这一点上。

So we, like, haven't grounded this score in that exactly.

Speaker 0

我同意这可能有帮助,但我们希望将这一点推广到大量模型上。

I agree that it can be helpful, but we wanted to generalize this to a very large number of models.

Speaker 0

这正是展示ELO非常有帮助的原因之一,它让我们能够添加更多模型,并且长期保持相关性。

It's one of the reasons that presenting an ELO is quite helpful and allows us to add models, and it'll stay relevant for quite a long time.

Speaker 0

我也认为,将这些具体任务与人类表现进行比较可能会有些棘手,因为人类的做法与模型的做法截然不同。

I also think it can be tricky looking at these exact tasks compared to human performance, because the way that you would go about it as a human is quite different to how the models would go about it.

Speaker 1

是的。

Yeah.

Speaker 1

我也喜欢你把Llama Four Maverick包含进来了。

I also like that you included Llama four Maverick in there.

Speaker 1

那是,比如,只是最后一个,嗯,不是。

Is that, like, just one last, like Well, no.

Speaker 0

不是。

No.

Speaker 0

不。

No.

Speaker 0

不。

No.

Speaker 0

不。

No.

Speaker 0

不。

No.

Speaker 0

它是Meta发布的最佳模型,因此目前仍被包含在主页默认集合中。

It is the best model released by Meta, and so it makes it into the homepage default set, still, for now.

Speaker 2

另一个有趣的纳入是我们也在最新的网页聊天机器人上运行了它。

Another inclusion that's quite interesting is we also ran it across the latest versions of the web chatbots.

Speaker 2

所以我们有。

And so we have.

Speaker 1

哦,对了。

Oh, that's right.

Speaker 1

哦,抱歉。

Oh, sorry.

Speaker 1

呃,是的。

I yeah.

Speaker 1

我完全没注意到。

I completely missed that.

Speaker 1

好的。

Okay.

Speaker 2

不。

No.

Speaker 2

完全不是。

Not at all.

Speaker 2

所以那个有棋盘格图案的。

So that's the one with the checkered pattern.

Speaker 1

所以你的意思是,那个是他们的装备,而不是你的?

So so that is their harness, not yours, is what you're saying?

Speaker 2

没错。

Exactly.

Speaker 2

真正有趣的是,如果你比较一下,比如使用Claude网页聊天机器人时的Claude 4.5 Opus,它的表现比我们代理式框架中的模型要差。

And what's really interesting is that if you compare, for instance, Claude 4.5 Opus using the Claude web chatbot, it performs worse than the model in our agentic harness.

Speaker 2

嗯。

Mhmm.

Speaker 2

因此,在每种情况下,我们的代理式框架中的模型表现都优于它们自己创建的网页聊天机器人版本。

And so in every case, the model performs better in our agentic harness than its web chatbot counterpart, the harness that they created.

Speaker 1

哦,我对此的反向解释是,它本来是为消费级用例设计的,而你们却在用它做别的事情。

Oh, my backwards explanation for that would be that, well, it's meant for consumer use cases, and here you're pushing it for something.

Speaker 0

约束条件不同,你能给予模型的自由度也不同。

The constraints are different and the amount of freedom you can give the model is different.

Speaker 0

而且,还得有成本目标。

Also, like, have a cost goal.

Speaker 0

我们,是的。

We Yeah.

Speaker 0

让模型尽可能长时间地运行。

Let the models work as long as they want, basically.

Speaker 1

是的。

Yeah.

Speaker 1

你会手动复制粘贴到聊天机器人里吗?

Do you copy paste manually into the chatbot?

Speaker 0

是的。

Yep.

Speaker 0

是的。

Yep.

Speaker 0

这就是我们获得聊天机器人基准数据的方式。

That's how we got the chatbot reference results.

Speaker 0

是的。

Yep.

Speaker 0

我们不会像运行那样频繁地更新这些内容

We're not gonna be keeping those updated at, like, quite the same scale as running

Speaker 2

on the

Speaker 0

数百个模型上

hundreds of models on

Speaker 1

嗯,那个,你知道的,我不确定。

Well, so, you know, I don't know.

Speaker 1

和基于浏览器的聊聊。

Talk to a browser-based one.

Speaker 1

他们会为你自动化这个过程,你知道的,比如

They'll they'll automate it for you, you know, like

Speaker 2

确实如此。

It's true.

Speaker 1

是的。

Yep.

Speaker 2

对。

Yeah.

Speaker 2

我们应该。

We should.

Speaker 1

是的。

Yeah.

Speaker 1

我确实想过,我们应该把这些聊天机器人版本变成API,因为它们本身就是真正不同的代理。

I have thought about, like, well, we should turn these chatbot versions into an API, because they are legitimately different agents in themselves.

Speaker 2

是的。

Yes.

Speaker 2

对吧?

Right?

Speaker 0

没错。

Yep.

Speaker 0

在过去一年里,这已经有了巨大的增长。

And that's grown a huge amount over the last year.

Speaker 0

对吧?

Right?

Speaker 0

比如,我认为可用的工具在各大聊天机器人应用之间实际上已经出现了相当大的分歧,你能连接的数据源也大大增加,这意味着你的使用体验和方式比以往任何时候都更加不同。

Like, the tools that are available have actually diverged, in my opinion, a fair bit across the major chatbot apps, and the number of data sources that you can connect them to has gone up a lot, meaning that your experience and the way you're using the model is

Speaker 1

是的。

Yeah.

Speaker 0

比以往任何时候都更不同。

More different than ever.

Speaker 1

当你提到有趣的地方时,具体是指哪些工具和数据连接呢?

What tools and what data connections come to mind when you say it's interesting?

Speaker 1

有哪些值得注意的成果是人们已经完成的?

What what what what's notable work that people have done?

Speaker 0

哦,好的。

Oh, okay.

Speaker 0

我最喜欢的一个例子是,直到最近,我还认为让大语言模型帮我起草一封有用的邮件几乎是不可能的,因为大多数时候你发邮件并不是为了写而写。

So my favorite example on this is that until very recently, I would argue that it was basically impossible to get an LLM to draft an email for me in any useful way because most times that you're sending an email, you're not just writing something for the sake of writing it.

Speaker 0

很可能,你需要的上下文是一大堆历史邮件。

Chances are context required is a whole bunch of historical emails.

Speaker 0

也许是您做的笔记。

Maybe it's notes that you've made.

Speaker 0

也许是会议记录。

Maybe it's meeting notes.

Speaker 0

也许是从你工作中存储资料的任何地方提取内容。

Maybe it's pulling something from, like, wherever your work stores stuff.

Speaker 0

对我来说,比如谷歌云盘、OneDrive,以及我们需要做分析或数据处理时使用的Supabase数据库。

So for me, like, Google Drive, OneDrive, and our Supabase databases if we need to do some analysis or some data work or something.

Speaker 0

理想情况下,模型可以接入所有这些地方并执行有用的任务。

Preferably, model can be plugged into all of those things and can go do some useful work.

Speaker 0

目前让我印象最深、也让我对2025年底的表现感到有些惊讶的是,我可以让模型使用Supabase MCP进行查询。

The things that I find most impressive currently, that I am somewhat surprised work really well in late twenty twenty-five, are that I can have models use the Supabase MCP to query.

Speaker 0

当然是只读模式,运行大量SQL查询来进行相当深入的数据分析,生成图表等,并能读取我的Gmail和Notion内容。

Read only, of course, run a whole bunch of SQL queries to do pretty significant data analysis and make charts and stuff and can read my Gmail and my Notion.

Speaker 1

好的。

And okay.

Speaker 1

你真的在用这个。

You actually use that.

Speaker 1

这很好。

That's good.

Speaker 1

这确实很好。

That's that's that's good.

Speaker 1

这是Claude的功能吗?

Is that a Claude thing?

Speaker 0

程度各有不同,但ChatGPT和Claude都可以,是的。

To varying degrees, both ChatGPT and Claude, yeah.

Speaker 0

现在。

Right now.

Speaker 0

公平地说,这些东西现在几乎没法用。

I would say that, in fairness, this stuff, like, barely works right now.

Speaker 0

好的。

Okay.

Speaker 0

好的。

Okay.

Speaker 0

比如

Like

Speaker 1

因为人们在听到后真的会去尝试这个。

because people are actually gonna try this after they hear it.

Speaker 2

如果你收到迈克的邮件,很可能不是由聊天机器人写的。

If you get an email from Micah, odds are it wasn't written by chatbot.

Speaker 0

不。

No.

Speaker 0

所以是的。

So yeah.

Speaker 0

我认为确实如此,我从未真正用聊天机器人起草过邮件发给任何人。

I think it is true that I have never actually sent anyone an email drafted by a chatbot yet.

Speaker 0

所以

And so

Speaker 1

但你能感觉到。

But you can you can feel it.

Speaker 1

对吧?

Right?

Speaker 1

没错。

And Yeah.

Speaker 1

明年这个时候,我们再回来看看进展如何。

This time this time next year, we'll come back and see where it's going.

Speaker 1

完全正确。

Totally.

Speaker 1

Supabase,致敬另一位著名的新西兰人。

Supabase, shout out another famous Kiwi.

Speaker 1

是的。

Yeah.

Speaker 1

我不知道你有没有和他聊过关于AI构建或AI基础设施的任何具体话题。

I don't know if you've had any conversations with him about anything in particular on AI building and AI infra.

Speaker 2

我们曾通过Twitter私信联系过他,因为我们是Supabase的重度用户,甚至可以说是高级用户,我们可能在Supabase里做了些本该自动化却手动完成的操作。

We have had Twitter DMs with him because we're quite big Supabase users, and power users, and we probably do some things more manually than we should in Supabase.

Speaker 1

所以他是你的客服支持,因为你属于Q级用户?

So he's just the support line because you're you're a Q list?

Speaker 1

算是吧。

A little bit.

Speaker 2

是的。

Yeah.

Speaker 2

他非常友好。

Been super friendly.

Speaker 2

关于GDP Val AA还有一点:基于模型相对聊天机器人的更优表现,我们意识到,我们构建的参考框架在通用型智能体任务上其实表现相当不错。

One extra point regarding GDP Val AA is that, on the basis of the models' overperformance compared to the chatbots, we realized that, oh, the reference harness that we built actually works quite well on generalist agentic tasks.

Speaker 2

嗯。

Mhmm.

Speaker 2

这在某种程度上证实了这一点。

This proves it in a sense.

Speaker 2

因此,这个代理框架非常简洁。

And so the agent harness is very minimalist.

Speaker 2

我认为它借鉴了Claude Code中的一些理念。

I think it follows some of the ideas that are in Claude Code.

Speaker 2

我们赋予它的只有上下文管理能力、网页搜索和浏览工具、代码执行环境。

And all that we give it is context management capabilities, a web search and web browsing tool, and a code execution environment.

Speaker 2

还有别的吗?

Anything else?

Speaker 0

我的意思是,我们可以给它配备更多工具,但默认情况下,就这些了。

I mean, we can equip it with more tools, but, like, by default, yeah, that's it.

Speaker 0

我们为GDP Val专门提供了一个查看图像的工具,因为模型可以通过终端把文本内容拉进上下文,但要把视觉内容引入上下文,就得给它们一个自定义工具。

We give it, for GDP Val specifically, a tool to view an image, because the models, you know, can just use a terminal to pull stuff in text form into context, but to pull visual stuff into context, we had to give them a custom tool.
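The minimalist harness described here, a model looping over a small tool set until it submits a final answer or exhausts its turn budget, can be sketched roughly as follows. `call_model` and the tool entries are hypothetical placeholders, not the actual Stirrup implementation.

```python
# Rough sketch of a minimalist agent loop: the model repeatedly requests
# tool calls (web search, code execution, file/image viewing) until it
# submits a final answer or hits the turn cap.

def run_agent(task: str, call_model, tools: dict, max_turns: int = 100):
    """Drive one agentic episode; returns the final answer or None."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        # The model either requests a tool call or submits a final answer.
        action = call_model(history)
        if "final" in action:
            return action["final"]
        tool = tools.get(action["tool"])
        result = tool(**action["args"]) if tool else "unknown tool"
        history.append({"role": "tool", "content": str(result)})
    return None  # turn budget exhausted
```

The `tools` dict is where a custom capability like the image-viewing tool would be registered alongside the defaults, without changing the loop itself.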

Speaker 0

是的。

Yeah.

Speaker 0

是的。

Yeah.
