本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
Tri Dao在过去几年里推动了人工智能领域许多最重要的突破。他是Flash Attention的主要作者,这项工作是模型推理成本大幅下降的关键原因。他通过Mamba方面的工作深度参与了状态空间模型的研究,这是Transformer架构的一种替代方案。作为Together公司的首席科学家,他的工作影响如此深远,以至于SemiAnalysis基本指出,英伟达护城河的重要部分就在于Tri在英伟达生态系统中做的所有这些工作。能邀请到他是Unsupervised Learning播客的真正荣幸。
Tri Dao has driven many of the most important breakthroughs in AI over the past years. He was a lead author of Flash Attention, which has been a key reason why model inference costs have gone down so much. He was super involved in state space model work via his work on Mamba, an alternative architecture to transformers. He's the chief scientist at Together, and his work's been so impactful that SemiAnalysis basically noted that a huge part of the NVIDIA moat is the fact that Tri's doing all this work in the NVIDIA ecosystem. It's a real privilege for Unsupervised Learning.
我能够与Tri这样的人坐下来,询问他任何最关心的问题,包括两三年后将在英伟达芯片上运行的工作负载比例。我们讨论了AI硬件的未来以及机会所在。我们谈到AI编程产品如何让Tri的工作效率提高了50%,这远高于我的预期。我们还讨论了他对Transformer替代架构未来的看法,以及什么可能真正有效。与领域内如此杰出的人物交谈真是难得的机会。
I get to sit down with someone like Tri and just ask him anything that's top of mind, including the percent of workloads that are gonna run on NVIDIA chips in two, three years. We talked about the future of AI hardware and where the opportunities are. We talked about how AI coding products have made Tri 50% more efficient in his work, which was way higher than I'd expected. And we also talked about, you know, his thoughts on the future of alternative architectures to transformers and what might really work. Just a fascinating opportunity to talk with someone who's so brilliant in his field.
真的非常非常享受这次对话,我想大家也会喜欢的。闲话少说,有请Tri。Tri,非常感谢你来参加播客。真的很感激。
Really, really enjoyed this conversation, and I think folks will too. Without further ado, here's Tri. Well, Tri, thanks so much for coming on the podcast. Really appreciate it.
是的。当然。非常期待这次对话。
Yeah, yeah, for sure. I'm really excited about this.
我感觉当我两三年前开始这个播客时,你就在我的梦想嘉宾名单上。那时候Flash Attention刚开始流传,大家都在惊叹你对这个领域的影响,我当时就想,如果我们能慢慢把这个播客做起来,也许能说服Tri来参加。
I feel like when I started this podcast, you know, two and a half, three years ago, you were on my dream guest list. I think that was when Flash Attention was initially making the rounds and people were oohing and ahhing over the impact you were having on the space, and I was like, if we really can build this up over time, maybe we'll convince Tri to come on.
是的,能来这里我感到非常荣幸。应该会很有趣的,没错。
Yeah, yeah, I mean, I'm very honored to be here. I mean, should be really fun, yeah.
是的。我们会讨论一系列听众可能感兴趣的话题。我想从一个经典的VC问题开始,还请多包涵。英伟达过去几年的故事令人难以置信地印象深刻,达到了他们现在的规模。显然,你为那个生态系统贡献巨大,让硬件变得越来越好用。我们会在芯片方面、在封装GPU的系统方面看到英伟达生态系统的竞争者吗?
Yeah. So, I think we'll hit on a bunch of different things that will be of interest to our listeners. I figured one place to start would be a classic VC question, so you'll have to forgive me. The NVIDIA story over the past years has just been unbelievably impressive, to get to the scale they're at. And obviously, you've contributed a tremendous amount to that ecosystem and to making, you know, that hardware better and better for folks to use. Are we gonna see competitors to different parts of the NVIDIA ecosystem, on the chip side, on the kind of systems that package GPUs?
显然,谷歌和亚马逊有自己的芯片,但对行业绝大多数其他公司来说,我相信这是你经常思考的问题。
Obviously, you've got, you know, Google and Amazon with their own chips, but for the vast majority of the rest of the industry, I'm sure this is something you think about a lot.
是的。我花了不少时间思考芯片问题。肯定有很多竞争者试图进入这个领域。AMD已经在这里一段时间了。显然,英伟达占据主导地位有几个原因。
Yeah. I spend a fair amount of time thinking about the chips. Certainly there are lots of competitors trying to enter this space. I mean, AMD has been here for a while. Obviously, NVIDIA is dominant for a couple of reasons.
他们设计了非常好的芯片,并且开发了非常优秀的软件。这创建了一个生态系统,让人们在此基础上构建,开发软件。但我认为肯定会有很多竞争者进入这个领域。随着工作负载在架构方面开始更加集中,比如Transformer和MOE等,为这类工作负载设计芯片变得稍微容易一些,对吧?所以如果只专注于推理,我认为AMD有某些优势。
They design very good chips, and they build very good software. And that creates this ecosystem where people build on that, people build software on top of that. But I think certainly we'll see a lot of competitors entering the space. As the workloads start to coalesce a little bit more on the architecture side, transformers and MoE and so on, it becomes a little bit easier to design chips for that kind of workload, right? So if you just focus on inference, I think AMD has certain advantages.
它们拥有更大的内存等等。所以我们开始看到人们选择这些产品。在训练方面,难度稍大一些。我的意思是,网络是主要瓶颈之一,而英伟达在这方面确实占据优势。但我认为人们都明白,构建优秀训练芯片有哪些挑战,构建优秀推理芯片又有哪些挑战。
They have larger memory and so on. So we're starting to see folks picking that up. On the training side, it's a little bit more difficult. I mean, networking is one of the main bottlenecks, and NVIDIA certainly has an edge there. But I think people understand what the challenges are to build good training chips, and what the challenges are to build good inference chips.
所以归根结底在于执行力。因此我认为这个领域实际上非常非常令人兴奋。我与许多正在设计新芯片的人交流,无论是推理芯片还是训练芯片。所以我预计在未来几年内,部分工作负载可能会变得多芯片化。它们可能会在不同的芯片上运行,而不是像现在这样——我估计大约90%都在英伟达芯片上运行。
So it comes down to execution. So I would say this space is actually really, really exciting. I talk to lots of folks who are designing new chips, either inference or training. So I would expect in the next couple of years, maybe some of the workload will become multi-silicon. It'll probably run on different chips, rather than, right now I'd say, I don't know, 90% on NVIDIA.
您认为架构已经足够稳定了吗?因为显然,我认为芯片设计最吸引人的部分在于:您本质上是在对未来两三年推理工作负载和训练工作负载的形态下注。因此显然,对此越有把握,进行优化就越容易。您觉得现有稳定性是否足以支撑这些决策?还是说本质上就是有数十家初创公司和企业在各自下注,其中可能只有一两家能成功?
Do you think the architectures are stable enough? Because obviously, I think one of the fascinating parts of chip design is you're basically making a bet two, three years in the future of what inference workloads and training workloads will look like. And so obviously, the more certainty you have around that, the easier it is to make optimizations. Yeah. Does it feel like there's enough stability to make those bets, or is the idea that basically, look, there's dozens of startups and companies that are making bets, one or two of them may work out?
没错,确实如此。我认为在架构方面,从很高层次来看,似乎已经围绕Transformer架构趋于稳定。但仔细审视后会发现,其实仍然存在不少变化。最近两三年的一个重要变化就是混合专家模型(Mixture of Experts)。通过这种方式,可以使模型变得更大、参数更多,但更加稀疏。
Right, right, yeah. So, I think on the architecture side, from a very high level, it looks like it has sort of stabilized around the transformer. But once you look a little bit closer, there's actually quite a bit of change around that. So the most recent one in the last couple years has been mixture of experts, where you make the model much larger, with many more parameters, but sparser.
是的,这需要某些权衡。您知道,需要更多内存,但可能计算量会稍小一些。即便如此,这也给某些芯片制造商带来了困难,因为他们可能原本是为稠密模型设计的——计算非常均匀,现在却要面对稀疏计算,设计难度就增加了。或者像注意力机制这样的组件也在变化——虽然注意力机制从2010年代就已经存在,至今已有十多年了。
Yeah. And so that has certain tradeoffs. You know, you need more memory, but maybe the amount of compute you do is a little bit smaller. And so even that presents difficulty for some of the chipmakers, because maybe they were designing for a dense model where it's very uniform compute, and now you have something sparse, so it's a little bit harder to design for. Or some of the stuff is changing around attention, which has been around since the 2010s, for ten-plus years.
但各处的变化确实让某些方面变得困难。比如DeepSeek提出的多头潜在注意力机制,就与传统注意力机制有所不同。例如他们的头维度非常大。如果您是系统设计师,矩阵乘法引擎有特定尺寸,可能就无法适配。一旦深入底层,就会遇到这类特殊问题。
But there are changes here and there that actually make some of this stuff difficult. So DeepSeek has this thing, multi-head latent attention, which looks a little bit different than attention. For example, they have a very large head dimension. If you're a systems designer, the matrix multiplication engine has a certain size, and maybe it doesn't fit. Weird things like that once you look under the covers.
所以我认为这是架构层面的一个挑战。而在工作负载方面,我认为人们正在将这些模型用于非常不同类型的任务。我们有传统的聊天机器人,这里的"传统"指的是过去两三年的东西。
So I would say that's one on the architecture side. And then on the workload side, I think people are using these models for very different kinds of workloads. There's the traditional chatbot, where traditional means the last
两三年。我知道,您在这个领域深耕多年,算是传统AI了,对吧。
two, three years. I know, you've been around for years, so it's traditional AI now. Right.
但现在出现了新的工作负载类型,比如Cursor和Windsurf这类编程辅助工具。还有智能体工作负载——不仅需要运行模型,还要进行工具调用、运行Python解释器、执行网络搜索等。这对芯片设计提出了新挑战。如果芯片过于聚焦于尽可能快地运行模型,就可能会忽略如何连接主机执行网络搜索这类功能。所以我认为,尽管从高层次看架构已趋于稳定,但底层仍然变化不断,部分原因在于工作负载本身也在演变。
But there are new workloads, like this kind of coding workload, like Cursor and Windsurf and so on. There's a kind of agent workload where it's not so much just running the model, but also making tool calls, running a Python interpreter, doing web search and so on. That presents challenges for how you design chips. If the chip is very much focused on just running the model as fast as possible, you might neglect things like, oh, how you can connect to the host to run a web search, that kind of thing. So I would say even though from a high level the architecture has kind of stabilized, underneath the covers there's still a lot going on, and part of it is the workload is also changing.
因此这是一场持续的竞赛:看你能多快执行并适应这些新的工作负载。
So it's a constant race of how fast you can execute and adapt to these new workloads.
那么,如果现在90%的这些工作负载都在NVIDIA芯片上,你认为两三年后我们会处于什么位置?
So if 90% of these workloads are on NVIDIA chips today, where do you think we are in kind of two, three years?
是的,我认为在推理方面,部分工作负载会多样化,我们开始看到像Cerebras、Groq和SambaNova这样的公司真正构成了严峻挑战。他们正在推销说,嘿,我们能实现极低延迟的推理,对吧?这对某些用例来说非常棒。所以我们与那些非常、非常关心尽可能低延迟的客户交流,他们愿意为此支付更多费用,对吧?因此我认为我们开始看到更多这样的细分领域,有人非常关心低延迟,或者有些人非常关心大批量、高吞吐量的推理,比如大规模数据处理或合成数据生成或强化学习训练,这些场景需要尽可能多地生成轨迹。
Yeah, I think on the inference side, some of it will diversify, and we're starting to see companies like Cerebras and Groq and SambaNova really presenting a serious challenge. They're pitching, hey, we can get very, very low latency inference, right? And that's fantastic for some of the use cases. So we talk to customers who really, really care about as fast, as low latency as possible, and they're willing to pay more for that, right? So I think we're starting to see more of these niche areas where people really care about low latency, or some people really care about very, very large batch, high throughput inference, for things like massive data processing or synthetic data generation or RL training, where you have to roll out and generate as many trajectories as possible.
所以我会说它会多样化,仅仅因为工作负载将多样化。完全正确。无论是低延迟还是高吞吐量或其他什么,也许是视频生成,这将需要在计算需求和内存需求方面有不同的配置。
So I would say it's going to diversify simply because the workloads are going to diversify. Totally. Whether low latency or high throughput or something else, maybe video generation, that will require a different profile in terms of how much compute you need and how much memory you need.
如果你是一家初创公司,现在似乎很清楚,嘿,考虑到所有这些用例,这里有各种不同的优化方向。然后你决定,嘿,我们要做一款芯片来优化这些方面。你等待流片和实际生产能力,等到你完成这些,已经是几年后了。你认为不同的优化方向是否足够明确,以至于你可以做出这种前瞻性的赌注,还是存在这样的风险,实际上两三年后如果我们再做这期节目,你会说,嗯,实际上重要的优化是这五个我们没谈到的东西。
If you were a startup today, it kind of feels clear in the moment, hey, here's all the different optimizations one might wanna make given all these use cases. And then you kind of decide, hey, we're gonna make a chip that optimizes for these things. And you wait for tape out and the actual ability to produce, and by the time you've done that, it's a few years down the line. Do you think that it's clear enough what the different optimizations are gonna be such that you can make that forward looking bet, or is there such a risk that actually, two, three years from now if we're doing this episode, you're like, well, actually the optimizations that matter are these five things that we didn't talk about?
对,所以我会说,如果你是初创公司,你必须下注,对吧?你投资,你知道这一点,而且你必须下重注。所以,你可以赌,嘿,也许聊天机器人会消失,人们真正、真正关心的是,我不知道,视频模型、视频生成模型、世界模型,或者机器人技术之类的,对吧?你下那个注,然后说,好吧,那可能会占,我不知道,50%的工作负载。我们该如何为那种工作负载设计,对吧?
Right, so I would say, if you're a startup, you have to make a bet, right? You invest, you know this, and you have to make an outsized bet. So, you could bet that, hey, maybe chatbots will go away, and the thing that people really, really care about is, I don't know, video models, video generation models, a world model, or robotics, or something, right? And you make that bet, and you say, okay, that's probably gonna be, I don't know, 50% of the workload. How would we design for that workload, right?
你希望你的赌注是正确的。我的意思是,这就是初创公司的角色,对吧?我认为这可能是你产生影响的方式。我认为如果你不下一个重注,而只是说,我要为通用工作负载做优化,那么现有巨头就会在执行上超越你。
And you hope that your bet will be correct. I mean, that's the role of a startup, right? And I think that's probably the way you could make an impact. I think if you don't place a bet, if you say, I'm just gonna optimize for the general workload, then the incumbents are gonna just out-execute you.
我很兴奋我们今天做这期节目,因为昨天我像往常一样刷Twitter,我觉得SemiAnalysis有一条非常有趣的推文,是对你最近一些工作的回应。显然,你在优化英伟达芯片上的模型方面做了这么多。我感觉,你知道,有人可能会试图量化你为更广泛的英伟达生态系统增加的价值,但这显然是巨大的。而且我认为SemiAnalysis的那些人像是在说,天啊,你为这个生态系统增加了这么多价值,为什么AMD或者,你知道,谷歌或其他一些有芯片的公司不尝试付钱让你在那些领域做类似的优化?这显然让我想起了LLM世界,在那里,你知道,你会看到为那些真正能够构建和训练这些模型的少数研究人员开出的巨额薪酬方案。
I was excited we're doing this episode today, because yesterday I was just scrolling Twitter, as I'm wont to do, and I thought there was this really interesting tweet from SemiAnalysis in reaction to some of your recent work. So obviously, you've done so much on optimizing models on NVIDIA chips. And I feel like, you know, one could try and quantify the value you've added to the broader NVIDIA ecosystem, but it's clearly massive. And I think the SemiAnalysis folks were like, God, you've added so much value to this ecosystem, why wouldn't AMD or, you know, Google or some of the other folks that have chips try to pay you to do similar optimizations in those spaces? And it kind of, you know, obviously reminded me of the LLM world, where, you know, you've seen kind of massive packages for the few researchers that are really able to build and train these models.
我想知道你对那条推文的反应是什么,以及你总体上对硬件领域类似动态的看法。
I wonder what your reaction was to that tweet and generally what you think about similar dynamics in the hardware space.
是的,是的,我个人与来自NVIDIA、AMD、谷歌和亚马逊等公司的很多人合作。我花了很多时间在NVIDIA芯片上,仅仅因为这是我们现在拥有的。这是大多数人使用的。我认为他们设计了非常好的芯片,并为它们设计了非常好的软件。所以这让我能够做一些有趣的事情。
Yeah, yeah, I personally work with lots of folks from NVIDIA, AMD, Google, and Amazon, and so on. I spend a lot of my time on NVIDIA chips simply because this is what we have right now. That's what most people use. I think they design very good chips, and they design very good software for that. So that allows me to sort of do interesting things.
你知道,这就是我追求的,比如你能做有趣的事情吗?我们曾与AMD的人合作过,例如,他们有一个版本的Flash Attention,我们与他们合作将其集成到公共代码库等等。所以我们肯定与其中一些公司合作过。我不知道最好的安排是什么,但现在我更多思考的是我们需要什么样的抽象,不仅仅是为了NVIDIA芯片,而是为了GPU和加速器整体。我花了很多时间在最低层,试图弄清楚如何从这些芯片中获得最大性能。
You know, that's kind of what I'm after, like can you do interesting things? We've worked with the folks at AMD; for example, they had a version of Flash Attention, and we worked with them to integrate that into the public repo and so on. So certainly we've worked with some of them. I don't know what the best arrangement would be, but nowadays I'm thinking a lot more about what abstractions we need, not just for NVIDIA chips, but for GPUs and accelerators in general. I've spent a bunch of time at the lowest level trying to figure out how to get the maximum performance out of these chips.
但随着Together规模扩大,我们会思考如何让新招聘的员工也能高效工作,对吧?其中一部分是构建能在英伟达芯片以及可能其他芯片上运行的抽象层。另一件让我兴奋的事情是,我们应该构建什么样的抽象层,让AI能为我们完成部分工作。我认为答案尚不明确,但作为人类和技术领导者,关键在于构建正确的抽象层,以便其他人能够非常快速地融入,让你所做的工作能够跨不同芯片或不同工作负载运行。
But as we scale Together, we think about, oh, how do we get other people that we hire to be productive, right? And part of it is building abstractions that will work on NVIDIA chips and potentially other chips as well. The other thing I'm excited about is what kind of abstractions we build so that AI can do some of this work for us. And I think the answer isn't quite clear yet, but I think as humans, as technical leaders, it's about building the right abstractions so that other people can get onboarded really, really quickly, and so that the kind of things you do can work across different kinds of chips or different kinds of workloads.
你认为我们现在是否已经拥有能够跨不同芯片组进行工作的抽象层?
Do you think we're at a place where there are abstractions that do allow you to do work across the different set of chips out there?
我认为我们有一些,但这里存在常见的权衡,对吧?Triton非常出色,它支持英伟达芯片、AMD GPU和英特尔GPU等,这需要他们设计一种前端,然后有后端,不同公司会为后端贡献代码。我认为Triton实际上相当不错,很多公司都在押注它。例如Meta的PyTorch,torch.compile会直接生成Triton代码,并依赖Triton为英伟达或AMD生成代码。但这就是常见的权衡:如果你不控制最低层级,可能会牺牲一些性能。
I think we have some, right, but there's the usual trade off, right? Triton is really, really good, and it supports NVIDIA chips, supports AMD GPUs and Intel GPUs and so on, and that requires them to design sort of a front end, and then there's a back end, and different companies will contribute code to the back end. I think Triton's actually quite good, and a lot of companies are betting on that. Meta, for example, with PyTorch: torch.compile will just generate Triton code and rely on Triton to generate code for NVIDIA or AMD. But that's the usual trade off, which is, hey, if you don't control the lowest level, maybe you give up some performance.
关键在于这个权衡点在哪里。如果你牺牲5%的性能但能获得三倍的生产力提升,那当然是可接受的权衡。但如果你牺牲太多性能,那么人们可能会选择更……
And it's about where that trade off point is. If you give up 5% performance and you become three x more productive, for sure, that's an acceptable trade off. But if you give up too much performance, then maybe people would go with something that's a little bit more...
尤其是在竞争相当激烈的推理市场。对吧。
Especially in a pretty competitive inference market. Right.
所以我认为为人类设计真的非常困难。我会说硬件可移植性某种程度上是个神话。简单来说,即使对于英伟达芯片,代际之间变化也很大。我认为这是我们能从这些芯片中获得更多性能的唯一方式。它不像CPU那样,年复一年可能只快5%到10%,而你旧的代码还能照常运行。
So I'd say designing for humans is really, really hard. I would say hardware portability is kind of a myth. Simply, you know, even for NVIDIA chips, generation to generation, they change a lot. I think that's kind of the only way we get more performance out of these chips. It's not like a CPU where, oh, year to year maybe you get five to 10% faster, and your old code just kind of works.
即使对英伟达来说,每一代他们都基本上需要重写所有代码。是的。因为获得更多浮点运算能力的方法是构建更专业的组件,可能是更低精度,可能是芯片不同部分之间不同的同步方式,对吧?所以英伟达基本上每两年左右也要重写他们的软件。因此,硬件可移植性,即使是代际之间,也还没有完全实现。
Even for NVIDIA, every generation, they essentially have to rewrite all the code. Yeah. Because the way to get more flops is you build more specialized components, maybe lower precision, maybe a different way to synchronize between different parts of the chips, right? So NVIDIA also essentially rewrites their software every two years or so. So hardware portability, even between generations, isn't quite there.
所以这些抽象层即使只是帮助你在同一制造商的每一代后续芯片上工作,也很有价值。
So those abstractions would be valuable even just to help you with each subsequent generation of chips from the same manufacturer.
对,没错。所以我认为Triton有非常引人注目的抽象层,他们甚至还有一些更低层级的工具,比如我认为非常新的Gluon,它以可能通用性较低为代价,更多地暴露硬件细节,对吧?Modular团队一直在构建一种叫做Mojo的语言。是的,是的。
Right, yeah. So I think Triton has very compelling abstractions, and they even have something that's a little bit lower level, I think this is super new, it's Gluon, where it exposes more of the hardware at the cost of maybe being less general, right? The folks at Modular have been building this language called Mojo. Yeah, yeah.
你对他们正在做的事情有什么看法?
What do you think of what they're doing?
我认为这非常酷。我觉得他们有一些正确的抽象概念。我认为关键在于执行,因为你知道,人们会关注它,然后他们会问,你们在英伟达芯片上的速度有多快?这在某种意义上是一个不太公平的问题,但你知道,这是人们关心的。所以他们设计抽象概念,然后必须做一些定制工作,让他们的代码在英伟达芯片上运行得非常非常好,然后对AMD芯片做一点定制,对吧?
I think it's super cool. I think they have some of the right abstractions. It's just about execution, I would say, because, you know, people are gonna look at it, and they're gonna be like, okay, how fast are you on NVIDIA chips? Which is a really sort of, in some sense, unfair question, but, you know, that's what people care about. So they would design the abstractions, and then they have to do some custom work to make their code run really, really well on NVIDIA chips, and then they do a little bit of custom for AMD chips, right?
所以这就像你愿意做多少定制来换取性能。
So it's just about how much customization you want to do, trading off performance.
因为市场要求最好的推理性能
Because, you see, the market demands the best inference performance.
在英伟达上,我们会看到越来越多这样的库或领域特定语言。斯坦福的人有ThunderKittens,试图抽象其中一些东西。还有TileLang。谷歌有Mosaic GPU。我肯定我还漏掉了一些,对吧?
And on NVIDIA we'll see more and more of these libraries or domain specific languages. You know, folks at Stanford have ThunderKittens, sort of trying to abstract some of this. There's TileLang. Google has Mosaic GPU. I'm sure I'm forgetting some, right?
但我认为人们意识到一个问题,就是我们还没有正确的抽象概念。训练新工程师编写高性能GPU内核实际上相当痛苦。所以答案是构建抽象概念。我认为我们正处于一个迭代非常非常快的阶段。这就是为什么我们看到这么多领域特定语言出现,对吧?
But I think people realize that there's a problem: we don't have the right abstractions yet. It's actually kind of painful to train new engineers to write very performant GPU kernels. So the answer is to build abstractions. And I think we're very much in the phase where we're iterating pretty, pretty quickly. That's why we see so many of these domain specific languages coming out, right?
然后随着AI模型变得越来越好,我正在思考的一个问题是如何为语言模型设计领域特定语言或抽象概念?因为它们的运作方式与人类有些不同。我们不知道正确答案是什么。所以我认为在未来一两年内,这会变得清晰很多。现在,是的,每个人都在尝试很多不同的方向。
And then as the AI models get better, one of the things I'm thinking about is how do we design domain specific languages or abstractions for language models? Because they operate somewhat differently from humans. We don't know what the right answer is. So I would say in the next one or two years, I think it's gonna become a lot clearer. Right now, yeah, everybody is trying lots of different directions.
你认为这些抽象概念最可能来自哪里?
Where do you think those abstractions are most likely to come from?
是的,我认为这种抽象概念,人们从两个角度来接近它。一个是从机器学习方面。他们思考,好吧,我们有什么样的工作负载,表达这些工作负载需要哪些原语?所以很多工作负载,比如推理,很大程度上是如何尽可能快地移动内存?推理通常是内存受限的,对吧?
Yeah, for this abstraction, I would say people approach it from two angles. One is from the machine learning side. They think about, okay, what kind of workloads do we have, and what are the primitives that are necessary to express those workloads? So a lot of workloads, let's say inference, are very much about how can I move memory as fast as possible. Inference is usually memory bound, right?
或者如何尽可能快地进行矩阵乘法,以及我想在那里表达哪些原语?另一个角度是人们从硬件方面思考。他们在芯片上有非常酷的专用组件,正在思考通过怎样的抽象来暴露这些组件。英伟达一直非常擅长的是设计更异步的芯片,因为矩阵乘法变得如此之快,以至于其他一切都相对变得慢得多。所以将矩阵乘法与所有其他计算重叠起来就变得非常重要。
Or how do I do matrix multiplies as fast as possible, and what are the primitives I want to express there? The other angle is people thinking from the hardware side. They have really cool specialized components on the chips, and they're thinking about what abstractions to expose those components through. One of the things NVIDIA has been really good at is designing chips that are more asynchronous, because matrix multiplication is getting so fast, it's ridiculous, that everything else becomes much slower. So it's much, much more important to overlap the matrix multiplications with everything else.
那么如何设计允许这种异步执行的抽象概念,比如流水线、同步等等?所以我认为抽象概念将来自工作负载方面或硬件方面。我认为可能在一两年内会变得清晰很多。
So how do you design abstractions that allow this kind of asynchronous execution, things like pipelines and synchronizations and so on? So I'd say that's where the abstractions are going to come from, either from the workload side or from the hardware side. I think it will become a lot clearer in maybe one or two years.
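顺便用一个粗略的验算来说明前面"推理通常受内存限制"的观点。下面是一个假设性的Python示意(芯片的算力和带宽数字纯属举例,并非出自对话本身):比较解码时矩阵-向量乘法的算术强度与硬件的"机器平衡点"。
As a rough sketch of the "inference is usually memory bound" point above, here is a hypothetical back-of-envelope calculation (the compute and bandwidth figures are made-up illustrations, not from the conversation): it compares the arithmetic intensity of a decode-time matrix-vector multiply against a chip's flops-per-byte balance point.

```python
def arithmetic_intensity_matvec(n: int, m: int, bytes_per_param: int = 2) -> float:
    """Flops per byte moved for an (n x m) weight matrix times a length-m vector,
    assuming weight traffic dominates (true for single-token decoding)."""
    flops = 2 * n * m                      # one multiply + one add per weight
    bytes_moved = n * m * bytes_per_param  # each weight is read once per token
    return flops / bytes_moved

# Hypothetical accelerator: 1000 TFLOP/s of matmul compute, 3 TB/s of HBM bandwidth.
machine_balance = 1000e12 / 3e12  # ~333 flops needed per byte to keep the chip busy

ai = arithmetic_intensity_matvec(8192, 8192)  # 1.0 flop/byte at 16-bit weights
# ai << machine_balance, so decoding is limited by memory movement, not flops --
# which is why quantization and KV-cache tricks matter so much for inference.
print(ai, machine_balance, ai < machine_balance)
```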
嗯,你多次提到这一点,所以我必须问你。你谈到了让这些抽象可能适合LLM,以及在流程本身中使用AI。你在多大程度上使用AI本身来确定这些东西?以及你认为未来几年这会如何变化?
Well, you've alluded to this a few times, so I have to ask you. You talked about making these abstractions suitable for LLMs, as well as using AI in these processes themselves. How much are you using AI itself in figuring this stuff out, and how do you see that changing over the next few years?
是的,是的。我认为模型开始在这方面变得有用了。实际上,就在最近。这真的让我很惊讶。在某种程度上,人们追求的是完全自动化的GPU内核编写。
Yeah, yeah. I think models are starting to become useful for this stuff. Actually, very, very recently. This really surprised me. On one level, what people are after is fully automatic GPU kernel writing.
你描述问题,语言模型就会为你生成内核。也许我们在其他一些领域已经做到了,比如简单的Python脚本、数据分析或Web前端,对吧?是的,你可以做到,对吧?那么我们能为GPU内核做到这一点吗?
You describe the problem and the LLM would just generate the kernel for you. Maybe we're there with some of the other areas, like, I don't know, simple Python scripts or data analysis or web front ends, right? Yeah, you can do this, right? So could we do that for GPU kernels?
直接写一些内核代码。
Just vibe code some kernels.
对,对。我想说,如果那是你想要的,那么我们还在非常非常早期的阶段。这些模型可以生成一些简单的内核。比如逐元素操作,你输入一个数组,对每个元素进行一些操作。或者进行一些归约操作,比如求和和一些归一化之类的。
Right, right. I would say if that's what you want, then I think we're very, very early. So these models can generate some simple kernels. Like element-wise ops, where you take in an array and do some operation on each element. Or you do some kind of reduction, like summing, and some normalization, things like that.
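作为参照,下面用NumPy给出这两类"简单内核"的参考语义(这只是CPU上的示意实现,不是GPU代码;生成的Triton等内核通常会对照这类参考输出做正确性检查):
For reference, here are the semantics of those two "simple kernel" shapes sketched in NumPy (a CPU-side illustration, not GPU code; a generated Triton kernel would typically be checked against reference outputs like these):

```python
import numpy as np

def elementwise_gelu(x: np.ndarray) -> np.ndarray:
    """Element-wise kernel shape: the same op applied independently to each element.
    (tanh approximation of GELU, a common pointwise activation)"""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def rmsnorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Reduction kernel shape: sum over the last axis, then normalize by the result."""
    ms = np.mean(x * x, axis=-1, keepdims=True)  # the reduction (a summation)
    return x / np.sqrt(ms + eps)                 # the element-wise normalization

x = np.random.randn(4, 128).astype(np.float32)
print(elementwise_gelu(x).shape, rmsnorm(x).shape)
```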
所以模型可以做得相当不错。但一旦变得稍微复杂一点,这些模型就无法生成正确的代码。我认为这只是因为我们没有足够的训练数据。训练数据真的非常非常难获取,因为如果你从网上爬取内核代码,你会得到,我不知道,可能是课堂作业。你会得到可能是三代以前的GPU文档。
So the models can do a reasonably good job. But once it gets a little bit more complicated, these models just don't generate correct code. I think it's just a function of us not having enough training data. Training data is really, really tough for this stuff, because if you scrape the internet for kernel code, you'll get, I don't know, maybe class projects. You'll get documentation that was meant for GPUs maybe three generations ago.
这些都是你现在不应该做的事情。所以训练数据真的非常困难。所以我认为答案是,你可能需要从一些专家级数据开始,然后从中生成合成数据,或者连接到像编译器和性能分析器这样的工具,这样你就可以获得大量的训练数据,或者获得正确的环境,对吧?我认为这在一两年内会得到解决,但这确实是一个难题。
These are things you shouldn't be doing now, kind of thing. So training data is really, really difficult. So I think the answer to that is you probably have to start with some expert level data, and then you generate synthetic data out of that, or hook up to tools like a compiler and a profiler so that you can get lots of training data, or get the right environment, right? I think it will be solved in a year or two, but it's certainly a difficult problem.
谁拥有这些数据?
Who has that data?
我不认为,你知道,这类数据是私有的。有几个地方有专家级的代码,但我认为更多的是关于工作流程。你如何从少量的专家级数据开始,生成大量的合成数据?对吧?所以我认为Discord GPU Mode的一些人一直在努力做这件事。
I don't think, you know, this kind of data is private. There are a couple of places where you have expert level code, but I think it's more about the workflow. How do you start from a small amount of expert level data and generate lots of synthetic data? Right? So I think the folks at the GPU MODE Discord have been really trying to do this.
他们采用了编译器,PyTorch编译器,它可以将PyTorch代码转换为Triton代码,这是一种更低级的内核代码。他们可以生成,我不知道,我想是15,000对这样的PyTorch和Triton程序。你需要有点创意。我认为互联网上没有那么多数据,所以你需要有点创意,思考如何生成这类数据。所以我认为这是一个角度。
So they took the PyTorch compiler, which goes from PyTorch code to Triton code, which is this lower level kernel code. And they could generate, I don't know, I think 15,000 pairs of these PyTorch and Triton programs. You have to be a little bit creative. I think there's just not that much data on the internet, so you have to be a little bit creative about how you're going to generate this kind of data. So I'd say that's one angle.
如果你想要完全自动化的内核生成,我认为我们还处于非常早期的阶段。另一方面是,它能否与人类协同工作?模型能否与人类协作?而我惊喜地发现这些模型实际上相当有用。是的,
If you want fully automatic kernel generation, I'd say we're super early there. The other side is, can it work alongside humans? Can models work alongside humans? And I've been pleasantly surprised that these models are actually quite useful.
有没有某个特定时刻让你觉得,哇,这些模型已经好到确实能帮上忙了?
Was there a specific moment where you were like, whoa, these models have gotten good enough to be quite helpful?
是的,是的。我想说最近可能有两个里程碑。一个是o3。o3在推理方面变得非常出色。你知道,有些问题我同时用o3和GPT-5进行头脑风暴时,我会问:嘿,我有这个函数,该怎么优化它?
Yeah, yeah. I would say maybe there were two recent milestones. One was o3. o3's gotten really good at reasoning. You know, for some of the questions I was brainstorming with o3 and GPT-5, I was like, hey, I have this function, how would I optimize it?
我应该关注哪些方面?它在高层次思考上出奇地好。另一个是Claude Code。不知为何,它在编写Triton内核方面实际上相当不错,这太棒了。尽管我很喜欢编写内核,但我大部分时间都在思考设计,思考我们应该设计什么样的架构来充分利用硬件?
What are the things I would pay attention to? And it's surprisingly good at the high level. The other one is Claude Code. Somehow, it's actually pretty decent at writing Triton kernels, which is fantastic. As much as I love writing kernels, a lot of my time is thinking about the design, thinking about, oh, what kind of architecture should we design so that we can take advantage of the hardware?
所以实现部分是这样的:设计真的很有趣,但实现通常相当繁重。
And so there's the implementation part of it: the design is really fun, but the implementation is usually quite heavy.
是的。
Yeah.
对吧?而Claude Code在这方面被证明相当有帮助。就像,我会说,它可能让我的生产力提高了1.5倍。
Right? And Claude Code turned out to be quite helpful here. Like, I would say, you know, it makes me maybe 1.5x more productive.
哇,我的意思是,这确实相当
Wow, I mean, that's Yeah, pretty
所以我一直是Claude Code的重度用户。如果这些模型能与人类协同工作,也许它们会比完全依赖它们自动生成要有帮助得多。
So I've been a heavy user of Claude Code. If you have these models working alongside humans, maybe they are a lot more helpful, rather than just relying on them fully automatically generating things.
你在等待的下一个里程碑是什么?我猜当新模型发布时,你会测试一些东西,然后想:天啊,如果模型能达到这个阶段,我的效率就能从1.5倍提升到2倍。
What are the next milestones you're waiting for? I guess when a new model comes out, are there things you test, where you're like, God, if the models could just get to this stage, I'd go from 1.5x to 2x?
没错。所以,我认为Claude Code是一个很好的阶跃式变革的例子,它更具代理性,对吧?某种程度上,他们通过后训练让Claude在这方面表现非常出色。我相信其他公司,如OpenAI和谷歌,很快也会达到非常相似的水平。这里的'代理性'只是指它能很好地使用工具,并且知道何时使用这些工具。
Right. So, I would say, I think Claude Code is a good example of kind of a step change, where it's a lot more agentic, right? And somehow, you know, they post-trained Claude to do really well there. And I'm sure other folks, OpenAI and Google, will soon get to a very, very similar point. So agentic here just means it can use tools really well, and it knows when to use the tools.
所以它知道,'嘿,我可能没有正确的API,那我该如何查找API?'或者'程序编译不通过,或者运行速度不够快,我该如何从性能分析器获取信息?'诸如此类。因此对于新模型,我会关注它们是否清楚自己不知道什么。比如它们何时需要寻找新信息?
So it knows that, hey, I'm probably not using the right API, so how do I look up the API? Or, oh, the program is not compiling, or the program's not fast, how do I get information from the profiler? That sort of thing. So I would say for new models, I would see how well they know what they don't know. Like, when do they need to seek out new information?
这有点模糊。我认为人们开始为这类代理能力制定基准测试,但我们在这方面仍处于非常早期的阶段。
And that's kind of a vague thing. I think people are starting to come out with benchmarks for these kinds of agentic capabilities, but we're still super early there.
自ChatGPT发布以及这些模型更广泛使用以来,您一直是AI推理市场诸多改进的关键推动者。所以我想或许为我们的听众做个背景介绍会很好,回顾这三年是什么推动了在降低成本和改善延迟方面的大量改进,我也想确保给您机会谈谈FlashAttention等工作的
You've been such a seminal part of so many of the improvements in the AI inference market since the launch of ChatGPT and the broader usage of a lot of these models. So I figured, maybe just for our listeners, it would be great to contextualize these last three years and what's driven so much of the improvement on the lowering-costs and improving-latency side, and I wanna make sure to give you the opportunity to talk about a lot of the Flash Attention work as
当然,当然。是的,我认为在过去几年里,推理成本可能下降了约100倍?
well. For sure, for sure. Yeah, I think in the last couple of years, inference costs have probably come down maybe 100x?
是的,至少。
Yeah, at least.
自ChatGPT首次亮相以来,我认为这也反映在API成本上。一方面,在建模方面,人们用相同数量的参数训练出了好得多的模型,对吧?部分原因是使用了更多数据,部分我认为是更好的架构,比如混合专家模型(MOE)肯定有帮助。
Since ChatGPT's debut, I think it's probably reflected in the API cost as well. So on one side, on the modeling side, people have just trained much, much better models for the same number of parameters, right? Part of it is much more data. Part of it is, I think, better architectures, right? MoE certainly helps.
人们发明了各种高效注意力机制等等,对吧?这是建模方面。模型在参数量相同的情况下变得好得多。另一方面是推理优化。我们在这方面看到了技术的爆炸式增长,对吧?
People invent new sorts of efficient attention and so on, right? So that's on the modeling side. The model's getting much, much better for the same number of parameters. The other side is on the inference optimization. And we've just been seeing an explosion of techniques there, right?
我认为早期,我们整体上并不理解推理的瓶颈是什么。一段时间后,人们意识到,嘿,很大程度上是关于数据移动——将权重移入移出内存,移动KV缓存(这是注意力操作中需要存储整个历史以进行下一个预测的部分),对吧?所以很多优化围绕如何减少数据移动?比如可以对模型进行量化,对吧?通常,也许两三年前,每个参数使用16位是常态,但现在8位相当普遍,新模型甚至使用4位。
I think in the early days, we didn't, as a whole, we didn't understand what the bottlenecks of inference are. And after a while, people realized, hey, it's a lot about data movement: moving the weights in and out of the memory, moving the KV cache, which is the part of the attention operation where you need to store the entire history to sort of make the next prediction, right? So a lot of optimizations are around how do we reduce the data movement? So things like, you could quantize the model, right? Instead of 16 bits per parameter, which was typical maybe two, three years ago, eight bits is pretty common now, and new models are using four bits.
有些工作使用1或2位,相当疯狂。但你看到了这种引人注目的权衡。当你量化模型时,很多时候几乎不会损失任何质量,这需要相当复杂的技术,但效果非常好。所以我认为OpenAI最新的发布,gpt-oss,他们发布的大部分层都是4位的,对吧,这很疯狂。它有,我想,是1200亿个参数,但因为每个参数只用4位,你可以把它塞进60GB左右的空间,对吧?
There's some work using one or two bits, which is pretty crazy. But you see this compelling trade-off. As you quantize the model, a lot of times you pretty much don't lose any quality. It requires fairly sophisticated techniques, but it has worked amazingly well. So in the most recent OpenAI release, gpt-oss, most of the layers they released are in four bits, right, which is crazy. It has, I think, 120 billion parameters, but because each is using four bits, you could fit that in 60 gigabytes or something, right?
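As a quick sanity check on those numbers (a back-of-the-envelope sketch; 1 GB is taken as 10^9 bytes, and real deployments also need room for activations and KV cache on top of the weights):

```python
# Back-of-the-envelope memory footprint of model weights at different precisions.
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Gigabytes needed to store the weights (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 120e9  # ~120B parameters, as mentioned for gpt-oss
for bits in (16, 8, 4):
    print(f"{bits:2d}-bit: {weight_memory_gb(params, bits):.0f} GB")
```

At 16 bits that's 240 GB, at 8 bits 120 GB, and at 4 bits the ~60 GB figure mentioned above.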
这已经转化为非常、非常好的推理性能。所以量化是一方面。另一方面是模型架构与硬件的协同设计,因为人们理解得越来越深入,本质上双方人员之间的交流也更多了。我认为Flash Attention就是一个例子,其中既有非常懂硬件的人,也有非常懂算法和模型的人。但当这些想法结合起来时,你会开始看到,哦,我确切地知道硬件端的瓶颈是什么,那么我该如何改变算法呢?
And that has translated to really, really good inference performance. So quantization is one. The other one is this co-design of model architecture and the hardware, as people understand each side much, much better; essentially, people on both sides are talking to each other a lot more. I think Flash Attention was kind of like that, where there were people who understood the hardware really, really well, and people who understood the algorithm and the model really, really well. But as you combine these ideas, you start to see, oh, I see exactly what the bottleneck on the hardware side is, so how do I change the algorithm?
对于Flash Attention,我们意识到内存访问是主要瓶颈,并找到了一种重写注意力算法以减少内存访问的新方法。我们在推理中看到了很多这样的例子。其中一个例子是DeepSeek的多头潜在注意力变体,他们意识到,嘿,很多时间都花在压缩和移动KV缓存(即历史摘要)上,对吧?他们发现,嘿,通过他们称之为潜在投影的方法,可以在一个更小的空间进行投影,从而使KV缓存变得小得多,对吧?结果证明这非常有效。而且DeepSeek在论文中表示,他们能够非常高效地服务这个模型。
And for Flash Attention, we realized memory access was the main bottleneck, and we figured out a new way to rewrite the attention algorithm to reduce memory access. And we're seeing that a lot for inference. So an example is this variant of attention from DeepSeek called multi-head latent attention, right, where they realized, hey, a lot of time is being spent on compressing and moving the KV cache, which is a summary of the history, right? And they realized, hey, we can make the KV cache much, much smaller by, they call it this latent projection, where you project onto a much smaller space, right? And that turns out to work really, really well. And DeepSeek, I think in the paper, said they were able to serve this model very, very efficiently.
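To make the latent-projection idea concrete, here is a rough per-token KV-cache comparison. All dimensions below are hypothetical illustrations, not the actual DeepSeek model configuration:

```python
# Rough per-token KV-cache size: standard multi-head attention vs. a
# latent-projection scheme that caches one compressed vector per layer.
def standard_kv_bytes(layers: int, heads: int, head_dim: int, bytes_per_el: int = 2) -> int:
    # Keys and values for every head at every layer (the factor of 2 is K and V).
    return 2 * layers * heads * head_dim * bytes_per_el

def latent_kv_bytes(layers: int, latent_dim: int, bytes_per_el: int = 2) -> int:
    # One compressed latent vector per layer, expanded into K/V on the fly.
    return layers * latent_dim * bytes_per_el

std = standard_kv_bytes(layers=60, heads=128, head_dim=128)
lat = latent_kv_bytes(layers=60, latent_dim=512)
print(f"standard: {std} B/token, latent: {lat} B/token, ratio: {std / lat:.0f}x")
```

With these made-up dimensions the latent cache is 64x smaller per token, which is the kind of saving that makes long-history serving much cheaper.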
这是一个例子。另一个是混合专家模型(MoE),对于每个查询、每个标记,你并不使用模型中的所有参数。存在一定的稀疏性,所以你只使用模型的一部分。在过去几年里,我认为趋势是让模型越来越稀疏。我想Mistral的第一个开源MoE模型大约是一两年前,他们在每层激活了八个专家中的两个,对吧?所以比例是25%。但现在对于DeepSeek和OpenAI的模型,比如OpenAI的gpt-oss,我认为是激活了128个中的四个之类的,对吧?
So that's one example. The other one is mixture of experts, where for each query, for each token, you don't use all of the parameters in the model. There's some sparsity, so you use some part of the model, and in the last couple years, I think the trend has been making the models sparser and sparser. I think the first open-source MoE model from Mistral was maybe a year, two years ago; they activated two experts out of eight experts per layer, right? So it's a ratio of 25%. But now for the DeepSeek and OpenAI models, you activate, I think for OpenAI's gpt-oss, it was four out of 128 or something, right?
比例是32倍。我们正变得越来越稀疏。所以,如果你服务大量用户,这真的非常非常好。因此,人们真正理解了工作负载,他们真正在将架构与推理进行协同设计。我认为很多改进都来自于此,即模型与推理栈之间的这种协同设计。
It's a factor of 32. And we're getting sparser and sparser. So that's really, really good if you serve lots and lots of users. So people are really understanding the workload, they're really co-designing the architecture with the inference. And I would say that's where a lot of improvement has come from, this co-design between the model and the inference stack.
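The sparsity ratios quoted here are just the expert-routing fractions (real MoE models also have dense shared layers, so the overall active-parameter fraction differs), but the arithmetic is simple:

```python
# Fraction of experts that are active per layer in a mixture-of-experts model.
def active_ratio(active_experts: int, total_experts: int) -> float:
    return active_experts / total_experts

# Early Mistral-style MoE: 2 of 8 experts per layer.
print(f"2 of 8   -> {active_ratio(2, 8):.0%} active")
# The 4-of-128 figure mentioned above: 1 in 32, i.e. a factor of 32.
print(f"4 of 128 -> 1 in {1 / active_ratio(4, 128):.0f}")
```

Going from 1-in-4 to 1-in-32 routing is what "sparser and sparser" means in practice: far fewer FLOPs per token for a model of the same total size.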
那么展望未来,你认为改进将继续来自哪里?
And then looking forward, where do you think it's gonna continue to come from?
对。所以我认为还会有大约10倍的提升。对吧?我的意思是,如果你只是
Right. So I think there's gotta be another 10x or so. Right? I mean, if you just
到目前为止似乎就是这个趋势。
Seems to be the trend so far.
对,外推一下。
Right, extrapolating.
我只是不确定我们是否已经摘完了低垂的果实,对吧,或者是否还有
I just don't know if we've picked off the low-hanging fruit, right, or if there's still
对。我认为,你知道,作为一个整体,我们确实已经摘了很多低垂的果实。但我认为还有很多事情要做。其中之一实际上是在硬件方面,有一段时间因为无法真正预测两年后的工作负载会是什么样子,所以很难进行很好的专门化。但随着架构变得稍微稳定一些,我认为设计芯片的人现在正在为推理进行优化,比如低精度,获得真正好的原生硬件支持。
Right. I think, you know, as a whole, we've picked off a lot of low-hanging fruit for sure. But I think there's still lots to do. One is actually on the hardware side, which is, for a while, because you couldn't really predict what the workload was going to be two years from now, you couldn't specialize very well. But as the architecture has gotten a little bit more stable, I think people designing chips are optimizing for inference now. So low precision gets really good native hardware support.
所有网络连接变得非常好,因为随着模型变大,像专家混合这样的技术,模型规模增大但不会激活所有参数。你可能需要多个GPU或芯片来存储模型,这时网络连接就变得重要得多。所以在硬件方面,我认为一年内可能会有两到三倍的提升。在软件方面,我认为有几个方向。一是推动模型架构的创新。
All the networking gets really good, because as the model gets larger, with things like mixture of experts, the models get larger but you don't activate all the parameters. You might need multiple GPUs or chips to store the model, and then networking becomes much more important. So I think on the hardware side, it would probably be two to 3x in just one year. And on the software side, I think a couple of things. One is pushing on the model architecture.
所以我研究过像Mamba这样的技术,它不是将整个历史作为KV缓存存储,而是让模型将历史压缩成更小的状态向量。当然这会有一些权衡取舍,但我们在这方面取得了相当不错的成果,我认为这对大批量推理尤其重要,这正是当前一些工作负载的需求,比如测试时计算和推理,你真正希望的是模型能同时探索多条轨迹或思维链。我想谷歌提出的Gemini DeepThink就是这个理念,他们在国际数学奥林匹克竞赛中获奖就是通过同时探索多条路径。这意味着你需要同时进行数百个序列的模型推理,这时KV缓存就成了一个更大得多的问题。
So I've worked on things like Mamba; it's a way to, instead of storing the entire history as a KV cache, have the model compress that history into a smaller state vector. Of course, that's going to have certain trade-offs and so on, but we've seen pretty good success with this, and I think this is going to be especially important for large-batch inference, which is what some of the workloads are demanding these days, where with things like test-time compute and reasoning, what you really, really want is for the model to explore lots of trajectories or chains of thought at the same time. I think Google came up with this thing, Gemini DeepThink, which was that idea. I mean, they were winning IMO gold medals and so on by exploring multiple paths at the same time. And what that means is that you're doing model inference on hundreds of sequences at the same time, and there, the KV cache becomes a much, much bigger problem.
所以你可以尝试让模型压缩KV缓存。这就是我们开发Mamba的方向,基于这个理念已经出现了许多其他变体。所以在模型方面,我认为会有两到三倍的提升。对吧?
So you could try and get the model to sort of compress that KV cache. That was sort of the direction we were going with Mamba, and there's been a bunch of other variants coming out along that idea. So I think on the model side, it's going to be two, three x. Yeah. Right?
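The reason a compressed state helps so much for large-batch reasoning is a scaling difference: a KV cache grows linearly with history, while a Mamba-style state is fixed-size. The byte counts below are illustrative assumptions, not any specific model's numbers:

```python
# KV-cache memory grows with sequence length; a fixed-size recurrent state does not.
def kv_cache_bytes(seq_len: int, per_token_bytes: int = 65536) -> int:
    return seq_len * per_token_bytes      # grows linearly with the history

def state_bytes(state_size: int = 4 * 1024 * 1024) -> int:
    return state_size                     # constant, regardless of history length

for seq_len in (1_000, 10_000, 100_000):
    kv = kv_cache_bytes(seq_len) / 1e6
    st = state_bytes() / 1e6
    print(f"{seq_len:>7} tokens: KV cache {kv:>8.1f} MB vs fixed state {st:.1f} MB")
```

Multiply the per-sequence cost by hundreds of parallel chains of thought and the linear-growth term dominates the memory budget, which is exactly the large-batch problem described above.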
然后在内核层面,很多人对内核方面非常感兴趣,其中一些人非常有天赋。所以我们正在开发非常优秀的内核。这可能还会带来约2倍的提升。综合来看,我认为即使仅仅一年内,我们可能还能获得10倍的提升。
And then on just kernel level, lots of people are getting really interested in the kernel side, and some of them are very, very talented. So we're getting very, very good kernels. And that's probably another 2x or so. So taken together, I think even in probably just one year, we would probably get another 10x.
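Taking rough midpoints of the per-layer estimates quoted here (hardware ~2-3x, model architecture ~2-3x, kernels ~2x), the gains compound multiplicatively:

```python
# Compounding the rough per-layer speedup estimates (midpoints assumed).
hardware, model, kernels = 2.5, 2.5, 2.0
combined = hardware * model * kernels
print(f"combined speedup: ~{combined:.1f}x")  # on the order of the 10x claim
```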
让我印象深刻的是,随着使用场景的多样化,人们想要运行推理的内容以及这些模型的架构,似乎已经扩展到许多不同的最终使用场景,每个场景的瓶颈或你想要进行的优化都取决于具体情况。
One thing I'm struck by is, obviously with this broader diversifying of use cases, what people wanna run inference on and the architecture of these models, it seems like it's kind of expanded: there are many different end use cases, and the bottleneck of each of those, or the optimization you wanna make for each of those, depends on the use case.
完全正确。
Exactly.
那么,这对作为一个推理服务提供商来说意味着什么?显然你想要在所有方面都做到最好。你认为生态系统会继续保持由一个在所有方面都很出色的供应商主导,还是随着时间的推移会出现专业化分工,比如如果你想要进行深度研究,有一个专门擅长这个的推理提供商,而如果你想要运行更像聊天机器人的使用场景?你怎么看待这种发展趋势?
And so, I guess, what are the implications of that, maybe to bring it back to Together as an inference provider? Obviously, you want to be the best at all of those things. Yeah. Do you think that the ecosystem remains with one vendor who is really good at all those things, or do you see specialization over time, where it's like, oh, hey, cool, if you're trying to run deep research, there's one inference provider that's amazing at that, and if you're trying to run more of a chatbot use case... How do you see that kind of playing out?
对,是的。我认为可能会有三种工作负载模式,所有推理服务提供商都会理解这一点并为此进行优化。但规模化运行有某些优势,对吧?在工作负载方面,有传统的聊天机器人,需要交互性的平衡,不能太慢但也不需要超级快。
Right, yeah. I think there will probably be maybe three kinds of workload patterns, and I think all inference providers will understand this, and they will optimize for it. But there's certain advantages to running at scale, right? So on the workload side, there's the traditional chatbot where you need a mix of interactivity, so you can't be too slow, but you also don't need to be super fast.
有时候如果回复太快反而有点诡异。你希望体验上感觉至少另一端有个人在稍微思考一下
Sometimes it's a little creepy if it comes right away. You want to feel, for the experience, like there's someone at least thinking a little bit on the
另一端。对,对。所以有那种工作负载。还有那种对延迟要求非常低的情况。比如我花了很多时间使用Cloud Code,我总是希望,如果这个能快3倍或5倍,我当然愿意付更多钱,对吧?
other end. Right, right. So there's that kind of workload. Then there's the kind with really low latency requirements. Things like, I spend a lot of time with Claude Code, so I always wish, hey, if this were 3x faster or 5x faster, I'm certainly willing to pay more, right?
因此我们会看到更多这样的情况,这是一种非常互动的模式,如果模型速度提升2倍,或者推理速度提升2倍,那么使用模型的人工作效率也会提升2倍。
And so we'll see more of that. It's kind of very interactive, where if the model is 2x faster, or if the inference is 2x faster, the person working with the model is gonna be 2x more productive.
是的,这样每个人都能保持心流状态,
Yeah, so everyone's gonna stay in flow state and
没错,没错。我的意思是,人们正在通过运行多个模型来解决这个问题,同时运行四个云端代码。但对我这种偏好深度工作的人来说,我通常只用一个模型,为此我搭档经常说我。她总说你应该同时使用四个云端代码。但对于这类工作负载,人们可能更愿意为低延迟支付更高费用。这就是低延迟、类似智能体类型的工作负载。
Right, right. I mean, people are getting around this problem by running multiple of these models, having four Claude Codes running at the same time. But for someone like me who personally prefers deep work, when I work with the model, I usually just use one, which my partner yells at me for. She's like, you should be using four Claude Codes at the same time. But for this kind of workload, people probably will be willing to pay more for the low latency. So that's the low-latency, sort of agentic kind of workload.
然后还有大批量处理。我不太关心延迟,只想要尽可能高的吞吐量。这对于生成合成数据等工作非常重要。正如我提到的,现在很多人训练模型的方式是:他们拥有少量专家级数据或人工标注数据。比如说在航空公司,你试图构建一个AI代理来处理客户投诉或冲突等。
And then there is the very large batch: I don't care so much about latency, I just want as high throughput as possible. So this is important for things like generating synthetic data. As I mentioned, a lot of the way people are training models now is that they have a small amount of expert-level data, or human annotation data. I don't know, let's say you're in airlines, and you're trying to build an AI agent to resolve customer complaints or conflicts and so on.
他们拥有少量高质量数据。然后你可以从中生成海量合成数据。模型模仿人类的能力真是令人惊叹。你可以说:嘿,你能扮演一个来自纽约的乘客吗?他因为从拉瓜迪亚机场起飞的航班延误而恼火。随便举个例子。
They have a small amount of really good data. And then you can generate tons of synthetic data out of that. It's kind of crazy how good the models are at acting like humans. You can say, hey, can you act as if you're this customer from New York who is annoyed that their flight is delayed out of LaGuardia? I don't know, right?
模型在模仿人类方面出奇地出色。
And models are surprisingly good at acting as humans.
是啊,我有点害怕我们教会了它们像机场里愤怒的纽约人那样发脾气。这可不是人类最美好的一面。
Yeah, I'm scared that we've taught them the capabilities to be angry New Yorkers at an airport. That's not like humanity at its finest.
没错。但不知为何,互联网上大量这样的数据让模型学会了这种行为,对吧?然后模型就可以利用这些数据。它们内部有这种世界模型,可以生成大量数据——可能不如人类数据那么好,但你可以生成海量数据,对吧?对于那种用例,推理用例,你真正关心的只有吞吐量。另一个是强化学习训练这类用例,一方面你在训练智能体执行某项任务,比如改变策略,但训练循环的部分环节是:当模型有了某种策略后,你需要评估这个策略的好坏,对吧?
Right, yeah. But somehow there's lots of that data on the internet, where the model has learned that, right? And the model can then use that. They have this kind of world model internally, and they can then generate lots of data that's probably not as good as human data, but you can generate a massive amount, right? And for that kind of inference use case, you really only care about throughput. The other one is the RL training kind of use case, where on one hand you're training an agent to do something, you know, changing its policy, but part of the training loop is that once the model has some policy, you need to evaluate how good that policy is, right?
比如说我在训练一个AI工程师,我如何知道当前AI工程师的水平?这就需要我从模型中采样,这被称为rollout(展开)。所以你从模型中采样大量完成结果,然后评估其质量,对吧?因此作为训练的一部分,你确实需要非常大批次、高吞吐量的推理。所以我认为这是第三个用例——超大批次处理。对于这三个用例,我认为人们开始认识到这些模式。
So let's say I'm training an AI engineer. How would I know how good the current AI engineer is? That requires me to sample from the model; it's called a rollout. So you sample tons of completions from the model, and then evaluate how good they are, right? And so as part of this training, you really need very, very large-batch, high-throughput inference. So I would say that's the third use case, very large batch. And for these three use cases, I think people are starting to recognize these patterns.
作为推理服务提供商,我们当然会进行不同的优化。
As inference providers, we certainly do sort of different optimizations.
对于在这三者之间分配资源,你是怎么考虑的?显然,我相信有不同的优化方案,对吧?
How do you think about allocating your resources across the three? Obviously, I'm sure there's different optimizations, right, for each.
所以我认为这可能是大规模运行真正有帮助的地方。我们称之为集群级别的优化,当你在数千个GPU上进行推理时,可以动态调整集群分配。举个简单例子,比如运行批量推理、批量API。OpenAI有这个选项,我们也有。如果我们发现集群在处理这些交互式查询方面不太繁忙,就可以发送这些批量查询来充分利用计算资源,对吧?
So I would say this is where maybe running at scale really helps. We call it fleet-level optimization, where if you're running inference on thousands of GPUs, you can sort of dynamically change the cluster allocation. So an easy, simple example is running batch inference, the batch API. OpenAI has an option for that. We also have an option for that, where if we see that the cluster is not so busy with the more interactive queries, we can just send in these batch queries that can sort of soak up the compute, right?
因此,我们可以提供——我想我们给批量API大约50%的折扣,OpenAI应该也是这样做的,对吧?DeepSeek可能也在这么做。
And as a result, we can provide, I think we give, like, a 50% discount on the batch API. I think OpenAI does the same, right? Right. DeepSeek probably does that too.
是的。大规模运行确实非常有帮助。对吧。当你思考推理市场的发展时,是否感觉有无限的优化空间,总是能稍微领先其他玩家?还是说在某些时候,容易实现的优化都已经完成,现在关键是构建更广泛的平台来在这些工作负载之上做其他事情?
Yeah. Having it at scale really, really helps. Right. Right. As you think about the evolution of the inference market, does it feel like there's endless optimizations to do over time, so it's always possible to stay a little bit ahead of the rest of the players? Or at some point, has the low-hanging fruit been plucked on each of these, and it's actually about building a broader platform to do other things on top of these workloads?
早期确实有很多容易实现的优化,如果你编写合理的内核,构建合理的推理引擎,就能获得比现有方案好得多的结果。但现在开源工具变得非常非常好,像vLLM和SGLang这样的项目非常受欢迎,已经达到生产级质量。我们当然与这些团队合作并做出贡献。所以基础水平已经大幅提升,但同时也有越来越多新的用例出现。
Early on, there was lots of low-hanging fruit, and if you wrote reasonable kernels, if you built a reasonable inference engine, you got much, much better results than what was out there. But now I think the open-source tooling is getting very, very good. Things like vLLM and SGLang, some of these projects are very, very popular, and they've gotten to production-level quality. And we certainly work with these folks and contribute and so on. So I think just the baseline has gotten so much better. But there are also just newer and newer use cases.
客户来找我们说,他们非常关注前缀缓存,或者非常注重低延迟,因为他们的应用有这个需求。比如,他们实际上不是在处理文本,而是在处理视频,这需要不同的权衡,更注重吞吐量,我们与这类客户合作。所以即使开源工具和库变得如此优秀,工作负载的演进速度也非常快,总是有新的挑战。模型变得如此强大,现在有太多方式可以从这些模型中提取价值,这就是为什么我们看到这么多初创公司基于这些模型进行开发。结果就是工作负载会快速演变。
So customers come to us, and they say, oh, we really care about prefix caching, or we really care about low latency because in our application we have this requirement. Or, hey, we're actually not doing text, we're doing video, and that's sort of a different trade-off, much more throughput-oriented, and we work with some of those customers. So even as the open-source tooling and libraries are getting so much better, I think the workloads are evolving so fast that there's always sort of new stuff to do. And there's this kind of explosion: the models are getting so good now that there are so many ways to extract value out of these models. That's why we see so many startups building on these models. And as a result, the workloads are gonna evolve pretty rapidly.
那么你认为,显然我们现在有这三类主要工作负载。展望一两年后,会不会发展成27种或大量不同的方式,每种都需要自己的优化方案?
So do you think, obviously we have these three main categories of workloads today, do you feel like, fast-forwarding a year or two from now, it's like 27, or a ton of different ways, each requiring its own kind of optimizations?
我认为还是会趋于整合的。自主代理型工作负载可能会成为杀手级应用。ChatGPT在应用层面是一个阶跃式变化,那是用户第一次接触语言模型,可以与之对话,帮助调试代码、查询信息、分析和综合信息。
I think it's still gonna coalesce, right? I think the kind of agentic workload might be a killer use case, maybe a step change. ChatGPT was a step change in terms of application. That was sort of the first time users were exposed to language models they could talk to, that could maybe help them debug their code, or look up information, and analyze and synthesize information.
但我认为下一波应用将会非常庞大:AI能否自主采取行动和收集信息?这将需要不同类型的优化。现在不仅仅是在GPU上快速运行模型,还要解决如何连接人类通常使用的工具的问题,网络搜索是个简单例子。随着这些工作负载进入不同垂直领域,比如你是工程师,在设计飞机之类的。
But I think the next set of applications is going to be gigantic, which is, hey, can AI, can these models sort of take actions and collect information by themselves? And that is going to require sort of different sets of optimizations. So now it's not just running the models really, really fast on GPUs, it's how do you connect with the tools that humans usually use. Web search is an easy example. But as these workloads move to different verticals, let's say you're an engineer, I don't know, you're designing airplanes or something, right?
你会希望模型能够访问你的设计软件。
You want the model to be able to access your design software.
没错。
Right.
或者如果你是一名金融分析师,你会希望模型能够接入某些特定数据库,对吧?所以我认为这类工作负载将成为未来一年左右我们的主要工作负载之一。这是我的预测。
Or if you're a financial analyst, you have certain databases that you want the model to tap into, right? So I think this kind of workload will be one of the main workloads that we have in the next year or so. That's my prediction.
围绕智能体的系统级工作显然会带来一系列全新的待解决问题。因此,即使原始工作负载已有优秀的开源工具,现在又出现了完全不同的新工作负载,这些也需要优化。进行系统级工作、接入外部数据库等任务,与单纯优化推理不同,这似乎是你们产品自然演进的方向。
And that systems-level work around agents then obviously becomes a whole new set of problems to solve. So it feels like even if, for the original set of workloads, you know, there's great open-source tooling, there's now an entirely new set of workloads that also require optimizations. And getting into some of that systems work, accessing external databases and other things, is different than just optimizing inference; it seems like a natural evolution of the product you guys are providing.
是的,是的。我认为这更像是专业级或企业级的主要应用场景。在消费者层面,我的一个预测是我们将实现实时视频生成。没错,我们已经开始看到一些迹象了。
Yeah, yeah. I think that's gonna be the main use case at more like the professional level or enterprise level. On the consumer level, I think one of my bets is that we'll get real-time video generation. Yeah. And we're starting to see some of that.
我认为这将彻底改变消费者领域的格局,就像TikTok彻底改变了内容生态一样——无论好坏。但这确实非常具有吸引力,对吧?在我们合作的Pika Labs和Hedra等公司,他们正专注于实时视频生成。这就是我们的赌注。
And I think it's going to really change sort of the consumer landscape, the way TikTok really changed that landscape, you know, for better or for worse. I don't know. But it's very, very engaging, right? And so on the consumer side, I think some of the companies we work with, like Pika Labs and Hedra, are focusing on real-time video generation. That's the bet. Right.
这也带来了一系列全新的待解决问题。
There's a whole new set of problems to solve, too.
没错。这意味着需要更庞大的计算资源。视频生成对计算能力的要求极高,这可能会推动芯片需求和推理优化的进一步发展。
Right, right. And so it's just a lot more compute. Video generation's very, very demanding on the compute side. So that might be fueling even more chips, even more inference optimization.
非常有意思。你们正在进行大量研究并提前布局,但行业方向始终存在重大未知问题。我很好奇:如果能够穿越到三年后,获得一个关于AI基础设施领域问题的答案,从而帮助你们明确当前方向——那个你最想知道、或对你工作方向影响最大的未解之谜是什么?
Super interesting. I mean, obviously you're conducting a lot of research and making these bets ahead of time; there are always big outstanding questions of where the space is headed. I'm curious, if you could fast-forward three years into the future and get the answer to one question in the AI infrastructure world that would really help you shape your direction today, what's that outstanding question that you're most curious to know, or that would most impact the direction of your work?
未来几年我试图回答的问题是:如何让AI达到专家水平?目前模型在某些任务上处于人类中等水平。前端编程它们已经相当出色——说实话肯定比我强得多。或者在数据分析这类网络数据丰富的任务上
The question I'm trying to answer for the next couple years is: how do we get AI to expert level? Right now, I think the models are at the median human level on some of the tasks. Front-end programming, they've gotten quite good at. Actually, the models are much better than me at front-end programming, for sure. Or analyzing data, that kind of task where there's lots of data on the internet.
只要网络数据充足,这些模型就能轻松胜任。它们在某些任务上达到中等或略高于平均水平,但经济价值更高的任务仍然存在。人们愿意花重金聘请人类专家,比如飞机设计、硬件设计、医生、律师等领域。这些专家之所以成为专家,是因为他们长期使用专业工具进行实践。
Whenever there's lots of data on the internet, these models are going to just crush it. So they've gotten to sort of the median, or maybe slightly above average, level on some of these tasks, but I think the economically valuable tasks are still there. People pay lots of money for human experts in, I don't know, designing airplanes or designing hardware, or doctors and lawyers and so on. Those people become experts because they spend time working with specialized tools. Or
他们没有足够互联网数据量,所以才是专家。
they don't have an internet's worth of data; that's why they're experts.
没错。那么我们如何让模型达到那种水平,与人类并肩工作?我认为这正是经济价值的主要来源。
Right. So how can we get models to that level, to work alongside humans at that level? I think that's where a lot of the economic value is gonna come from. And
显然接下来需要围绕实现方式做硬件优化。你研究过状态空间模型,花了很多时间思考替代架构、Transformer。你的合作者Albert Gu说过Transformer本身不会是最终解决方案。你认为我们需要架构创新来达到那个水平吗?
obviously then there are the hardware optimizations you have to do around whatever the way to do that is. You've worked on state space models. You've spent a lot of time thinking about alternative architectures to transformers. I think your collaborator, Albert Gu, said that transformers are not gonna be the final solution by themselves. Do you think we need architectural innovation to get us to that level?
对,我很欣赏Albert,但这里我们可能有点分歧。我认为要实现AGI或ASI,现有架构可能就足够了,但代价是什么?如果有更好的架构,或许能提前一两年实现。
Right. I love Albert, but maybe this is where we disagree a little bit, right? I think to get to AGI or ASI, it's possible that the current architecture we have is sufficient, right? But at what cost? Which is, hey, if you have a better architecture, maybe you get there, I don't know, one or two years sooner.
我觉得这很值得。或者用十分之一的成本就能实现,对吧?
I think that's probably worth it. Or you get there with 10x less of the cost, right?
这最终会影响到ASI实现时的可及性。
Which ultimately flows to the accessibility of that ASI when you get there.
没错。这至关重要,因为我们每年在AI基础设施上投入约5000亿美元(别直接引用这个数字,但数量级差不多)。我们需要投入十倍资金吗?这似乎不现实。还是通过更好的架构,用现有投入甚至更少就能实现?
Right, right. And that's certainly crucial, because we're spending, I don't know, $500 billion on AI infrastructure every year. Don't quote me on that, but I think that's sort of the right ballpark, right? So do we need to spend 10x that, which seems somewhat unrealistic? Or, with better architecture, do we get there with the current amount of spending, or maybe even less?
所以现代架构将发挥关键作用:能否通过更好架构实现AGI?现有架构已包含所有必要元素,持续扩展确实可能实现,但成本可能极其高昂。
So I think that's where the model architecture is going to play a role, which is, can we get to AGI with a better architecture? I think the current architectures sort of have all the right ingredients, and if you keep scaling, which people have been doing, you could get there, but maybe the cost is astronomical.
你还关注哪些其他架构?
What other architectures are you paying attention to?
我对混合专家模型特别感兴趣,尤其是越来越稀疏化的方向。我们正在探索稀疏化的极限,DeepSeek的研究表明可以实现极高稀疏度,DeepMind早期也有相关研究。
Right, so I'm pretty excited about mixture of experts, especially getting sparser and sparser. We're trying to push that limit, like how sparse can you go, and we're working on that. I think it's been a compelling direction. There's some really important work done by DeepSeek showing that you can make these things really, really sparse. There was even earlier work from DeepMind pushing in that direction as well.
所以我认为这是一种很有说服力的方式,可以在相同计算量下获得更多智能。最终,我认为我们希望优化每美元的推理成本。这意味着可以将其分解为每浮点运算(FLOP)的推理效率,以及每美元的浮点运算量——前者更关乎架构、设计、数据和算法,而后者则涉及硬件和内核优化。因此在架构方面,我们正试图从相同的计算中提取尽可能多的信息(抱歉,是智能)。
So I think that's one compelling way to sort of get more intelligence out for the same compute. Ultimately, I think we want to optimize inference per dollar. And that can be factored into inference per FLOP, floating-point operation, and FLOPs per dollar. Inference per FLOP is more about architecture, design, data, algorithms. And the other side, FLOPs per dollar, is hardware and kernel optimization. So on the architecture side, we're trying to extract as much information, sorry, as much intelligence out of the same computation.
混合专家模型是其中之一。我认为Albert在状态空间模型方面的工作非常非常有趣,而且大家正在采纳这些成果。我们与NVIDIA的许多同事合作训练模型,他们发布了多个案例展示如何用这类架构获得极高品质的模型。这是Transformer和Mamba等的混合体,但成本更低或推理性能大幅提升。
So mixture of experts is one. I think some of the state space stuff done with Albert has been really, really fun, and people are picking that up. We've collaborated with a bunch of folks from NVIDIA, actually training models. They've released several of these where they show, hey, they can get a really, really high-quality model using this kind of architecture. It's a hybrid of transformer and Mamba and so on, but with much smaller cost, or much, much higher inference performance.
我认为架构对推理至关重要,这正是架构研究的价值所在。如今我高度关注'推理优先的架构设计',因为大部分计算资源都消耗在推理上,所以必须设计出能让推理表现极其出色的架构。
I think architecture is very important for inference. I think that's the utility of architecture research. And nowadays I think very much about what I call inference-first architecture design, which is: most of the FLOPs are being spent on inference anyway, so you really, really wanna design architectures that make inference really, really good.
显然您在推进Flash Attention推理类型和开创Mamba状态空间模型方面做出了贡献。大家肯定好奇:您现在关注哪些研究领域?时间主要投入在何处?下一篇重磅论文会关于什么?
Well, I'm sure, obviously, you've pushed forward the inference side with Flash Attention, and with Mamba, you've been pioneering state space models. I'm sure folks are curious: what research areas are interesting to you now? Where are you spending your time? What's the next big paper we should expect from you?
没错,我仍然深耕这些领域,但也在探索新方向,包括研究下一批具有重大影响的应用场景。我认为机器人技术就是其中之一。
Right. Yeah, I'm still certainly working in these areas, I'm still very much in these areas. But I'm exploring some new directions, some of which involve working on what the next set of really impactful applications is. And I think robotics is one of them, right?
机器人技术——我们距离实现能在家庭中高效工作的人形机器人还有多远?五年?十年?确实难以预测。这通常意味着研究层面存在许多值得探索的精彩问题。
And robotics, like how far are we away from having really good humanoid robots that work in homes? Maybe five years, maybe ten years? Yeah. I don't know, right? And that usually means a lot of interesting and exciting questions on the research side that we wanna answer.
所以这是研究层面的一个方向
So that's a direction on the research side
是的
Yeah,
在机器人研究领域中最吸引您的是什么?现有基础模型可以初始化控制机器人的模型,语言模型能进行任务规划(比如让机器人取咖啡杯时会说'走到桌边拿起咖啡')。但缺失的是与世界交互和行动的能力。
what's most interesting to you within the robotics research area? So for robots, we can initialize a model controlling a robot with some of the foundation models that we have already. You can use a language model to do the planning. Like, if you ask a robot to pick up a coffee cup, then the language model can say, walk over to that table and pick up the coffee, and so on. But what's missing, I think, is this kind of interacting and acting in the world.
对吧?我们恰恰缺乏这类数据。虽然我们拥有语言数据...
Right? We just don't have data for that. We have language data on
没错,而且你显然看到一些人尝试扩展仿真数据。另一些人则在操作遥操作,但这显然是一个实际执行上的数据问题。
Right, and you've obviously seen some folks trying to scale simulation data. Other folks are doing teleop, but clearly it's a data problem on the actual actuation.
所以这肯定是一个方面。另一个方面是,我认为机器人需要以多分辨率、多时间尺度的方式处理信息,其中一些是,比如控制关节时必须非常非常快速地行动。但如果是为机器人规划路线,就可以慢得多地进行。
Right, so that's certainly one. The other one is, I think, robots kind of have to process information in a sort of multi-resolution, multi-time-scale way, where some of it, you know, if you're controlling the joints, you have to act very, very, very rapidly. But if you're planning a route for the robot, you can do it much, much slower.
所以这就像你可能想要的不同架构,我的意思是,这似乎与你做过的某些状态空间模型工作非常相似。
So it's like different architectures you might want. I mean, I'm struck that that seems very analogous to some of the state space model work you've done.
对,对。所以明确考虑时间尺度,我是想做非常轻量级的计算来控制关节,还是想做更重量级的推理来规划机器人的最优路线?我认为这将是一种由语言模型、视觉模型、音频模型、世界模型初始化的复合系统,但如何将它们整合在一起?我认为这是个大问题。
Right, right. So explicitly taking time scale into account: do I want to do very lightweight computation to just control the joints, or do I want to do much heavier-weight reasoning to plan the optimal route for the robot? So I think it's going to be this sort of composite system initialized from language models, vision models, audio models, world models, but how do you stitch them together? I think that's the big question.
嗯,要能在本地足够快地运行这些东西,还有一些引人入胜的硬件问题。是啊,是啊。我感觉你几乎可以在任何你想的地方工作,而且我认为一件非常有趣的事是,你显然,你知道,是Together的首席科学家,但你也走了学术路线,是普林斯顿大学的教授。
Well, there are some fascinating hardware questions in being able to run this stuff locally and fast enough. Yeah. Yeah. I feel like you could probably work anywhere you want, and I think one really interesting thing is, obviously, you know, you're chief scientist over at Together, but you've also gone the academic route as well, and you're a professor over at Princeton.
你是如何考虑既留在学术界,同时又研究那些在机器人领域有意义的问题的?比如,显然你可以去Physical Intelligence工作,或者与一些机器人公司合作,或者在学术环境中进行研究。你如何看待在学术界与工业界分别从事哪些工作是有意义的?
How did you think about staying within academia, and also the set of problems that it makes sense to work on there, you know, in this robotics space? Like, obviously you could go work at Physical Intelligence, or partner with some of these robotics companies, or do the work from an academic setting. How do you think about what makes sense to work on within academia versus industry?
对,对。是的,这是个很好的问题。我的答案会是个人的,对吧?对每个人来说都会不同。就我个人而言,我其实很喜欢既在初创公司又在大学当教授。
Right, right. Yeah, that's a great question. My answer is gonna be personal, right? For every person, it's gonna be different. So for me personally, I actually like both being at a startup and being a professor.
这呈现出不同的思维和执行模式。在初创公司这边,非常有趣,因为我们行动迅速,对吧?如果我们想做什么,几天或几周,最多几个月就能完成。这就是我们规划的时间范围,对吧?这非常有趣。
It sort of presents different modes of thinking and execution. On the startup side, it's very fun because we move fast, right? If there's something we want to do, we can get it done in days or weeks, or at most months. That's the horizon we plan for, right? And it's very fun.
你有一个非常非常优秀的团队。他们执行速度极快,能成就伟大的事情。我为团队在Together所做的工作感到非常非常自豪。所以这一直非常非常有趣。在学术界这边,我想时间跨度要长一些,我们思考的问题可能更具推测性。
You get a team of really, really good people who can execute very, very fast and accomplish great things. And I'm very, very proud of the work that the team has done at Together. So that's been very, very fun. On the academia side, I guess it's a little bit longer time horizon, and the kinds of problems we think about are maybe more speculative.
我们不解决需要在一个月内解决的问题。我们思考的是,哦,如果世界在两三年内朝那个方向发展,有哪些有趣的问题要提出,以及在一两年内要解决哪些挑战?我们以那种时间尺度思考。所以与学生一起工作也非常有趣和有益,因为我们可以思考这些本身就很有趣的问题,对吧?显然,也存在一些权衡,比如学术界的计算资源要少得多。
We don't solve problems that need a solution within a month. We think about, oh, if the world is moving in that direction in two, three years, what are the interesting questions to ask, and what are the challenges to solve in one or two years? We think on that time scale. So working with students has been very fun and rewarding as well, because we get to think about questions that are just intrinsically interesting, right? Obviously, there are certain trade-offs, like the amount of compute in academia is much smaller.
你的评估方式是看你的想法是否有趣,而不是它是否运行得很快等等,对吧?那是另一套评估标准。所以我认为在学术界,有更多自由去深入思考这些更长远的问题。而我恰好两种模式都喜欢,这就是为什么我仍然在普林斯顿担任教授。
The way you're evaluated is, are your ideas interesting, rather than, oh, does it work and run really fast and so on? That's a different set of evaluation criteria, right? So I think in academia, you have more freedom to think deeply about these longer-horizon problems. And I happen to like both modes of operation. So that's why I'm still at Princeton as a professor.
我仍然参与创业公司。我认为最终可能是探索与利用的结合。从宏观层面看,学术界的角色更侧重于探索。资金通常来自政府,政府关心的是能否探索大量想法,看看其中可能有5%或10%会成功。
I'm still involved in the startup. And I think ultimately it's maybe a mix of exploration and exploitation. If you think at a macro level, academia's role is more on the exploration side. The funding usually comes from the government, and the government cares about, can you explore tons and tons of ideas, where maybe five or 10% of those ideas will work out.
我的意思是,作为风险投资人,你也处于类似的模式,即我们要探索很多想法,其中可能有5%或10%会变得极其重要。注意力机制就是一个例子——尽管因谷歌论文而闻名,但它实际上源自Mila研究院Dzmitry Bahdanau、Yoshua Bengio等人的工作,是从学术界产生的,对吧?
I mean, as an investor, as a VC, that's kind of the mode you operate in as well, which is, hey, we're going to explore lots of ideas, and maybe five or 10% of these ideas will become incredibly important. One example is attention. Attention, even though it got famous with the Google paper, actually came out of work at Mila, with Dzmitry Bahdanau and Yoshua Bengio and other folks. It came out of academia, right?
如果你想想当前架构中的所有其他组件:优化器Adam是多伦多的Jimmy Ba等人提出的,Layer Norm也是Jimmy的工作。本质上,我们现在拥有的很多东西都来自学术界,因为学术界的角色就是探索大量想法,其中一些会成功。然后大公司和初创企业的角色是采纳这些想法进行创新,快速执行,但他们更了解市场需求和应该构建什么产品。
And if you think of all the other components in the current architecture: the optimizer, the Adam optimizer, was Jimmy Ba at Toronto and other folks, right? Layer Norm, also Jimmy, right? Essentially, a lot of what we have now came out of academia, because the role of academia is to explore lots and lots of ideas, and some of them will work, right? And then the role of big companies and startups is that they take some of these ideas, they innovate on them, they execute really, really fast, but they also understand what the market needs and what products they should build.
他们拥有更强大的财力来推动其中一些更大的想法实现。
And they have much, much bigger financial power to make some of these bigger ideas happen.
对的,当前融资环境的有趣之处在于,我觉得它几乎扭曲了这种分工——现在有大量风险资金可供人们建立类似研究实验室的机构。比如如果你提出'机器人技术未来需要不同架构和硬件',可能会有人说:'好,虽然十年内无法盈利,但这是个有趣的想法,给你一笔风险投资去实现吧'。
Right, and the interesting thing about this funding environment is I feel like it's almost distorted some of that, where there's plenty of venture funding available for folks to do something more like a research lab. Like, if you pitched, hey, robotics is gonna need different architectures and different hardware over the years, someone might say, okay, cool, you're not gonna monetize for ten years, but that's a really interesting idea, and here's a bunch of venture money to go do it.
没错,人们获得的融资规模确实有些疯狂。像Ilya的SSI就明确表示不会构建任何产品,但人们就因为他是Ilya而愿意投钱。正如你所说,机器人技术领域的一些人也在这样做。AI领域的一些风险开始获得回报,所以投资方更愿意投入资金。
Right, yeah. It's kinda crazy, the kind of funding that folks are getting, right? Like, with Ilya's SSI, he explicitly said, we're not gonna build any product, but yeah, people wanna give him money because he's Ilya, for sure. And some of the folks, like you said, in robotics are doing that. So yeah, I think in AI it seems like some of these risks are starting to pay off, so people on the investing side are much, much more willing to put in money.
这真是一次精彩的对话。我们总是喜欢在访谈结束时进行快速问答环节。首先,过去一年里你在AI领域改变过什么看法?
Well, this has been a fascinating conversation. We always like to end interviews with a quick-fire round to get your take on some things. So, maybe to start, what's one thing you've changed your mind on in AI in the last year?
这些模型即使对我的高级专家日常工作也出乎意料地有用——它们在数学和编程方面表现得异常出色。
These models are surprisingly useful even for my daily work at a sort of advanced, expert level. Like, they're surprisingly good at math and coding.
1.5倍这个数字远高于我的预期,令人印象深刻。你认为一年后开源模型与闭源模型的质量差距会缩小还是扩大?
Yeah. 1.5x is way higher than I would've expected you to say. It's pretty impressive. Do you think open source models are gonna be closer or further in quality from closed source models in a year?
我会说更接近了。我认为现在的扩展更多是在强化学习方面,而这实际上更依赖于工具而非原始算力。所以我认为开源在这方面会表现得非常出色。
I would say closer. I think the scaling now is more on the RL side, and that relies actually more on tooling than just raw compute. So I think open source is gonna do really well there.
在更广泛的AI领域发展中,你认为哪些方面目前没有得到人们足够的关注?
What development in kind of the broader AI space do you think people aren't paying enough attention to right now?
绝对是数据。我认为数据总是有点被低估。数据方面已经发生了很多变化,比如合成数据、使用模型进行重述,这些都具有巨大影响,但可能人们关注得还不够。
Definitely data. I think data is always a little bit underhyped. A lot has happened on the data side, synthetic data, using models to rephrase, that has had a huge impact that maybe people haven't paid enough attention to.
你有没有见过基于Together构建的特别喜欢的应用?
Do you have a favorite application that you've seen built on Together?
有的,我们与一些视频生成公司如Pika和Hedra合作过,他们使用在Together上训练并推理的模型生成病毒式的TikTok视频,效果非常惊人。
Yeah, we've worked with some of these video generation companies like Pika and Hedra, and they're generating viral TikTok videos with models they train on Together and do inference on Together, so it's been amazing.
是啊,这感觉未来会有很大发展空间。太精彩了,这次对话非常有趣。我想把最后的话留给你。大家可以去哪里了解更多关于你的信息?无论是你在Together的工作、学术研究,你想指引大家去哪里——话筒交给你了。
Yeah, and that feels poised for a lot of development. Well, amazing, this has been a fascinating conversation. I wanna make sure to leave the last word to you. Where can folks go to learn more about you, either your work at Together or your academic work, wherever you wanna point folks? The mic is yours.
当然可以。关于我在Together的工作,我们会在together.ai上发布博客文章。我也在Twitter上,账号是tri_dao。偶尔也会在我的个人网站tridao.me上写博客。
For sure, yeah. So for my work at Together, we put out blog posts on together.ai. I'm also on Twitter, @tri_dao. And once in a while, I write blog posts on my website, tridao.me.
太棒了。Tree,真是收获颇丰。这次对话非常精彩。
Amazing. Well, thanks so much, Tree. This was fascinating.
是啊,这次聊天真的很愉快。谢谢。
Yeah, this has been really fun. Thanks.