本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
欢迎收听Redpoint AI播客的'无监督学习'栏目。我是Patrick Chase,与我一同主持的是Jacob Efron。我们非常荣幸邀请到Matei Zaharia参与本期节目。Matei是Databricks的首席技术官兼联合创始人。
Welcome to Unsupervised Learning, Redpoint's AI podcast. I'm Patrick Chase, and I'm joined by Jacob Efron. We are so excited to have Matei Zaharia on the show. Matei is the CTO and cofounder of Databricks.
在本期节目中,我们将深入探讨Databricks的创立故事、斯坦福大学的最新研究进展(包括其团队正在构建并开源的Demonstrate-Search-Predict框架)、Databricks如何运用大语言模型、大语言模型在搜索领域的缺陷及其改进方案、开源与闭源之争,以及一些趣闻轶事——比如Databricks早期Matei曾为几周工作获得50美元亚马逊礼品卡。这期节目精彩纷呈,感谢您的收听。今天我们无比激动地欢迎Matei Zaharia做客节目。若AI技术领域有总统山,Matei必定位列其中。
In this episode, we dive into the founding story behind Databricks, some of the latest research at Stanford, including the Demonstrate-Search-Predict framework that his group is building and open sourcing, how Databricks is using large language models, the flaws that large language models have, especially when it comes to search, and how to fix them, open source versus closed source, and some fun stories, like Matei getting a $50 Amazon gift card for weeks of work in the early days of Databricks. I think it's a super awesome show. Thanks for joining us. Today, we are so excited to welcome Matei Zaharia to the show. If there were a Mount Rushmore for AI technologists, Matei would definitely be on it.
他是Apache Spark的创造者(该项目是最广泛使用的AI基础设施之一),同时担任Databricks创始人兼CTO。Databricks围绕Spark构建的数据平台已达到惊人规模:八月ARR突破十亿美元,累计融资35亿美元(资方包括a16z、NEA、Coatue等),最近一轮融资后估值达380亿美元。不仅如此,Matei还是斯坦福大学教授,专注于计算机系统与机器学习研究,并领导为期五年的斯坦福DAWN项目——该项目致力于推动AI民主化。
He is the creator of Apache Spark, one of the most widely used AI infrastructure projects, and is the founder and CTO of Databricks. Databricks famously built a data platform around Spark that has reached incredible scale. They passed a billion dollars in ARR in August and raised $3.5 billion in funding from a16z, NEA, Coatue, and more. And their last round of funding put the company at a $38 billion valuation. If that wasn't enough, Matei is also a professor at Stanford focusing on computer systems and machine learning and one of the leaders of the Stanford DAWN project, which is a five-year research project to democratize AI.
Matei,非常感谢你参与节目,我们对此期待已久。
Matei, thank you so, so much for joining us. We're really excited for the show.
非常感谢邀请。
Thanks a lot for having me.
或许我们可以先聊聊你创建Databricks的历程。它最初只是你与Apache Spark的研究项目,如今已被全球7000多家机构采用,包括40%的财富500强企业,这样的影响力范围实在令人惊叹。你是怎么想到创立Databricks的?
Maybe to start, it'd be awesome to talk about your journey building Databricks. You know, it started as your research project with Apache Spark, and now it's used by more than 7,000 organizations worldwide, including 40% of the Fortune 500, which is just really incredible scope that you guys have reached. How did you come to start Databricks?
没错。正如你提到的,这基本上源于我博士期间的研究。当时我在加州大学伯克利分校攻读博士学位,对大规模数据密集型计算非常着迷——那时正是这类技术开始兴起的时候,谷歌等网络公司,后来的Facebook,还有当时的雅虎,这些企业都在索引整个互联网,构建搜索引擎和其他产品。我特别希望能将这类技术普及化,毕竟数据存储和收集的成本并不算高。
Yeah, for sure. So it basically started out of some of my research as a PhD student, as you mentioned. I was doing my PhD at UC Berkeley, and I was really interested in large-scale data-intensive computing, which was just starting to become a thing back then with web companies like Google and later Facebook, and Yahoo back then, all these companies that were indexing the whole web and building search engines and other products. And I was excited about bringing this kind of technology to all users, since collecting data and storing it isn't that expensive.
那为什么不能让所有人都进行真正的大规模计算呢?事实上,当时各行业和科学领域都出现了收集多类型数据的趋势。同时我也热衷于让人们能在数据上运行更复杂的算法——这正是早期MapReduce用户给我的启示。当数据量如此庞大时,你绝不会满足于简单分析,这简直是机器学习的绝佳场景。
So why can't everyone do really large-scale computations? And indeed, there was this trend that in most areas of industry and most areas of science, you could start collecting lots of types of data. I was also excited about letting people run more sophisticated algorithms on it, because that's what I saw from the early MapReduce users. You don't wanna just do analytics, especially when you have so much data. It's perfect for machine learning.
特别是当深度学习兴起后,这完全是无监督学习的理想平台。于是我们先启动了这个快速成长的开源项目,市场需求显而易见。后来我们决定成立公司来真正扩大影响力,因为仅靠无人支持的开源项目能走的路实在有限。当时我们还做了个颇具争议但最终被证明正确的决定:完全采用云服务模式。
It's perfect for unsupervised learning, you know, when deep learning became popular and so on. So, yeah, we started first with this open source project that very quickly grew. It was clear that there was demand for something. And we decided that we should start a company to really have an impact in this space, because you can only get so far with just an open source project with no one backing it. And the other bet we made that was a bit controversial at the time, but turned out to be right, is we decided to go cloud only.
也就是说,用户只能在三大云服务提供商之一的平台上以SaaS形式使用Databricks。那可是2013年,银行等大型企业是否会迁移到云端还不明朗。但这个赌注押对了,它让公司专注于提供只有云端才能实现的卓越体验,最终使我们成为这个领域的主要选择之一。
So you can only use Databricks in the cloud, on one of the three major cloud providers, as a SaaS service on top of that. And this was 2013, so it wasn't obvious that banks and other really big enterprises would move to the cloud. But it turned out to be the right bet, and it really focused the company on delivering an excellent experience there, with things you can only do in the cloud. And it helped us land as one of the big options if you wanna do these things.
是的,太棒了。看到它发展到今天这个程度真是超级酷。我记得你们刚开始的时候,我其实在LinkedIn的机器学习团队工作,我们是Apache Spark的早期使用者之一。我想起来你和Reynold他们来做过一次演讲。
Yeah, amazing. It's super cool just to see how far it's come. I remember when you were just getting started, I actually was on the machine learning team at LinkedIn, and we were some of the early adopters of Apache Spark. I think you and Reynold and you guys came and did a talk.
你们构建的技术以及它如今的发展程度真的令人惊叹。我记得其中很重要的一部分是为了那些复杂算法。你们转向了内存中的训练和计算,这样迭代计算就比Hadoop快得多。机器学习是Spark从一开始就瞄准的主要应用场景吗?还是说随着时间的推移,它才逐渐显现出在模型训练等方面的优势?
And it's just amazing, the technology that you've built and how far it's come. I remember a big part of it was for these sophisticated algorithms: you were doing things in memory, so iterative computation would be a lot faster than on Hadoop. Was machine learning a target use case for Spark from day zero, or was that something that emerged over time, like, this is great for training models and that sort of thing?
没错,实际上它从一开始就是主要应用场景之一。我们在伯克利构建的第一个应用就是机器学习相关的。因为我所在的实验室里有很多系统研究人员,比如Dave Patterson,但他们和机器学习大牛Mike Jordan共同建立了这个实验室,Mike在理论和应用机器学习多个领域都有建树。
Yeah, it actually was one of the main use cases from the beginning. The first applications we built on it at Berkeley were machine learning ones, because I was sitting in this lab that had a lot of the systems researchers at Berkeley, like Dave Patterson, for example, but they had set up this lab with Mike Jordan, who is a huge machine learning researcher doing both theoretical machine learning and applied ML in different areas.
所以他们想探索计算机系统与机器学习的交叉点:能否用ML改进计算机系统?又能否用这些新兴的大规模系统更好地进行机器学习?我在工业界见过人们使用MapReduce等技术,但在伯克利,我想找本地人尝试我那些关于编程模型的疯狂新想法,运行可能还有些bug的系统。周围正好是机器学习研究者们,他们非常乐于尝试。
And so they wanted to look at the intersection of computer systems and machine learning: can we use ML to improve computer systems, but also can we use these large-scale emerging systems to do machine learning better? I had seen people using MapReduce and other things in industry, but at Berkeley I wanted to find someone local who would actually try out my crazy new ideas for a programming model and run a system that still had a few bugs. And the people around were the machine learning ones, and they were very excited to try it.
比如我入学那年有个同学叫Lester Mackey,他后来在斯坦福任教,又去了微软研究院,成了全职AI研究员。当时他正带队参加Netflix推荐引擎竞赛,他告诉我们:'我们真的需要可扩展性'。
So, for example, one of my classmates in the year I started was Lester Mackey. He went on to get a faculty job at Stanford, and then he went to Microsoft Research, so he became a full-time AI researcher. At the time, he was part of a team entering the Netflix Prize competition to build a recommendation engine. And he told us, we really need scale.
我们需要能够实现这些想法,完成所有这些训练任务。比赛前我们只有这么多天时间,所以必须提升效率。他实际尝试后确实加速了一些环节,他的团队在比赛中表现非常出色,最终获得了第二名。当然还有其他参赛队伍。
We have all these ideas, all these training runs we have to do, and we only have this many days before the competition, so we need scale. So he actually tried it out and was able to accelerate some things, and his team did really well in the contest. They placed second at the end. And there were also other groups.
是的,这正是让我们感到兴奋的地方。
So, yeah, that's something we were excited about.
确实。我认为这开创了一个很好的传统,现在公司支持许多企业在尖端机器学习模型上进行开发。我很好奇的是,作为Databricks的一员,你处于一个非常独特的位置,能够观察到大量开发者在其基础上构建有趣的应用。所以我想了解,你注意到了哪些趋势性现象,以及人们是如何将这些机器学习模型投入实际生产的。
Yeah. I mean, it's a good start to a tradition of, I guess, the company now supporting many companies building cutting-edge ML models on top. And I guess I'm curious: at Databricks, you sit at this really interesting vantage point where you get to observe a ton of folks that are building interesting applications on top. So I'm wondering what you've noticed trend-wise and how folks are deploying some of these ML models in production.
当然,你本人也深度参与了MLflow及相关支持工具的研发。一方面想了解你观察到的趋势,另一方面,随着ChatGPT的爆发和大型语言模型引发的热潮,过去几个月应该也出现了不少有趣的应用案例。很想听听你看到的实际落地情况。
And then obviously, you yourself have been very involved with MLflow and building some of the supporting tooling here. So we'd love to hear, one, the trends you've seen there. And two, I imagine with ChatGPT and the rise of all this attention around LLMs, the last few months might have been pretty interesting as well. So just anything you're seeing on the ground and how folks are using these.
确实如此。我重点说两点:首先,企业级机器学习应用——不一定是深度学习——已经日趋成熟。虽然深度学习占比在增加,但机器学习在日常业务流程中的应用现已非常普遍。
Yeah, definitely. So I'll call out two things. First of all, I think ML in the enterprise is established now. Maybe not necessarily deep learning, though increasingly there is a lot of deep learning, but just ML in day-to-day business processes.
回想我们最初起步时,很多人质疑企业是否真的需要机器学习?他们具备相关专业知识吗?比如,企业内部有人能协助吗?你相信这能帮到他们,但那里真有人才吗?诸如此类的问题。
It used to be, when we started out, that many people said, oh, would enterprises actually wanna do machine learning? Do they even have the expertise? Is there anyone there who could help them? You believe it could help them, but is there anyone there? And so on.
或者这会不会只适用于那些古怪的互联网公司?但现在所有大型企业都组建了数据科学与机器学习团队,他们正在打造产品。最初的一两个模型极可能已投入生产,有些企业甚至已有数百个模型在运行,并且拥有专业团队。
Or will it be this thing for only these weird web companies? But now all major companies have hired data science and machine learning teams, and they're building products. Their first one or two models are very likely in production; for some of them, it's already hundreds of models. And they have the team.
他们与业务数据建立了连接,并计划构建更多模型。这些技术在众多领域创造巨大价值。比如有个经常谈论他们如何使用Databricks的客户——壳牌(Shell),这家企业生产各种化学品。
They have the connection to the business data, and they're looking to build more. And these things provide tremendous value in a lot of areas. For example, one customer that's talked a bunch about the things they do with Databricks is Shell. Right? They produce all these chemicals.
他们拥有全套生产设施来处理这些化学品。储存化学品成本高昂,若生产出来却无人使用就会造成浪费,还可能引发各种问题,这种情况对企业非常不利。
They have all these plants that can process them. Storing chemicals is expensive. If you produce something and no one's gonna use it, that's bad. All kinds of things can break down, and if that happens, it's really bad for you.
因此任何能帮助预测需求、优化工厂工况、监测异常的技术都能产生数百万美元的价值。从这些设备收集精细的时间序列数据如此简单,何乐而不为呢?他们现已实现全自动AutoML处理多数用例,机器学习团队只需在出现故障时介入。这仅是众多案例之一。
And so anything you can do that helps you estimate demand, optimize conditions in the plant, monitor, and detect anomalies has many millions of dollars' worth of impact. And it's so easy to just collect the fine-grained time series data from these things, so why not do it? So they're at the point where they're basically doing a lot of use cases through fully automated AutoML, where the ML team only has to look at it when something's broken. And that's just one example.
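As a toy illustration of the monitoring use case just described, the sketch below flags anomalous readings in a fine-grained sensor time series with a simple z-score test. The data and the threshold are made up for the example; a real plant-monitoring pipeline would use something far more robust.

```python
# Flag readings that deviate from the series mean by more than `threshold`
# standard deviations. Purely illustrative; not any Databricks API.
def anomalies(readings, threshold=3.0):
    mean = sum(readings) / len(readings)
    var = sum((x - mean) ** 2 for x in readings) / len(readings)
    std = var ** 0.5 or 1.0  # avoid dividing by zero on a flat series
    return [i for i, x in enumerate(readings)
            if abs(x - mean) / std > threshold]


# A mostly flat pressure trace with one spike at index 51.
series = [10.0] * 50 + [10.2] + [25.0] + [10.1] * 48
```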
无论是乳制品企业还是建筑公司,各行各业都在应用这项技术。这是重大转变——如今在这个领域,你能找到大量拥有ML团队并开展实践的企业。他们现在共同面临的问题是:在完成最初几个用例后,虽然投入巨大,但还有上百个新用例构想,如何更高效地实现并投产?
Companies that produce milk, companies that build houses, whatever it is, they're using it. So that's one change. If you do something in this space now, you can find lots of enterprises who have ML teams and are doing things. And, of course, the next wave of questions they're all asking is, okay.
我认为谁能破解这个难题,谁就将定义下一代平台。这正是我们通过MLflow等开源项目及SaaS服务努力实现的目标。
We did the first few use cases. It was a lot of work, but now we have ideas for, like, a hundred more. So how do we make it way easier to create new ones and bring them to production? And I think whoever cracks that code is going to determine sort of the next platform. That's the kind of thing we're trying to do through open source projects like MLflow and through the SaaS services we have.
这是一方面。另一方面如你提到的ChatGPT大模型,深度学习确实备受关注。在企业界,人们对自然语言处理尤其兴奋,因为相关数据量庞大,能直接用于构建完整产品。现在几乎所有软件公司都在考虑以某种形式将其集成到用户界面中,同时计算机视觉等技术也日益重要。
So that's one thing. And the other thing, yeah, you mentioned ChatGPT and LLMs. So deep learning is definitely on people's minds. And within enterprises, I think people are especially excited about natural language, just because there is so much data with that and there are entire products you can make using it. I see virtually all software companies now, for example, thinking of including it in the UI in some form, but I also think other things like computer vision are increasingly important.
我们已看到相关趋势。ChatGPT真正有趣之处在于它让更多人尝试基础模型——虽然GPT-3已存在多年,更易于提示的InstructGPT也出现许久,但此前缺乏便捷的网页界面供人体验。现在更多开发者开始思考并构建这类应用,这非常棒。
We're starting to see some things like that. And the really interesting thing with ChatGPT is that it got more people to try these foundation models. Basically, GPT-3 had been there for years, and InstructGPT, which is easier for people to prompt, had also been there for a while. But there was no nice web interface where you could just try it out. So I think now a lot more developers are thinking about it and trying to build these applications, which is really cool.
确实。随着更多开发者尝试基于大语言模型构建应用,这会改变你们对MLflow路线图或Databricks工具开发的思考吗?在您提到的两个方向之间——像壳牌这样的企业定制模型,与基于大语言模型构建支持型产品——你们如何规划对企业客户的全方位支持?
Right. Does the rise of more developers trying to build on top of large language models change at all how you think about the roadmap for MLflow or the tools you guys build at Databricks? How are you thinking about that first world you talked about, the specific models for a company like Shell, versus the world of building, say, a support product on top of a large language model? How are you guys thinking about supporting enterprises across those?
是的,这很棒。我认为现在有很多大型语言模型(LLMs),它们有时被称为基础模型。我是斯坦福团队的一员,我们实际上尝试提出一个...
Yeah, it's a good question. So these LLMs out there, they're sometimes called foundation models. I'm actually part of the team at Stanford that tried to come up with a
你们创造的这个词影响力很大。感觉现在所有人都在用
It's a very impactful term you guys coined. Feels like everyone uses it.
这个术语有其利弊。有些人不喜欢它,因为它听起来像是你必须基于这些模型构建。但这并非本意。关键在于预训练模型可以适配不同使用场景的理念非常强大。不过我也发现,每个公司和产品的应用方式都不同。
I mean, there are pros and cons to the term. Some people don't like it because it sounds like you have to build on one of these, and that wasn't really the intent. But the idea of a pretrained model that you can then adapt to use cases is very powerful. At the same time, though, I'm seeing that for every company and every product, their application of it is different.
要获得最佳质量,就需要定制化。所以人们正在探索解决方案——有人直接使用模型预测结果,然后通过微调或从头训练自己的模型(比如用模型标注数据);也有人研究巧妙的产品集成方式,比如GitHub Copilot那篇关于如何选择最佳上下文提示以提高代码补全准确性的文章。
And to get the best quality, you will want to customize it. So people are trying to figure out what to do. Some people are getting predictions from these models and then actually training their own model, by fine-tuning them or even by training their own from scratch, just using the big models to label stuff. Some people are looking at really clever ways to incorporate them in their products. For example, there's this nice post on GitHub Copilot about how it decides what context to send with your prompt so that it has the highest chance of knowing about your code and completing it.
还有人设计能收集有效反馈的用户界面。虽然仍需大量工程工作,但这些模型首次实现了理解语言意图和模糊语义的基础能力,而且试错成本极低,这很棒。
And some people are designing user interfaces that can actually collect reasonable feedback and make things happen. So I think there's still a lot of engineering, but these models give you some basic capabilities, like understanding language, intent, and fuzzy semantics, that you couldn't get before. And they're also super easy to try out, which is great.
确实。围绕基础模型涌现的MLOps公司很有趣,比如专注提示工程、上下文增强或模型串联的。我好奇的是——你们肯定思考过——这些需求有多少是GPT-3当前阶段的临时方案,又有多少会成为长期必备组件?
Yeah, it's interesting. I think some of the MLOps companies that are emerging around the foundation models themselves are doing things like prompt engineering, or providing more context in prompts, or chaining models together. And one thing I'm curious about, and I'm sure you guys think about this, is that it's hard to know how much of that is point-in-time for, say, GPT-3 today versus how much of it is always gonna be a fixture of working with these models.
你们在构建产品时如何区分哪些是基础模型现阶段的临时问题,哪些会成为MLOps技术栈的永久组成部分?
You guys must think about this in terms of what you build. How do you think about how many of these problems are point-in-time for where foundation models are right now, versus permanent fixtures of an MLOps stack that you might need for them long term?
好问题。关键要区分算法特需方案和通用需求。Databricks更关注构建长期通用的平台组件——
Yeah, great question. I think it's important to separate things that are very algorithm- or technique-specific from general things you need. And for Databricks as a company, at least, we've focused on figuring out what the platform pieces are that are really general and will be useful later.
例如模型部署后都需要:收集性能数据、进行AB测试、分析日志(数据量可能很大)。我们要简化这个流程——每次部署自动生成服务记录表,支持版本对比和可复现性。
So what do I mean by that? For example, whatever you do with your model, you're gonna deploy it, and you're gonna wanna collect back data on how it's doing, or maybe A/B test different versions, and easily analyze that data. It could be lots of data, because you wanna log everything you can to improve it. So we want to make that workflow super easy: any time you deploy the model, you just get a table of all the stuff it served.
无论使用scikit-learn、PyTorch还是API调用提示工程,这些需求都存在且实现不易。当然也可以针对特定模型开发专项产品,但通用平台建设空间更大。
Or you'll want to have different versions of it and compare them, or you'll want reproducibility, so you can see what went into it. I think that will be important whether you're running scikit-learn or PyTorch or you're calling an API for the model with a prompt. You still need all these things, and they're not easy to do; it's not easy to figure out the right workflows, so there's a lot to do there. Of course, you can also do something more specialized for one type of model or one application, and you can have really great products there.
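The deploy-and-log-to-a-table workflow described here can be sketched in a few lines. This is an illustrative stand-in, not the actual Databricks or MLflow API; the class and field names are made up.

```python
# Wrap any predict() callable so every request/response lands in a log that
# can later be analyzed, compared across versions, or A/B tested.
import time


class LoggedModel:
    """Hypothetical wrapper that records what a deployed model serves."""

    def __init__(self, model, version):
        self.model = model
        self.version = version
        self.log = []  # in production this would land in a queryable table

    def predict(self, features):
        output = self.model(features)
        self.log.append({
            "ts": time.time(),
            "version": self.version,
            "input": features,
            "output": output,
        })
        return output


# Usage: deploy two versions side by side, then compare their logs.
v1 = LoggedModel(lambda x: x * 2, version="v1")
v2 = LoggedModel(lambda x: x * 2 + 1, version="v2")
v1.predict(3)
v2.predict(3)
```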
我认为,最成功的案例往往具有广泛适用性。比如那些让你更全面审视计算机视觉的技术,或是能轻松为新型物体构建检测器的设计工具。这类成果很可能被广泛采用,但仍有待观察。关于提示工程这个具体方向——虽然你们稍后可能也会问到——我在斯坦福的研究正聚焦于此:如何用语言模型构建复杂应用。我们开发了一个名为‘演示-搜索-预测’(DSP)的编程范式,已开源部分代码,并将这些组件视为基础构建模块。
I think, again, some of the most successful ones are pretty broad, like things that let you look at computer vision in general and, say, easily build an object detector for a new kind of object. Those are likely to be widely used, but it remains to be seen. I should also say, on this prompting aspect specifically, maybe you folks would have prompted me to talk about it later too, but some of my research at Stanford is on this now: how to build complex applications using language models. We have this programming model called Demonstrate-Search-Predict, or DSP. We have some open source code, and we're also looking at these things as building blocks.
我们采取的方法可能与某些公司不同:只在语言模型最擅长的领域(如模糊匹配)使用它,同时尽可能保持对系统的程序化控制。具体来说,你可以用Python编写流程步骤,部分环节再调用语言模型。这样既能微调模型,又通过限制其作用范围来确保每个环节的高准确率,最终使整体系统保持可靠性。
And actually, the approach we've taken, which is maybe a bit different from some companies, is that we want to use the language model for things it's uniquely good at, like approximate matching, but as much as possible we want to retain programmatic control of what the system does. So we have something where you basically write a pipeline of steps in Python, and then some of the steps call into an LLM. You can also fine-tune it and so on. But we're trying to limit its scope to tasks that will be really easy for it, so that the overall accuracy as you combine many of these steps is still good.
否则,如果完全依赖它执行复杂任务——比如制定问题解答计划并逐步执行——一旦初始计划出错就会全盘失败。不同团队有不同理念,当然也有人反其道而行。但从工程角度,我们追求99%的可靠性:需要精确计算每个环节的失败概率。
Because otherwise, if you bet on it doing one big thing, like coming up with a plan to answer a question and then following each step, and it comes up with the wrong plan, then you're toast; it's never going to succeed. So there will be different philosophies, and I'm sure there will be people doing the opposite too, but we're looking at it from an engineering point of view: we want something that's 99% reliable, so what's the failure probability of each stage that we need to get it to work?
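The keep-control-in-code philosophy described above can be sketched roughly like this. `call_llm` is a canned stub standing in for a real model call, and none of these names come from the actual DSP library; the point is only that Python owns the control flow while the model handles one narrow fuzzy subtask.

```python
# Ordinary Python owns the pipeline; the language model is called only for a
# limited, easy subtask (synonym expansion), not for overall planning.
def call_llm(prompt):
    # Stub standing in for a model call that expands a word into synonyms.
    canned = {"synonyms for 'car'": ["automobile", "vehicle"]}
    return canned.get(prompt, [])


def search(corpus, terms):
    """Deterministic step we fully control: substring match over documents."""
    return [doc for doc in corpus if any(t in doc for t in terms)]


def answer(corpus, word):
    # Step 1 (LLM, limited scope): a task that's easy for the model.
    terms = [word] + call_llm(f"synonyms for '{word}'")
    # Step 2 (plain code): retrieval whose failure modes we can reason about.
    return search(corpus, terms)


docs = ["the vehicle stopped", "a recipe for soup"]
```

Because each stage is narrow, the failure probability of the whole pipeline is roughly the sum of small per-stage failure rates, rather than riding on one open-ended plan being right.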
确实很酷。DSP项目早在我们关注列表中,大家对此都很兴奋,本来也准备稍后请教你这个问题。
Yeah, it's super cool. And DSP is definitely one of the projects that was on our radar, and it seems like one that people are really excited about. So we were gonna prompt you on it later.
绝非双关(no pun intended)。
No pun intended.
确实不是双关。这领域非常有趣——关于人类如何与LLM交互,以及任务分配问题。DSP似乎正将工作从LLM转移到代码层面?当人们用Python脚本编排LLM调用时,可以这样理解吗?
No pun intended. Yeah, it seems like a really interesting area of research in terms of how people are interfacing with LLMs and what work is done by the LLM versus in code. It seems like with DSP, it's actually pulling work out of the LLM and into code. Is that the right way to think about it, when people are writing these Python scripts to orchestrate LLM calls?
广义上说,我们主张选择性调用模型。比如确定要执行A或B操作时,就以受限方式使用模型(例如生成同义词后再检索),而非抛出一个复杂问题指望它解决所有环节。这本质上是个工程挑战。
Yeah. More generally, as I said, we wanna have some steps you can call. Or if you're pretty sure you wanna do one thing or another, you use the model in a limited capacity, like, say, generate synonyms for this word, and then I'll search for them, as opposed to asking a very complicated question and hoping that it figures everything out. So it's sort of an engineering challenge.
但更深层的问题是:当前大模型将语义理解/推理与世界知识耦合在一起。比如问‘美国总统是谁’——今天ChatGPT可能答对,但四年后必然错误。实际上...(停顿)我不确定现在模型是否更新了这个信息,也许长期存在滞后问题。
But it remains to be seen. The other factor this is getting at, which is maybe kind of interesting, is that right now these large language models bake together and combine both what I'll call semantic understanding, or reasoning, and facts, knowledge about the world. So for example: who's the president of the US? Well, if you ask ChatGPT, today maybe it gives the right answer, but in four years it'll be wrong. Right?
理想情况是将二者解耦。但当你用海量数据训练模型获得语义理解能力后,很难再让它‘忘记’那些事实性知识。试想某个事实突然改变,模型能否动态调整答案?
Actually, I don't remember what it says now. Maybe they got it right. Maybe it stayed stale for a long time. Maybe it's not good. But those are things you'd like to decouple, so we also wanna look for a way to separate these.
这确实很困难。就好比突然世界某个设定改变了,现在你还能正确回答我的问题吗?
And it's actually kinda hard: if you train a model on tons of data to give it the semantic understanding, it's really hard to then make it forget those facts. Like, imagine a world where suddenly this one thing changed. Now can you answer my question?
所以
So
我觉得这非常有趣,因为这确实是个大问题——知识会被编码进模型权重里,更新起来可不容易,对吧?嗯。你认为未来会不会有更多传统搜索索引来编码知识,然后由更擅长生成的部分来产生回答?或者说,你觉得传统信息检索与大型语言模型未来会如何协作?
I think that's super interesting, because it is a big problem that this knowledge gets encoded in the model weights, which are not easy to update. Right? Do you think in the future there will be more of a traditional search index that encodes knowledge, and then something more generative that generates the response? How do you see traditional information retrieval working with LLMs in the future?
我认为这是个吸引人的架构,而且我们需要有强有力的理由才不去采用它。至少在那些知识密集型应用中——无论是公共知识还是像公司内部的支持工单这类私有知识——这种架构很有意义,你会希望接入这些资源。虽然如何最优构建这些系统还有待探讨,但即便是用现有工具简单组合,也能获得令人印象深刻的演示效果和质量。
I think that's an attractive architecture, and we'd need strong reasons not to do it. At least in what I'd call knowledge-intensive applications, where there's either some public knowledge or even private knowledge, like support tickets in my company, it makes sense, and you want to be able to plug those in. There's a question of how to best build these, but you can get very impressive demos and quality even with fairly easy ways of combining current tools.
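A minimal sketch of that architecture: facts live in a searchable index, and the model only has to phrase an answer from the retrieved passage. The keyword scorer and the stubbed `generate` below are illustrative assumptions; a real system would use BM25 or embeddings for retrieval and a real LLM for generation.

```python
# Retrieval-plus-LLM sketch: knowledge stays in the index, not in the weights.
def retrieve(index, query):
    # Toy keyword overlap score; stands in for BM25 / embedding search.
    scored = [(sum(w in doc for w in query.lower().split()), doc)
              for doc in index]
    return max(scored)[1]


def generate(question, context):
    # Stub: a real language model would compose fluent prose here.
    return f"Based on the retrieved passage: {context}"


index = [
    "the capital of france is paris",
    "spark was created at uc berkeley",
]


def rag_answer(question):
    return generate(question, retrieve(index, question))
```

Updating a fact then means editing one index entry rather than retraining the model, which is exactly the decoupling discussed above.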
非常酷。关于这点我最后想说的是,似乎还存在一些有趣的可能性——比如你的DSP程序由语言模型来指导编排。你觉得未来这类程序会完全由语言模型编写吗?还是说这个抽象层级更适合人类来编写程序,然后通过这种方式与语言模型交互?
Very cool. The last thing I wanna touch on this point is that there also seem to be some interesting things where you have your DSP program, but then you also have a language model telling you how to orchestrate a DSP-type program. Do you see those being written by a language model in the future, or do you think this is the right level of abstraction, where a human takes over writing the program and interfaces with the LLMs that way?
由语言模型生成的内容。我认为它们最棒的新应用场景之一,就是擅长生成那些冗长复杂的答案或示例。比如当人们问ChatGPT是否能取代谷歌时,我认为以目前形态它更像是Stack Overflow或Quora这类平台的替代品,而非谷歌的替代品。
Things produced by a language model. I think one of the best kind of new use cases for these is that they are good at generating long, complicated answers or examples for something. For example, when people ask, hey, is ChatGPT a Google replacement? I would say it's not so much a Google replacement in its current form as a Stack Overflow or Quora type of replacement.
当然,即便是由搜索引擎增强的聊天GPT,我认为最有趣的部分是当你要求它生成大量内容时。因此,它在教授人们编程方面非常强大。实际上,当我们考虑在Databricks中时,可以想象我们也在构建利用这些语言模型的方式,比如在我们的产品中。我们最兴奋的一点是,这将使产品对人们来说更容易学习,对吧?
Of course, even for ChatGPT augmented by a search engine, I think the most interesting bits are the ones where you're asking it to generate a lot of stuff. So it is very powerful for teaching people programming. Actually, when we think about Databricks, you can imagine we're also building ways to use these language models in our product. And one of the things we're most excited about is that it will make the product easier to learn for people. Right?
比如,我们致力于扩大数据访问范围,让更多非专业人士也能使用。让他们做酷炫的事情,而作为云服务托管是一种方式,但这种能提供可立即使用或可编辑内容的自然语言交互则是另一种方式。所以我认为这将是一个有趣的互动过程。另外我想说的是,如果你想着‘好吧,我要让语言模型为我生成一些代码,然后希望代码是正确的’。
We're all about expanding access to data to more people, non-experts. Let them do cool things. Hosting things as a cloud service is one approach, but this kind of natural language interface that gives you something that's maybe usable, or that you can edit from, is another one. So I think there'll be an interesting back and forth. And the other thing I'll say about that is, think about what happens if you say, okay, I'm gonna tell a language model to generate some code for me, and then I hope the code is correct.
对吧?比如它是Python代码,我就直接运行它。但如果你不懂Python编程,要让这奏效其实挺难的,明白吗?
Right? Like, it's Python or something, and I'm just gonna run it. If you don't know how to program in Python, it's kinda hard to make that work. You know?
如果你在学习阶段,这很好。但你也可以尝试让它生成其他东西。比如,如果它生成了一个拖拽式的UI界面,就像在Figma这类工具中那样呢?对吧?这样你就能直观看到它。
It's good if you're learning. But you could also try getting it to generate something else. Like, what if it generated a drag-and-drop UI in a tool like Figma or something like that? Right? Then you see it.
你可以编辑它。你不需要成为专家也能判断它是对是错。如果它生成的是那种可视化的数据流线呢?比如生成一个解析转换数据的管道,然后你只需点击就能查看它。
You can edit it. You don't need to be an expert to tell whether it's correct or wrong. Or what if it generated one of these visual data flow pipelines? Like, generate a pipeline to parse and transform these things, and then you can just click on it and view it.
所以它不必生成那种晦涩难懂的编程语言也能很有用。
So it doesn't have to generate, like, an inscrutable, you know, programming language to be useful.
完全同意。
Totally.
没错。这确实很有趣。你提到可能将这些模型用在Databricks产品中,使其对用户更友好。我想知道,你具体是如何考虑的,或者实际会如何整合这类模型?
No, it's really interesting. You mentioned potentially using these models within the Databricks product to make it more accessible to folks. I guess, what are some examples of how you're thinking about that, or how you might actually end up incorporating these types of models?
是的,这类应用场景很多。我是说,很多公司都在这么做。比如任何可以输入查询的地方,无论是用Python还是SQL,或许都能改用自然语言输入。这只是个例子,还有其他应用。
Yeah, there are a lot of these. I mean, many companies are doing this. But certainly, any place you can type a query, in Python or SQL, maybe you could type it in natural language instead. That's one example, but there are also others.
比如搜索数据集或文档时,为什么不通过更好的模型来实现?另一个简单应用是自动填充推荐,就像在Google Drive分享文件时,添加一个人后系统会建议可能还想分享给其他人。
Like, what if I'm doing a search, either over my data sets or over the documentation? Why not let me do that through a better model? Another simple one that many companies are doing is basically autofill recommendations. You know how, for example, when you share a document in Google Drive and you add one person, it suggests, hey, maybe you also want to share it with these other people.
这很容易实现。我们合作的一些软件公司甚至在ChatGPT之前就在平台上做机器学习。他们想做得更多,我们问为什么,他们说应用里有上千个文本字段,每天有数百万用户使用,希望每个字段都能根据客户部署实现自动补全。
That's really easy to do. So some of the software companies we work with, even before ChatGPT, were doing machine learning on the platform. They wanted to do a lot more, and we asked them why, and they said, well, look, I have this application. It's got a thousand text fields in it, and there are millions of users using it every day. I want every single text field in there to have autocomplete based on that customer's deployment.
这能让应用更受欢迎。所以我希望未来开发者能轻松添加这类提升体验的功能。比如在HTML中添加文本字段时,只需简单设置就能基于其他字段实现智能填充。
So how how can I get that? Right? It'll it'll make everyone like my application better and and all that. So, yeah, so I think, there are quite a few, and I'm definitely hoping we get to a place where these, like, kinda quality of life things are very easy to add for any developer. Like, you know, imagine I'm adding a text field in, like, HTML.
而对于需要精心打造的功能,比如自动驾驶汽车,就不能只靠一键AutoML,需要最好的工具持续优化,确保99.99%的可靠性。
I could just put, like, you know, autofill equals, like, you know, based on these other fields, basically. And then for the ones where, like, you really wanna spend time, like, hands on, build, like, the best quality thing you can, you know, invest lots of engineering, you have nice tools where ML engineers can just focus on that. So, you know, like, when you're building, like, your self-driving car or whatever, you're not gonna do that with just, like, one-click AutoML. You want the best tools to keep cracking away at it and, like, make sure it's really, like, good 99.99% of the time.
太棒了。我们换个话题,聊聊LLM领域的最新进展和模型训练吧。
So Awesome. Would love to shift gears and talk a little bit about kind of the latest in the LLM world and and training those models.
嗯。
Mhmm.
看起来Transformer架构正逐渐成为多个行业实际采用的标准架构。这很有趣,因为虽然大家都在使用相同的架构,但关键在于如何实现规模化。你作为系统专家,正是规模化领域的权威。我想请教,你如何看待建模专业能力与系统专业能力的重要性对比?以及在训练、服务部署等环节的规模化问题上,你认为这两者的重要性会如何随时间演变?
It seems like the transformer architecture is kind of becoming the de facto architecture that's used, you know, across a bunch of industries. And so then it's interesting, because it seems like people are using the same architecture, but then it's about how you can actually scale these. And, like, you're a systems person, and so you're, like, the expert on scale. I guess, how do you think about the importance of kind of modeling expertise versus systems expertise and how you scale, you know, the training and the serving and that sort of thing? And, like, how do you think, you know, the importance of those might change over time?
这是个很好的问题。两者都很重要。规模化的迷人之处在于——我想这也是许多人选择这条路的原因——你不需要绞尽脑汁就能获得性能提升,对吧?
Yeah. It's a great question. I mean, they're both important. The really interesting thing about scale, and I think the reason a lot of folks went for it, is you don't have to maybe think that hard, I guess, to get gains. Right?
确实,构建系统的工程师需要深思熟虑。但如果你设计出某个模型架构或训练流程(比如SGD之类表现良好的方法),然后投入10倍的机器或数据量就能获得更好效果,这简直太棒了。所以很多人选择这条捷径:既然能获得数百万美元投资,与其费心钻研基础原理,不如直接购买更多GPU、投入更多数据来尝试。
Like, yes, I mean, the systems people building the system have to think, but if you build, like, some model architecture, some training process, like SGD or whatever, that, you know, seems to do well, and you can, like, throw 10x more machines at it or, like, 10x more data, and it does even better, that's, like, fantastic. So I think a lot of people are going down this route and saying, hey, I can get, you know, x million dollars in investment. Like, I am doing a lot of things that involve thinking hard and, like, you know, trying to understand from the fundamentals, but why don't we also just buy more GPUs and, like, put in more data and, like, try to do this thing?
目前很多企业都采取这种策略。但如今语言模型的规模已大到只有极少数公司能复现,这对研究者确实是个问题。我认为这实际上会延缓创新——因为能参与探索新想法的人变少了。或许未来会出现更易上手的新方法,催生更优秀的衍生理念和长期研究。不过同时也有许多人在努力降低规模化成本,让更多人能参与其中。
So I think a lot of folks have gone down that way. Now, of course, today with language models, it's at the point where very few companies can even, like, reproduce the larger-scale ones, and for researchers, that's certainly a problem. It actually means, I think, that the research, or the innovation, will kind of slow down, because fewer people can be chipping away at it and, like, you know, trying out new ideas. So maybe there'll be some new thing that's easier for people to play with that leads to, like, better sort of, you know, descendant ideas and better long-term research. But it also means that there are enough people working on making it cheaper, making it easy for anyone to do stuff at scale, that, you know, there'll be activity on that.
如果这种方法确实适用于多数场景,其成本其实仍然很低——即使每次预测需要1美元GPU时间,在某些应用中也完全值得。但我发现,在需要极高精度的特定场景(比如客服工单处理),这就变成了传统机器学习工程问题:你必须考虑是否具备正确的验证数据?输入是否恰当?
And if this is the method that works well, you know, for many applications, like, it's still very inexpensive. Like, you know, even if serving a prediction costs you, like, a dollar or something of GPU time, in some applications, it's, like, totally worth it. So people are doing it. What I found, though, is for, at least for very targeted applications where you want very high quality, like, you know, like, say, answering a support ticket or something you don't wanna be wrong on, that becomes kind of a traditional ML engineering process where you have to think, do I have the right validation data? Do I have the right inputs?
如何将这个版本与另一个版本进行对比检查?基础模型只是其中的一部分。拥有合适的工具、正确的流程、运维流程以及能观察运行状况的工具同样至关重要。
How do I check this version against this other version, and so on? So the foundation models are only one piece of that. And having the right tools and the right, like, process, the ops process, and the right tooling to, like, see what's going on is also important.
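The version-against-version checking he describes is, at its core, ordinary held-out evaluation. A minimal sketch, with two hypothetical rule-based stand-ins playing the role of model versions of a support-ticket classifier (a real setup would call the actual models):

```python
def evaluate(model, validation_set):
    """Fraction of validation examples the model labels correctly."""
    correct = sum(1 for text, label in validation_set if model(text) == label)
    return correct / len(validation_set)

# Hypothetical stand-ins for two versions of a ticket classifier.
def model_v1(text):
    return "billing" if "invoice" in text else "other"

def model_v2(text):
    if "invoice" in text or "refund" in text:
        return "billing"
    return "other"

validation = [
    ("where is my invoice", "billing"),
    ("please refund my order", "billing"),
    ("app crashes on login", "other"),
    ("reset my password", "other"),
]

print(evaluate(model_v1, validation))  # v1 misses the refund ticket
print(evaluate(model_v2, validation))
```

The point of the sketch is the process, not the models: a fixed validation set plus a scoring function is what makes "is v2 better than v1?" an answerable question.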
是的,这很有道理。在成本方面,你提到的训练与推理成本差异很有意思。看起来训练成本在过去几年确实大幅下降——我记得2020年有估算显示...
Yeah. It makes a lot of sense. On the cost side, it's fascinating what you were saying around training versus serving. And it seems like the training costs have gone down a lot over, you know, even just the last couple of years. I think there was one estimate that, in 2020,
有人估算仅计算资源就花费了500万美元训练GPT-3。而Mosaic最近发布的博客称,现在用不到50万美元就能训练出质量相当(规模稍小)的GPT级模型。这意味着两年内成本下降了10倍。你认为这个趋势会持续吗?未来两年会再降10倍?还是说我们会遇到某些技术极限...
someone estimated that on compute alone, it was $5,000,000 to train GPT-3. And Mosaic just published a blog that you could now train a GPT-quality model, so it's a little bit smaller, but a quality model, for less than $500k. So it's like a 10x reduction in two years. Do you think that trend will continue and we'll see another 10x reduction in the next two years? Or do you think that, you know, maybe there'll be some sort of limits that we're hitting up against? Mhmm.
毕竟早期人们肯定优先摘取了那些低垂的果实。
Because I bet there's some low hanging fruit that people were going after early.
没错。我认为现有模型的成本优化仍有很大空间。如果仅基于当前模型架构追求更低成本,确实存在可能性。现在有很多公司研发加速器,这领域将形成激烈竞争——特别是如果最终主流仍是Transformer架构。对吧?
Yeah. I think there's still a lot of space to improve costs for the current models. So if you just wanna take today's model architecture and, like, do it more cheaply, there is space. There are all these companies working on accelerators that will presumably generate a lot of competition there, especially if it ends up being a lot of transformers. Right?
对他们中的一些人来说,走这条路风险很大,因为如果所选的架构发生变化,他们就会陷入困境。但总会有人这么做。同时,训练过程本身也有新思路或新发展,比如使用更大的数据集、更干净的数据集、不同的惩罚机制。人类反馈机制表明,仅靠预测下一个词的纯语言模型其实很难用,而这种指令微调技术让它实用多了。明白吗?
It's just risky for some of them to go down that path, because then if the architecture, you know, of choice changes, then they're kinda stuck. But there will be some who do that. And there are also, you know, new ideas or new developments about the training process itself, like having bigger datasets, maybe having cleaner datasets, having different penalties. The human feedback stuff, for example: it turns out a language model alone that just predicts the next word is kinda hard to use, but this instruction tuning thing makes it much easier to use. You know?
嗯。算法层面还有很多改进空间,可能达到相似的质量。此外还有像模型压缩这样的通用技术——比如稳定扩散模型最终被压缩到能在笔记本或手机上运行的程度,这恐怕是初期没预料到的。所以今天的模型肯定会变得更实用。
Mhmm. There's a lot of room to do algorithmic things that might give you similar quality. And then there are also pretty general techniques like model compression. Like, for example, with stable diffusion, they ended up compressing it to the point where you can run it on, like, a laptop or a phone or something, which, you know, wasn't anticipated, I think, at the beginning. So, yeah, I think today's models will become way more practical to use.
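Compression of the kind that put stable diffusion on a phone typically involves quantization, among other techniques. A toy sketch of symmetric 8-bit post-training quantization on a small weight list (an illustration of the idea, not the actual stable diffusion pipeline):

```python
def quantize(weights):
    """Symmetric 8-bit quantization: store ints in [-127, 127] plus one scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the quantized ints."""
    return [v * scale for v in q]

weights = [0.8, -1.27, 0.03, 0.5, -0.64]
q, scale = quantize(weights)
restored = dequantize(q, scale)

# Each restored weight is within half a quantization step of the original,
# while the stored representation shrinks from 32-bit floats to 8-bit ints.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err)
```

The 4x size reduction (and faster integer math on many chips) is why quantization is one of the standard first steps in making a model run on laptops and phones.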
我的问题是:如果加入更多参数,它们能提升多少?这又回到了知识储备与语义推理的问题——如果能区分这两者,或许就不需要增加参数。有趣的是,在计算机视觉领域,我们没有类似GPT-3的巨无霸模型,现有中等规模的模型就能满足各种需求,没人会为提升性能而增加百倍参数。但在语言模型领域我们却这样做了,其中多少是为了推理能力,多少是为了知识储备?
The question I have is how much better do they get if you put in way more parameters? And again, I think it also is tied to this knowledge versus semantic reasoning question, and if you could separate those two, then maybe you wouldn't need to put in more parameters. Because one interesting thing is, in computer vision, we don't have, like, a GPT-3. We have excellent models you can use for pretty much anything you want that are sort of moderate sized, and we haven't had, like, anyone adding, you know, 100x more parameters to make them better. In language, we do have this, but how much of it is for the reasoning versus the knowledge stuff?
作为人类,我可以编个故事,构建一个自带物理法则和派系的科幻宇宙,并设定规则。而你听完就能据此推理,比如判断反抗军就该炸毁死星之类的。根本不需要花大量时间学习背景知识。
You as a human, I can tell you, I can make up a story. I can make up a sci-fi universe with my own physics laws or my own factions and tell you the rules about them. And, like, you can then reason about them and tell me, like, you know, the rebels should definitely blow up the Death Star, or, like, whatever. You know? You don't need to, like, spend lots of time, like, learning about it.
所以问题在于:现有连贯模型已经开发了多少基础推理能力?我们还能获取多少潜力?
So the question is how much of that basic, like, reasoning capacity have we maxed out with the current models, and how much more are we gonna get? Yeah.
确实厉害。iPhone上跑稳定扩散,最近好像还出了Lama模型?这类单GPU就能运行的模型,肯定能催生像你刚才说的那种有趣应用场景。
Yeah. It's really cool. The stable diffusion running on the iPhone, and I think Llama just came out. Or one of those, you can run on a single GPU, and I can imagine there are just really interesting use cases, exactly what you were saying.
对。Llama算是精简版Transformer,但更注重训练过程,使得它在标准学术基准测试上的质量接近GPT-3。所以...
Yeah. Llama is like a smaller transformer, but with more care going into the training process, which makes the quality, at least on the standard, like, academic benchmarks, similar to GPT-3. So
太酷了。
Super cool. Yeah.
听着关于模型规模的讨论,我很好奇一件事:当前生态中,既有你所说的闭源大型基础模型公司,也有开源替代方案。作为开源领域的资深人士,你如何看待开源方案与闭源方案的未来采用比例?
I'm curious, given this discussion we're having about, like, how big these models need to get. You know, one thing I'm struck by, in terms of momentum in this space right now, is, obviously, you have the large foundation model companies, to use your term, that are kinda closed source, and then you've got kinda the open source alternatives. And obviously, you've spent a lot of time in the open source world. You know, curious how you kinda see this playing out in terms of the extent of adoption of the open source solutions versus the closed source solutions.
好问题。这里涉及多个因素,但预测很难。我可以分析闭源服务模式的利弊——支持方的论点主要是SaaS模式更易用。
Yeah. It's a great question. I think there are a few factors here, and, again, it's really hard to make predictions. So I can give you, like, both the arguments for and against the closed-source service approach. So one of the arguments for would be that it's just easier to use, because it is a SaaS thing.
举个例子,如果我想训练自己的模型来做某事,首先得搞到一些GPU。即使在云端也不容易大量获取,或者我可以购买然后等待。实际上我在斯坦福有一堆高性能GPU服务器,一个月前又订购了几台,现在还没到货,只能干等。这很麻烦。之后我还得安装软件。
So, for example, like, if I wanted to train my own model to do something, I would have to first acquire some GPUs. It's not super easy even in the cloud to get a lot of them, or I can buy them and, like, wait. Like, you know, I actually have, like, a bunch of high-powered GPU servers at Stanford, and I ordered a few more, like, a month ago, and, you know, they're still not here yet, so I just gotta wait for them. So, like, that's a pain. I then have to install software.
不用GPU时还得考虑资源闲置问题。但使用托管服务就没有这些烦恼——我只需调用接口就能获得响应。如果服务商在后台优化性价比、升级CUDA版本甚至改进模型架构,我都能直接受益。这是个强有力的优势。
You know, I have to manage it. If I'm not using the GPUs or whatever, you know, it's an issue. So with something that's hosted as a service, you don't have any of those. I can just make a call and get an answer back, and then if they're working behind the scenes to, like, improve its price performance, or, like, upgrade to the latest version of CUDA, or even improve the model itself, the architecture, I just get those benefits. So, like, that's one strong argument.
这就像SaaS与传统软件包的对比。另一个优势是服务商可以通过海量客户数据建立护城河。不过这种数据优势的必要性还有待验证,毕竟这类模型的核心卖点是情境学习——不需要预训练大量数据就能直接适配新任务。
That's, like, the SaaS versus, you know, packaged software type argument. And then the other one would be if they can learn a lot across customers and get this sort of unbeatable data advantage by being really good at that. But that, I think, you know, kind of remains to be seen, how much of that you need. Because the whole point of these, in a sense, what it says on the tin, is basically, like, in-context learning. Like, I don't need to, like, see a lot of stuff.
这意味着可能不需要庞大数据量。但数据优势仍可能在某些领域(包括语言模型)成为决胜关键。反方观点则是:最佳模型架构和训练方法迭代极快,封闭系统会错过开源社区的创新红利。
You can just apply me to your new thing. So that means, you know, maybe you don't need that much data. But those could be the reasons. And I think there'll be some areas, and it could include language models, where, like, you know, the best ones will be by companies that have that data moat. On the other hand, the counterarguments: one is, you know, the best models and the best training methods and stuff like that change very rapidly.
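The in-context learning he refers to means adapting the model by showing examples in the prompt rather than retraining. A minimal sketch of few-shot prompt assembly; the task, the examples, and the commented-out `llm.complete` client are all hypothetical:

```python
def few_shot_prompt(examples, query):
    """Build an in-context learning prompt: labeled examples, then the query."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model continues from here
    return "\n".join(lines)

examples = [
    ("Loved every minute of it", "positive"),
    ("A total waste of time", "negative"),
]
prompt = few_shot_prompt(examples, "Surprisingly good for the price")
print(prompt)
# In practice you would send `prompt` to a hosted LLM, e.g.:
#   answer = llm.complete(prompt)   # hypothetical client
```

This is the sense in which "you don't need that much data": a handful of examples at inference time, instead of a labeled training set and a fine-tuning run.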
如果你闭门造车,当某个研究生突然发布更优架构时,所有人都会转向新技术。你的Transformer可能很优秀,但缺乏社区支持就会落伍。这是封闭系统的风险。我认为基础要素更可能成为公共资源——就像当前模型主要基于公开数据集训练。
And if your thing is closed, you don't get the benefit of that open, you know, innovation community that's happening. Right? So you might invest super hard in, like, serving and honing and training a transformer. And then, like, some grad student, or, like, undergrad somewhere these days, publishes a new thing that's better. And then everyone switches to that.
这些资源有成为公共品的趋势,没有组织会想独自构建整个数据体系。我们会协作处理基础部分,在其他领域竞争。开源生态本质上也是如此。
And, like, maybe your transformer was awesome too, but, like, no one else is working on your architecture, and you lose out. So that's, like, one of the risks. So I think it will remain to be seen. And there is, I think, a tendency for more foundational things, like public datasets, for example. All these things are trained in large part on public datasets.
These things, there's a tendency for them to become, like, commons, where, you know, no organization, like, wants to do that whole data side on their own. Okay, we'll collaborate on that, and we'll compete on something else. That's also what open source is, basically. So, yeah.
确实有趣——底层架构是否保持稳定?以及如你之前提到的,我们能否在模型中编码更多推理能力,而非单纯存储知识?这关系到是否需要投入数十亿美元计算资源才能打造尖端模型。
Yeah. It's really interesting, you know, whether the underlying architecture stays the same, you know, as well as, kind of alluding to the point you were mentioning earlier, like, whether we can find ways to, you know, encode more reasoning into these models beyond just, you know, having to store every bit of knowledge in there as well. It seems to have implications for whether you're gonna need to throw billions of dollars at compute to kind of get a cutting-edge model.
嗯。之前聊到你们将DSP与信息检索结合的研究。能多谈谈Dawn实验室的使命吗?还有你们团队在研究的其他项目?
Mhmm. We talked a little bit about the research that you were doing with DSP and information retrieval and bringing those together. It'd be awesome to hear more about DAWN and kind of what the charter is there, and maybe some of the other projects that you and your team are working on in the research world.
DAWN是斯坦福的大型实验室,聚集了机器学习与系统领域的研究者。实际上这个三年期项目已经接近尾声,抱歉。
Yeah. So DAWN is this bigger lab at Stanford where a bunch of machine learning and systems folks got together. Actually, DAWN has basically just wrapped up. It was meant to be a three-year project, so it's kind of wrapped up now. Sorry.
抱歉。这原本是个五年计划项目。但我们召集了一批教职员工,研究机器学习民主化的不同方面,目标是打造一个能倾听业界声音的环境。我们举办了多场研讨会、静修会和类似活动,邀请行业人士参与,聆听斯坦福应用机器学习的研究者分享,并就不同主题交换想法。
Sorry. It was meant to be a five year project. But we brought together, like, a bunch of faculty, you know, looking at different aspects of and and the goal was to look at democratizing machine learning and to create this environment where we can hear from people in industry. We had a bunch of, like, workshops and retreats and events like that with folks from industry. We can hear from other people at Stanford who are applying ML, and we can also exchange ideas on different topics.
比如DAWN实验室的另外两位教授Kunle Olukotun和Chris Ré,前者参与了硬件公司SambaNova,后者的实验室孵化了数据导向型ML公司Snorkel。实际上这个群体还衍生出许多初创企业和其他人脉网络。能见证这些多元视角非常棒——通常硬件研究者、像我团队这样的并行算法开发者、以及算法设计师很难共处一室交流。
So, for example, two of the other professors in DAWN, Kunle Olukotun and Chris Ré, are involved in SambaNova, which is a hardware company. And then Chris's lab also started Snorkel, which is a data-oriented ML company. And, you know, there are actually a whole bunch of, like, other startups or other connections with different people from that group. So it was really nice for seeing these different perspectives. You don't often have people who work on hardware, and people who work on parallelizing algorithms, like my group, and then people who, like, actually design their algorithms, in the same room.
说到令人兴奋的事,如今已有不少从DAWN孵化的大型企业,包括Snorkel和SambaNova。还有许多开源项目获得实际应用,比如我们研发的检索模型ColBERT被DSP采用后,现已成为基于DNN信息检索的标准方案之一。
I think, to me, just trying to think of what's exciting, I mean, there are a number of, actually, at this point, like, pretty large companies that were founded out of DAWN, including Snorkel and SambaNova. And I think there are also a whole bunch of, like, open source projects that have reached, like, some amount of use, or models that have reached use in industry. Actually, ColBERT, which is the retrieval model we worked on that is used in DSP, was one of those. Like, we put out this model architecture, and now it's becoming one of the kind of standard things people consider for DNN-based information retrieval.
确实。正如你提到的Snorkel等从斯坦福诞生的杰出企业,感觉这延续了AMP实验室和伯克利前辈们的传统。你们和伯克利之间存在竞争关系吗?
Yeah. There have been some incredible companies, as you're mentioning, Snorkel and a bunch of other ones that have come out of Stanford. And I feel like that's, you know, kind of following in the footsteps of the AMP lab and what the folks at Berkeley did. Is there any rivalry
我的斯坦福同事会说我们一直也在做同样的事。不过...
My Stanford colleagues would say Stanford was also doing that all along. But
这倒是事实。
That's true.
我很欣赏他们的大实验室模式。
I like the big lab model there. Yeah.
我好奇斯坦福和伯克利在研究领域是否存在 rivalry?你从伯克利来到斯坦福任教,会有人对此不满吗?
What I was curious about is, is there a rivalry between Stanford and Cal for research? Like, you were at Cal, and then you became a professor at Stanford. Are people pissed about that?
学术界通常希望博士生去不同机构交流思想,避免学术近亲繁殖。两校间有很多合作,但有趣的是历史上斯坦福被视为学术创业的摇篮,而现在我感觉伯克利师生对创业的热情反而更强烈。
I think people want their PhD students to go somewhere else and, like, you know, to get to know the other folks and to get other ideas and so on. So, like, they don't want sort of, you know, inbreeding of ideas in the same place. So I think it's okay. A lot of people collaborate across the two, but it's definitely interesting, because I think historically, Stanford was seen as, like, the main place for startups out of academia. But now I see more people at Berkeley, honestly, interested in starting companies than here.
部分原因是Databricks等成功案例的示范效应,加上校友的持续推动——比如去年几位Databricks联合创始人和教授就专门为博士生开设了创业课程。
And partly it's because there were a few large ones that started out of that, like Databricks, and then there were alums that were really passionate about it. So, like, last year, for example, a few of the Databricks cofounders and other faculty there ran a course on basically, like, you know, doing a startup, a course for PhD
酷。
Cool.
你知道的,就像研究生那样。我记得这门课大概有100人左右,所有人都在思考创业,这在这里我没见过。这里没有关于创业的计算机科学课程。你可以去其他院系选课,但计算机科学系应该没有这类课程。所以
You know, like, graduate students. And I think there were, like, 100 people in the course or something, like, all thinking about it, which I have not seen here. There's no CS course here on doing a startup. You can go to other departments, but I don't think there's a CS course. So
是啊,完全同意。另外有件趣事想问你。我们之前向Reynold Xin咨询另一件事时,他提到有段时间你花了很多精力帮助一家初创公司将Spark集成到他们的技术栈里,那还是Spark早期阶段。据说你忙活了好几周后,他们只给了你一张50美元的亚马逊礼品卡。
Yeah. Totally. One other fun thing I wanted to ask you about. Reynold Xin, when we were asking him for a different piece, he said that there was a time when you had done a ton of work helping a startup integrate Spark, this was in the early days of Spark, into their stack. And I guess after working on it for weeks, they gave you, like, a $50 Amazon gift card.
这是怎么回事?
What what's the story there?
是的,早期我确实帮助过几家初创公司,因为看到人们使用我们开发的东西让我特别兴奋。要知道,当软件还处于早期阶段时,对他们来说也存在风险。比如可能根本跑不起来。
Yeah. There were a few that I helped early on, because, you know, I was really excited to see people using the stuff we were building. And, you know, when you have a piece of software that's early on, there's also risk for them. Right? Like, it might not work.
他们可能浪费数周时间在无法运行的东西上,而且这毕竟出自大学项目,可能充满漏洞。说不定我转头就去搞别的项目,不再维护Spark了。所以确实存在风险。但通过与他们合作,我也学到了很多,类似的情况有好几次。
They might have wasted weeks on something that doesn't work. And, you know, it's out of a university, it might be buggy. Like, you know, maybe I go and do a different thing next and, like, stop working on Spark. So there's some risk. But I was learning a lot by working with them, and there were a few of these.
那段经历真的很棒。那时候湾区还有更多线下聚会之类的活动,现在虽然开始恢复,但氛围还是不太一样。能参与那些小型社区、结识同好特别有意思。
It was really cool. You know? This was also, like, back then, I feel there were more in-person, you know, meetups and things like that in the Bay Area. I think now they're starting to come back, but a lot of stuff is not exactly at the same level. So it was cool to have these, like, small communities and, you know, meet people there.
我们曾在大数据和机器学习聚会上宣讲Spark,后来还定期举办Spark专场聚会。有些公司会主动申请主办,因为他们想招募懂Spark的人才。整个过程充满乐趣。当然还有各种其他故事。
We gave talks about Spark at some of the big data and machine learning meetups. And then later, we actually started our own regular Spark meetup, and, like, companies would ask to host it and stuff because they wanted to recruit people who know Spark or who work on it or whatever. So it was fun. Yeah. But, yeah, there were all kinds of things.
说实话,我也蹭过不少免费餐食。我其实没索取什么回报,能获得大家的时间、看到酷炫的应用,还能向他人讲述这些经历就足够了。
I mean, I also got, like, free food and stuff. I didn't really ask for much. It was just good to get people's time and to see the cool apps and, honestly, to, you know, be able to talk about them to others. One
听你讲述Databricks创立故事时,有件事让我印象深刻。你们早期做出的重大决策就是完全押注云端方案,虽然当时...
thing I was struck by, you know, when you were talking about your Databricks founding story is, obviously, a big decision you guys made early on was deciding you were going to go cloud only. And I'm sure at the time
是的。
Yeah.
这感觉像是个重大风险。显然,它带来了巨大回报。但你是如何做出这个决定,并逐渐适应选择这条路线的?
That felt like a big risk. Obviously, it paid enormous dividends. But how do you make that decision and kinda get comfortable deciding this is gonna be the way forward?
是的。我认为有几个因素。首先从战略层面看,即便是短期战略——比如初创公司开发这类产品时,瞄准新工作负载或迁移中的工作负载,比试图取代企业里已深度整合的现有系统更可能成功。云服务在这方面很有优势。此外,云能提供快速反馈循环,减少软件版本碎片化。
Yeah. I I think there were a few things. So first of all, just the strategy, like, even the short term strategy of, like, you're starting a company, you're more likely to do well, I think, with this type of product if you go after new workloads or, like, workloads that are being migrated than if you go and try to displace an established thing that's, like, tied into, you know, a million other processes in that enterprise. And so the cloud was good for that. In addition, the cloud allows you to have a very fast feedback loop and less divergence of, like, the software out there.
这样你就不需要维护10个软件版本,因为老客户拒绝升级——这是传统软件包的问题。在云环境中,你只需维护当前部署版本和正在更新的版本,最多同时存在两个版本。所以我们认为:快速反馈循环对产品开发至关重要,加上可以瞄准全新应用场景或迁移项目的机会。
So you don't have to support, like, 10 versions of the software because your old customers are refusing to upgrade or whatever, which was the issue for, like, packaged software. In the cloud, you just have the version that's deployed now, and maybe there's a second version that's, like, rolling out when you do an update. So you only really need, like, two versions at a time. So we thought, okay, fast feedback loop, which is needed to develop a great product, plus this opportunity to go after, you know, greenfield applications or migrations.
我们并没有说服企业迁移上云——那是亚马逊或微软的工作。但我们可以说:既然你要整体迁移,何不试试用我们的方案处理这个模块?
Like, we didn't convince that company to move to the cloud. You know, Amazon did, or Microsoft did. But we'd say, look, if you're gonna forklift-move everything, why don't you try this thing here for this piece? Right?
所以我们算是借势而为。长期来看,我们起步顺利,发展态势足够好,于是决定全力专注这个方向,而非其他领域。
So we kinda piggybacked on that. Yeah. Long term, though, you know, we started with it, and it was going well enough that we said, okay, we're just gonna focus entirely on this instead of going somewhere else.
理论上我们也可以先开发优秀产品,再通过其他模式销售。但最终没这么做。长期来看,SaaS化是大势所趋——对客户意味着更少运维负担,对供应商则能获得快速反馈。
In an alternate world, we could have used that to develop a really nice product and then sold that product somewhere else, in another mode. Yeah. But we decided not to do that. But, yeah, in the long term, I think so much stuff is moving to, you know, SaaS services. And both for customers and for providers, like, it's a good arrangement, because customers just fundamentally have less stuff to worry about and to administer.
如果你推出的新产品不是SaaS模式(比如开源下载版或Docker容器),最大的竞争对手将会是提供类似功能的托管服务。
And for providers, you get the very fast feedback. So it's gonna be a major force. Like, I think if you are launching a new, you know, software product that is not a SaaS thing (say it's just open source bits someone downloads, or, you know, a thing you install as a Docker container or something), the big competitor is gonna be a managed service that does a similar thing. So
我们访谈最后总有个快问快答环节。首先很好奇:在AI领域热议话题中,你认为哪项被过度炒作,哪项关注不足?
Well, we always like to end our interview with a a quick fire round. So we ask you a few quick questions, get your thoughts. And maybe to start, would just be curious. Obviously, there's a lot of things that are talked about on AI, Twitter, in the broader ecosystem. What's one thing you think's overhyped and one thing that's underhyped?
我不确定是否要指认过度炒作的事物。但如果必须说,那些演示时能运行一次但不可靠的方案存在风险——很多领域做出50%-70%可用的东西很容易,但要完全可靠却非常困难。
Yeah. I don't know if I wanna say anything is overhyped. I mean, if I am to say one thing, I would say, like, sort of demos that work, like, once, but where the thing, you know, doesn't reliably happen, are risky, depending on what you're trying to do, right? So, like, in many areas, it's, like, easy to make something that works, you know, 50, 60, 70% of the time, but it's, like, really hard to close that gap. So you just have to be careful about it.
但另一方面,演示表明它是有潜力的,如果再进行额外的工程开发等工作,或许能达到100%。但你知道,我听到人们质疑不同技术是否存在所谓的自动驾驶汽车问题——即做演示容易,但让汽车真正运行起来很难。即便到今天这依然困难,所以必须警惕这点。而我认为被低估的是数据和知识整合,甚至是模型与其他事物间的实时交互。随着搜索引擎等应用开始使用大语言模型,这方面正变得愈发重要。但如何通过输入获得优质结果,我认为仍需大量探索。
But on the other hand, the demos show that there's potential, that if you then do the additional engineering and stuff, you might get it to 100%. But, you know, I've heard people asking whether, like, different, you know, technologies have what they call the self-driving-car problem, where, like, it's easy to make a demo, but it's hard to make the car, like, actually work. It's still hard to this day, so you've got to worry about that. And underhyped, I think, like, data and incorporating knowledge, or incorporating even, like, real-time interactions between your model and something else. It's definitely becoming more relevant with things like the search engines that use LLMs and stuff, but, like, really thinking about what you feed in to get a great result, I think, because there's a lot left to do there.
太棒了。好的,下一个快问快答环节。在创办Databricks时,你希望当时就知道的一件事是什么?
Awesome. Alright. Next fire round question. What's one thing you wish you knew when starting Databricks that you know now?
天啊,有太多不同的事情了。确实。我觉得商业中有无数领域是我当初不了解的,比如销售、市场营销、产品管理等所有这些方面,我们都是在过程中学习的。
Oh, man. There are many, many different things. Yeah. I think there are just so many areas of business that, like, I didn't know about, that, you know, we learned about, like, including, you know, sales, marketing, product management, all these things.
另外就是,基本上别慌。沿途会遇到各种挑战,但只要拥有优秀团队和稳固的战略定位,最终会顺利度过。
You know, I mean, the other thing is just, you know, don't panic, basically. Like, you know, there'll be things along the way that are challenging. But if you have a great team and, you know, a solid sort of, like, strategic position, you'll do well. So.
精彩。这真是场引人入胜的广泛对话。感觉我们还能展开无数话题,但知道你日程繁忙。就不多占用你日常工作时间了。想必大家还想了解更多关于Databricks和你在斯坦福的工作。
Awesome. Well, I mean, just a fascinating, wide-ranging conversation. I feel like there are so many different threads we could pull on, but I know you're a busy person. So we'll leave you to get back to your day-to-day. I guess, you know, I'm sure folks wanna learn more about, you know, Databricks, the work you're doing at Stanford.
他们深入了解的最佳途径是什么?
What's kind of the best way for them to to dig in further?
是的。我觉得只要在Twitter或LinkedIn上关注我,就能看到很多我认为有意思的内容。
Yeah. I think if you just follow me on Twitter or LinkedIn, you know, something like that, you'll see a lot of the things that I think are cool.
太棒了。Mate,非常感谢你的参与。这次对话非常精彩,极其有趣。衷心感谢。
Awesome. Well, Matei, thank you so, so much for joining us. This was great. Super interesting. Thanks so much.
关于 Bayt 播客
Bayt 提供中文+原文双语音频和字幕,帮助你打破语言障碍,轻松听懂全球优质播客。