The Future of Voice AI: Agents, Dubbing, and Real-Time Translation with ElevenLabs Co-Founder Mati Staniszewski

Episode description

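Imagine learning chess from a grandmaster, or asking one of the FBI's top hostage negotiators for strategy. ElevenLabs' voice AI technology is making that possible. In this episode, Sarah Guo talks with ElevenLabs co-founder Mati Staniszewski about how the three-year-old company is reshaping human-computer interaction through voice. Mati shares the technical challenges of building foundational audio models, the strategic thinking behind advancing research and product in parallel, and why voice will become the ultimate interface for computers, robots, and immersive media. They also discuss the coming revolution in personal AI tutors: how agentic AI can move from reactive response to proactive support, break down global language barriers, and even provide the underlying framework for agent-based government services.

Subscribe for new episodes every week. Send feedback to show@no-priors.com. Follow on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @elevenlabsio | @matiii

Chapters:
00:00 – Mati Staniszewski introduction
00:46 – ElevenLabs' growth and scaling
02:46 – Voice technology and use cases
06:52 – Research and product development
12:36 – Voice quality and customer preferences
17:54 – Agent platform and use cases
23:21 – Choosing the right technology partner
26:43 – The central role of foundation models
29:58 – Open-source models and future trends
32:37 – Research and development priorities
36:53 – AI companions and the future of education
41:37 – Closing remarks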

Bilingual subtitles


Speaker 0

嗨，听众朋友们。

Hi, listeners.

Speaker 0

欢迎回到《无先验》节目。

Welcome back to No Priors.

Speaker 0

今天，我邀请到了ElevenLabs的联合创始人兼CEO马蒂·斯坦尼舍夫斯基，这家公司成立的初衷是通过语音技术改变人与人之间以及人机之间的交互方式。

Today, I'm here with Mati Staniszewski, the co-founder and CEO of ElevenLabs, which was founded to change the way we interact with each other and with computers with voice.

Speaker 0

短短三年内，他们的年化营收已飙升至超过3亿美元。

Over three short years, they've skyrocketed to more than $300 million in run rate.

Speaker 0

马蒂和我讨论了语音技术的未来、教育、客户体验以及语音的其他应用，以及如何打造一家横跨自助服务与企业服务多个细分市场、同时兼顾研究与产品的公司。

Mati and I talk about the future of voice, education, customer experience, and the other applications of voice, as well as how to build a multi-segment company, from self-serve to enterprise, that combines research and product.

Speaker 0

欢迎你，马蒂。

Welcome, Mati.

Speaker 1

萨拉，谢谢邀请我。

Sarah, thanks for having me.

Speaker 1

也谢谢你

And thank you for

Speaker 0

一大清早七点钟就来做这个。

doing this at seven in the morning.

Speaker 1

我们很荣幸。

Our pleasure.

Speaker 1

谢谢你早上七点就来做这个。

Thank you for doing that at seven in the morning.

Speaker 1

很高兴我们终于能一起做这件事了。

It's great we got to finally do this together.

Speaker 0

我想很多听众都曾经用过或体验过Eleven的产品，不过对于其他人，你能再介绍一下公司吗？

I think a lot of our listeners will have used or played with Eleven at some point, but for everybody else, can you just reintroduce the company?

Speaker 1

当然可以。

Definitely.

Speaker 1

在ElevenLabs，我们致力于解决人类与技术如何互动的问题，让你能无缝地通过技术进行创作。

At ElevenLabs, we are solving how humans and technology interact, and how you can create seamlessly with that technology.

Speaker 1

具体来说，我们构建基础音频模型，这些模型能帮助你生成拟人化的语音，更准确地理解语音，或协调所有这些组件实现交互功能，然后在这些基础模型之上构建产品。

What this means in practice is we build foundational audio models, so models in a space to help you create speech that sounds human, understand speech in a much better way, or orchestrate all those components to make it interactive, and then build products on top of that, foundational models.

Speaker 1

我们还有创意产品平台，可协助制作旁白、有声书、配音、广告或电影，以及这些电影的其他语言配音；以及我们的智能代理平台产品，本质上是帮助提升客户体验的服务，可构建用于个人AI、教育以及新型沉浸式媒体的代理。

And we have our creative product, which is a platform for helping you with narrations, for audiobooks, for voiceovers, for ads or movies, or dubs of those movies into other languages, and our agents platform product, which is effectively an offering to help you elevate customer experience and build agents for personal AI, education, and new kinds of immersive media.

Speaker 1

但所有这些都贯穿着同一个使命：探索如何以更好的方式按我们的条件与技术互动。

But all is kind of underlined with that mission of solving how we can interact with technology on our terms in a better way.

Speaker 0

你们公司是2022年成立的？

You started the company in 2022?

Speaker 1

没错。

That's right.

Speaker 0

从那以后你们就经历了惊人的火箭式增长。

And you've had amazing, like, rocket ship growth since then.

Speaker 0

我确信这一路有起有落。

I'm sure it's felt up and down in different ways.

Speaker 0

我想就此请教你。

I wanna ask you about that.

Speaker 0

能谈谈公司目前的大致规模吗？

Can you give a sense of what the scale of the company is today?

Speaker 1

目前我们在全球已有350名员工。

So we've grown to 350 people globally.

Speaker 1

我们最初从欧洲起步。

We started from Europe.

Speaker 1

我们以远程办公模式起家，至今仍以远程优先，但在全球设有多个枢纽站点——伦敦规模最大，纽约次之，华沙、旧金山、东京各有一处，巴西也设有一个。

We started as a remote company and are still remote first but have hubs around the world with London being the biggest, New York being the second biggest, Warsaw, San Francisco, and now Tokyo, one in Brazil.

Speaker 1

我们的年度经常性收入已达3亿美元，大致一半一半：一半来自自助服务，即大量订阅用户和创作者使用我们的创意平台；企业端使用我们的代理平台，占比正接近50%，这部分走的是典型的销售驱动模式。在创意业务方面，我们的月活跃用户超过500万。而在企业端，我们拥有数千家客户，涵盖财富500强企业和一些增长最快的AI初创公司。

We are at $300 million in ARR, which is roughly fifty-fifty: about half is self-serve, so a lot of subscriptions and creators using our creative platform, and the enterprise side, using our agents platform, is approaching 50%, and that's the classic sales-led side. We serve more than 5 million monthly actives on the creative side of the work, and on the enterprise side, we have a few thousand customers, from Fortune 500s to some of the fastest-growing AI startups.

Speaker 0

我认为你是一位了不起的创始人，但我也觉得这家公司非常有趣，因为对许多人——尤其是投资者来说，它的模式相当反直觉。

I think this is such a you're an amazing founder, but I also think this is such an interesting company because it is very unintuitive to, I think, many people and investors in particular.

Speaker 0

不知道你初期是否遇到过这种情况，但我记得2022年就在现场见证。

I don't know if you faced this at the beginning, but I remember I was there in 2022.

Speaker 0

当我们审视你们除研究之外的首个业务时，会发现存在一类支持某种形式创作的公司。

There's a there's a class of companies that allow creation in some way when we look at your, like, first business beyond the research itself.

Speaker 0

我会将ElevenLabs、Midjourney、Suno和HeyGen都归入这一类别。

And I would put Eleven and Midjourney and Suno and HeyGen in this category.

Speaker 0

而且我觉得整体上存在这样一种疑问，真的有人想做这个吗？

And I think there's, like, this overall sense of, like, who really wants to do this?

Speaker 0

你最初对有多少人想制作声音的判断是怎样的？或者是什么让你相信这会比配音市场大得多？比如配音市场本身并不大。

What was your initial read of, like, how many people want to make voices, or what made you believe that was gonna be much broader than like, if I look at dubbing, for example, it's not a huge market.

Speaker 0

我认为

I think

Speaker 1

第一点是，正如你提到的，同时做产品和研究非常具有挑战性。

first piece was which is, as you mentioned, there is, like, a very it's very tricky to do both the product and the research.

Speaker 1

我很幸运，我的联合创始人和我相识已有十五年。

I'm in the lucky position that my co-founder and I have known each other for fifteen years.

Speaker 1

我认为他是我认识的最聪明的人，能够完成大量研究工作来奠定基础，从而提升用户体验。但我们俩都来自波兰，最初的信念也源自波兰。

I think he's the smartest person I know and has been able to create a lot of that research work to be able to create that foundation to then elevate that experience, but both of us are from Poland originally, and the original belief came from Poland.

Speaker 1

这是非常独特的现象——如果你看波兰语配音的外国电影，所有角色不论男女都由同一个声音演员配音，整部电影都是单调的旁白式演绎。

It's a very peculiar thing, but if you watch a movie in Polish language, a foreign movie in Polish language, all the voices, whether it's a male voice or a female voice, are narrated with one single character, so you have like a flat delivery for everything in the movie.

Speaker 0

这种体验太糟糕了。

That's a terrible experience.

Speaker 1

这种体验糟糕透了，而且即便你长大后，一旦学会英语就会立刻切换，再也不想以这种方式观看内容。疯狂的是，如今大部分内容仍以这种方式呈现。

It is a terrible experience, and if you grow up with it, as soon as you learn English, you switch over and you don't want to watch content this way. It's crazy that it still happens today for the majority of content.

Speaker 1

结合这一点，加上我在Palantir工作过，我的联合创始人在谷歌工作过，我们知道未来这种情况会改变，所有信息都将全球化。

Combining that, and I worked at Palantir, and my cofounder worked at Google, we knew that that will change in the future and that all the information will be available globally.

Speaker 1

随着我们深入挖掘，我们意识到，

And then as we started digging further, we realized So,

Speaker 0

以高质量的方式覆盖每一种语言。

in every language in a high quality way.

Speaker 1

那种情况

That was kind

Speaker 0

还有

of And the

Speaker 1

关键点在于，与其仅仅进行翻译，能否保留原声、原始情感和语调的传递？

the big thing was like, instead of having it just translated, could you have the original voice, original emotions, original intonation carried across?

Speaker 0

嗯

Mhmm.

Speaker 1

所以，想象一下这个播客，但人们可以切换到西班牙语，他们仍然能听到萨拉的声音。

So, like, imagine having this podcast, but say people could switch it over to Spanish, and they still hear Sarah.

Speaker 1

他们仍然能听到马蒂的声音，同样的声音，同样的表达方式，这正是我们之前与莱克斯采访纳伦德拉·莫迪时所实现的，让你能更好地沉浸在那个故事中。

They still hear Mati, the same voice, the same delivery, which is kind of exactly what we did with Lex back when he interviewed Narendra Modi, and you could kind of immerse yourself in that story a lot better.

Speaker 1

这就是最初的洞见，我们随后进一步深入挖掘，发现我们日常接触的很多技术都将发生变革。

So that was the original kind of insight, we then started digging further, which is that just so much of the technology we interact with will change.

Speaker 1

无论是创作方式，要让声音生动起来仍然相对困难。

Whether this is how you create, it's still relatively tricky to bring voice alive.

Speaker 1

需要经历昂贵的过程：雇佣配音演员、准备录音棚空间、使用高价工具来实际调整声音。

You need to go through the expensive process of hiring voice talent, having a studio space, and having expensive tooling to then actually adjust it.

Speaker 1

这些工具操作不直观，难以实现目标，因此整个创作流程必须改变，让有热情的新人更容易实现创意。而且过去的技术无法重现特定声音或达到高质量效果。随着我们深入研究并转向动态内容，整个交互系统的运作方式仍然令人难以置信——尽管我们见证了近几十年的技术演进，但人们依然大部分时间花在键盘上，盯着屏幕，这种交互方式显得过时。

The tooling isn't intuitive to be able to do this, so all that creation process will and should change to make it easier for new people with keenness to bring that to life, then a lot of the technology wasn't possible for you to be able to recreate a specific voice or be able to create that in that high quality way, and then, of course, as we dived into further and shifted away from the static piece, the whole interactive piece is still crazy in the way it functions where most of us have seen this technological evolution over the last decades, but you still will spend most of your time on the keyboard, you will look at the screen, and and that interface feels broken.

Speaker 1

理想的交互应该是通过语音——这种最自然的界面与设备沟通，这种交流方式自人类起源时就存在。我们意识到这正是我们想要解决的问题。

It should be where you can communicate with the devices through through speech, through the most natural interface there is, one that kind of started when the humanity humanity started, and and we realized we want to we want to solve that.

Speaker 1

而从2022年快进到今天，我觉得很多人也会持有同样的信念：想想我们身边的设备，无论是智能手机、电脑还是机器人，语音都是未来的交互界面。

And I think now, fast forward from 2022, I feel like many people will carry that belief too that voice is the interface of the future as you think about the devices around us, whether it's smartphones, whether it's computers, whether it's robots.

Speaker 1

语音将成为关键要素之一，但2022年时还未如此。当我们审视创意领域或交互领域的市场前景时，已能清晰预见其将发展成极其庞大的市场。

Speech will be one of the key ones, but I think 2022 it wasn't, and as we think about the market for the creative side or for that interactive side, it was very clear it will be a huge, a huge, huge one.

Speaker 0

单就你业务中的研究板块而言，你们已为至少两个不同市场开发产品，同时还肩负着更宏大的使命——过去五到十年间确实发生了翻天覆地的变化。

So even when you think about just the research part of your business, and then you have products for at least two different markets, and then you have this larger mission, a lot has changed in the last five or ten years.

Speaker 0

但过去创业圈曾有个根深蒂固的传统观念：初创公司必须专注做好一件事，除此之外别无他途。

But it used to be a very strongly held traditional belief of like, one must do one thing well in a startup, and there's no other path.

Speaker 0

你们现在这种运营模式，更像是在打造一家交互公司、平台型企业。

Like, you're treating this like an interaction company, a platform company.

Speaker 0

你们是如何考虑研究工作和产品开发的先后顺序的？

How did you think about sequencing, like, the research and the product effort?

Speaker 0

这样安排合理吗？

Does that make sense?

Speaker 0

或者，对于新市场的考量是怎样的？

Or, like, thinking about new markets?

Speaker 0

或许这个问题还包含另一个层面：目前语音技术的质量处于什么水平？

And maybe wrapped up in that question too is just like, well, where are we in quality on voice as well?

Speaker 0

因为如果我要断言，模型在某些用例上根本不够好，那这某种程度上就没有意义。

Because if if I I would sort of claim, like, if the models are not good enough for certain use cases at all, like, it kinda doesn't make sense.

Speaker 0

去做产品？

To do product?

Speaker 1

我认为这是对的。

And I think that's right.

Speaker 1

这几乎和我们最初起步时的情况一模一样：我们先尝试使用市场上现有的模型，并针对我们的第一个用例进行优化，也就是创意方向上旁白与配音的结合。我们很快意识到，现有模型生成的语音非常机械、质量不佳，人们根本不愿意听。这时我联合创始人的才华派上了用场，他组建了团队，并亲自完成了大量研究，最终创造出一种全新的实现方式。

It's almost exactly like when we started originally. What we did was try to actually use existing models that were in the market and optimize them for our first use case, which was a combination of narration and dubbing on the creative side. We realized pretty quickly that the models that existed just produced such robotic, poor speech that people didn't want to listen to it, and that's where my co-founder's genius came in: he was able to assemble the team and do a lot of the research himself to actually create a new way of doing that work.

Speaker 1

但回到你的问题，我认为我们内部的组织方式和对任务排序的思考，很大程度上是围绕第一个问题展开的，然后围绕这个问题有效地建立一个实验室，这就像是集结优秀的研究人员、工程师和运营人员共同攻克这个问题的组合。

But, like, to your question, I think that the way we are kind of organized internally and how we think about sequencing a lot of that was looking at the first problem and then creating effectively a lab around that problem, which is like a combination of mighty researchers, engineers, operators to go after that problem.

Speaker 1

而第一个问题就是语音问题，即我们如何能重现语音？

And the first problem was the problem of voice, so how can we recreate the voice?

Speaker 1

正如你所说，这需要具备相应的研究专长才能做好。

And like you say, it needs to have that research expertise to be able to do that well.

Speaker 1

所以我们最初实质上是建立了一个语音实验室，其使命就是：我们能否以更好的方式叙述作品？

So we started with effectively a voice lab, which was that mission of can we narrate the work in a better way?

Speaker 1

当时大约有五个人共同开展这项工作，我们首先进行研发排序，然后在这项工作基础上构建一个简单层，让人们能够使用该成果，再通过一套完整的解决方案逐步扩展，最终实现完整的音频书制作和完整的电影旁白、电影配音。

There was a combination of roughly five people that were doing that work, and we sequenced the research first, then built a simple layer on top of that work to allow people to use it, and then kind of expanded from there with a holistic suite for creating a full audiobook and then creating a full movie narration, a movie dub.

Speaker 1

接着我们转向下一个问题，意识到我们已经解决了语音问题，这对于让内容听起来更人性化是极好的。

And then we move to the next problem, which is the realization that, okay, we have solved the voice, great for making content sound human.

Speaker 0

嗯。

Mhmm.

Speaker 1

第一个问题解决了。但要让它在我们与技术交互时真正有用，还需要解决如何按需引入知识的问题。

That was the first problem. For that to be useful for us to interact with the technology, you need to solve how you bring knowledge on demand into it.

Speaker 1

于是我们实际上启动了第二个团队，也就是第二个实验室——本质上是一个代理实验室，这个团队再次集结了研究人员、工程师和操作人员，他们致力于解决：既然我们已经有了文本转语音技术。

So we effectively started then the second team, which was a second lab, an agent lab effectively, which was a team that would combine researchers, engineers, and operators once more, which would try to fix, okay, we have text to speech.

Speaker 1

那么现在如何把它与LLM和语音转文本技术相结合，协调所有这些组件，同时与其他系统集成，让整个过程更简单。

How do we now combine this with LLMs and speech to text and orchestrate all those components together, while integrating that with other systems to make it easier?

Speaker 1

同样地，你会从仅关注语音层扩展到研究这些系统如何协同工作，这里同样需要研究专业知识来实现低延迟、高效且准确的方式，但与此同时，产品层也开始形成——重要的不仅仅是协调。

And then similarly, you kind of expand from looking just at the voice layer into how those systems work together, and here too, you need the research expertise to do that in a low latency way, efficient way, accurate way, but at the same time, there's that product layer that starts forming that it's not only the orchestration that matters.

Speaker 1

还包括如何与遗留系统对接的集成方式，如何围绕其构建功能，或者如何在生产环境中部署、测试、监控和持续评估。

It's also the integrations of how you link up to the legacy systems, how you build functions around it, or how you deploy that in production and test, monitor, evaluate over time.
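
Editor's sketch: the orchestration Mati describes here (speech to text, then an LLM, then text to speech, with latency tracked per stage) could look roughly like the following. This is a minimal illustration under stated assumptions, not ElevenLabs' implementation: `run_turn`, `TurnResult`, and the three `fake_*` stand-ins are hypothetical names invented for this example, and a real system would stream audio rather than process one complete turn at a time.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class TurnResult:
    """Everything produced by one conversational turn (hypothetical type)."""
    transcript: str
    reply_text: str
    reply_audio: bytes
    latency_ms: Dict[str, float] = field(default_factory=dict)


def timed(stage: str, latency_ms: Dict[str, float], fn, arg):
    """Run one stage and record how long it took, in milliseconds."""
    start = time.perf_counter()
    out = fn(arg)
    latency_ms[stage] = (time.perf_counter() - start) * 1000.0
    return out


def run_turn(
    audio_in: bytes,
    speech_to_text: Callable[[bytes], str],
    language_model: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> TurnResult:
    """Chain the three components for a single user utterance."""
    latency_ms: Dict[str, float] = {}
    transcript = timed("stt", latency_ms, speech_to_text, audio_in)
    reply_text = timed("llm", latency_ms, language_model, transcript)
    reply_audio = timed("tts", latency_ms, text_to_speech, reply_text)
    return TurnResult(transcript, reply_text, reply_audio, latency_ms)


# Stand-in components so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_turn(b"where is my package", fake_stt, fake_llm, fake_tts)
print(result.reply_text)
print(sorted(result.latency_ms))
```

Passing the components in as plain callables keeps the orchestration layer separate from any particular vendor, which is where the integrations, testing, and monitoring mentioned above would attach.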

Speaker 0

你觉得你们是在创造新的应用场景吗？

Do you feel like you were creating new use cases?

Speaker 0

当你们开发这些工具时，人们是否已经知道自己需要这些功能？

When you built the tools, do people know that they wanted to do this already?

Speaker 0

因为我记得听过一个论点说，企业其实不知道该怎么运用语音技术

Because one argument, like, that I remember hearing was like, ah, like, you know, enterprises don't know what to do with voice.

Speaker 0

有多少人是真正需要它的？

How many people really want to do it?

Speaker 0

那么你们服务的本质上是创作者和出版商这类群体对吧

And then you're serving essentially like perhaps the creator publisher side of Right.

Speaker 0

你们的

Your

Speaker 1

这绝对是两种因素的结合：一方面是我们相信未来会出现的需求，另一方面是对大量需求的响应

It's definitely a combination of like initiatives that we believe will happen in the world, and then like response to a lot of that.

Speaker 1

就像我回想起来，当然，最初成立的内部语音实验室和智能体实验室后来催生了许多其他实验室来应对衍生问题

So as I think back, you know, of course, the internal voice lab and agents lab then kind of kick-started so many of the other labs in response to problems.

Speaker 1

我们成立了一个音乐实验室，因为人们希望用ElevenLabs来创作音乐。这是一个完全获得授权的模型；用户既想使用和生成语音，又希望能简单地添加音乐。

We started a music lab because people wanted to create music with ElevenLabs, so it's a fully licensed model: people wanted to use and create speech, but they also wanted to add music in a simple way.

Speaker 1

我们想要实现这个目标。

We wanted to deliver that.

Speaker 1

当然，这一切的融合点在于：我们如何将音乐、音频和声音元素有机结合。

And then, of course, that kind of came together through how do we combine music, audio, sounds.

Speaker 1

我们现在正将图像和视频领域的合作伙伴模型整合进这个套件，探索如何将所有功能集于一体——这些创新都是对市场需求的直接回应，用户一直在向我们表达'我们太需要这个了'的呼声。

We are now integrating partner models from image and video into that suite, so you can combine all of that in one, and all of that was in response to the market telling us, hey, we would love this.

Speaker 1

然后你会发现完全不同的应用场景，即便是在同一个领域，比如配音。

And then you will have completely different use cases, even in that space, let's say dubbing.

Speaker 1

配音是我们并未感受到强烈需求的一个用例，但我们知道在理想的未来世界里，你将能够自然地获得跨语言的内容传递，同时保持原汁原味。我依然认为这个市场潜力巨大，因为它不仅限于电影中的静态传递，如果你环游世界时想要实时实现——就像《银河系漫游指南》里的巴别鱼构想——这将成为现实。

Dubbing is a use case where we didn't feel there was a big push for it, but we knew that in the ideal world in the future, you will be able to have that content delivered naturally across languages, still carrying the original delivery. And I still think this market will actually be immense, because it's not going to be only the static delivery in movies: if you travel around the world and want this in real time, like the full Babel fish idea from The Hitchhiker's Guide to the Galaxy, this will happen.

Speaker 1

这将是最重大的突破，彻底消除语言障碍、沟通障碍和创作障碍，所有壁垒都将被打破，这就是实时配音的基础理念。

It will be, like, the biggest thing: breaking down language barriers, the barriers to communication, to creation, all of that will break, and the foundation for that will be the real-time dubbing concept.

Speaker 1

所以我对这一部分感到非常兴奋。

So super excited about that part.

Speaker 1

同样地，在代理端，有些显而易见的需求，比如我们合作的客户或合作伙伴会希望与XYZ系统集成，但也有一些不那么容易预测的部分。当你与技术互动时，自然希望了解发生了什么，但同时也想理解表达方式，并将其纳入默认设置——这正是我们团队努力优先实现的目标。这样当人们真正与技术互动时，他们会意识到：富有表现力的交互体验原来如此愉悦、有益且实用。

And similarly, on the agent side, there are some obvious things that, of course, customers or partners that we work with will want, like, we want integrations with XYZ systems. But then there are other parts that might not be as easy to predict: as you interact with technology, you of course want it to understand what's being said, but you also want it to understand how things are being said, and to bring that in by default. That's something we try to prioritize on our side, so that when people actually interact with the technology, they realize, oh, the expressive thing is actually so much more enjoyable and beneficial and helpful.

Speaker 0

那么，我想就此提一个关于质量的问题。

So, I wanna ask you a question about this, which relates to quality.

Speaker 0

你知道，我合作的多家企业都在向采购方销售产品。

You know, I work with a series of companies where we're selling a product to the buyers.

Speaker 0

他们通常都不是机器学习领域的科学家。

They're generally not machine learning scientists.

Speaker 1

对吧？

Right?

Speaker 1

没错。

Right.

Speaker 0

即便是科学界也缺乏完整的评估体系和基准来全面理解每个领域。

And even the scientific community does not have the full suite of evals and benchmarks to understand every domain well.

Speaker 0

这是个众所周知的问题。

It's a well known problem.

Speaker 0

但我猜想对于你们的许多客户来说，他们并不懂得如何选择好的声音。

But I imagine for a lot of your customers, it's not like they know how to choose good voice.

Speaker 0

那么你们是如何解决这个问题的？

So how do you deal with that problem?

Speaker 0

是不是说，'嘿，我做了个克隆声音，听起来像我，我就相信它'？

Is it like, hey, I make a clone, and that sounds like me and I believe in.

Speaker 0

我要尝试所有这些不同的选项，还是说你们其实在教人们做评估？

I'm gonna try all of these different options, or actually, are you teaching people to do eval?

Speaker 1

这是个很好的问题，因为我认为存在两大难题。

It's a great question because I think there are, like, two big problems.

Speaker 1

第一个是，如何在音频这个普遍领域建立基准，正如你所说，这高度依赖于特定声音，更不用说如果要训练成互动式的，那就更加棘手了。

One is, like, how do you benchmark the general space in audio where, like you say, it's, like, so dependent on the specific voice, let alone, like, if you are training it to interactive, then it's, like, even more tricky.

Speaker 1

第二个问题是，当你在处理具体用例时，该如何选择声音？

And then the second piece, which is as you are working on a specific use case, how do you select a voice?

Speaker 1

所以，我先从第二个问题说起，实际上我们为企业服务时配备了一位'声音品鉴师'。

So, I'll take the second front first, which is we have a voice sommelier effectively as we work with enterprises.

Speaker 1

我们会安排这位人员与他们合作，帮助他们进行导航。

We deploy that person to work with them and help them navigate.

Speaker 1

这位人员就像一位声音教练，自己就拥有出色的嗓音，现在其手下还有一个团队，会与您合作，帮您找到合适的品牌声音。

That person is like a voice coach, has an incredible voice themselves, and now we have, like, a team under that person that will partner with you to help you find the right branding for that voice.

Speaker 0

现在你们有了名人市场。

And now you have like the celebrity marketplace.

Speaker 0

现在你们

And now you

Speaker 1

拥有一个名人市场，甚至能帮您引入标志性人才，比如迈克尔·凯恩爵士这样的。

have a celebrity marketplace to help you even get the iconic talent in there, like Sir Michael Caine.

Speaker 1

这一点很重要，因为当然，他们的声音将取决于您试图构建的用例和语言。

That piece was important because, of course, their voice will depend on the use case that you are trying to build, the language.

Speaker 1

所有这些都会影响什么才是适合您客户群的声音，所以我们实际上有一位声音专家协助这些公司。有些公司对想要的声音非常有主见，有时会自行挑选，有时会给我们一个简要说明：我们想要一个听起来专业、中性、让人平静的声音。

All of that will have an impact on what's the right voice for your customer base, so we effectively have a voice person helping those companies. Some companies will be very opinionated on what they want, so they will sometimes select it themselves, and sometimes give us a brief of, hey, we want a voice that sounds professional, neutral, calming.

Speaker 1

我们最近遇到一家欧洲顶级公司，他们给我们的要求非常独特——想要尽可能机械化的声音。

We recently had a company, one of the biggest European companies, that gave us a brief, which is very original: they wanted as robotic a voice as possible.

Speaker 1

这有些反直觉。

Which was counterintuitive.

Speaker 1

但首先

But first of

Speaker 0

你觉得我们现在做不到那样了。

You feel like we can't do that anymore.

Speaker 1

差不多，但我们当时在尝试逆向思考，该怎么实现这个？

Almost, but we were like trying to go backwards of like, how do we do that?

Speaker 1

不过我认为我们得到了不错的结果。

But I think we got a good result.

Speaker 1

但最近我们在日本和韩国遇到一家公司，他们希望根据来电客户的不同提供不同的声音。

But recently we had a company in Japan and Korea where they wanted to serve different voices depending on the customer that's calling in.

Speaker 1

他们既有年长群体也有非常年轻的群体。

They have an older population and a much younger population.

Speaker 1

对于年轻群体，他们想要市场上那种充满活力、欢快的知名声音；而对于年长群体，则想要一种平和、语速缓慢的声音。

The younger one, they wanted one of the famous voices in the market that's very excitable and happy, and for the older one, they wanted a calm, slow speaking one.

Speaker 1

我们在语音部分提供了很多帮助，我认为这将是一个重要且关键的领域

We help a lot with that, so that's on the voice piece, and I do think it's going to be a big and important

Speaker 0

所以，就像是个性化选择，甚至可以根据客户动态调整

So, like, personalized choice, and then it can even be dynamic in a customer.

Speaker 1

是的

Yes.

Speaker 0

好的

Okay.

Speaker 1

正是如此，正是如此。

Exactly, exactly.

Speaker 1

也许在未来，根据你的互动情况，系统会完全按照你的偏好生成一个声音。比如你在晚上感到疲惫时，可能想要一个稍微不同的声音，或者也可能不需要。

And then maybe in the future, it's going to be fully, depending on your interaction, you will have a voice created as we understand the preferences of what people want, so let's say you're in the evening and you are tired and you want a slightly different Or maybe not.

Speaker 1

也许在你最专注的时候，一个充满活力的声音是最合适的；而当你早晨醒来时，可能需要一个播报新闻或天气预报的不同声音。

Maybe that's the best focus time that you have a voice that's giving that energy, and probably it's different when you wake up and gives you the morning news of what's happening or what's the weather.

Speaker 1

所以，所有这些都可以因人而异。

So, all of those could be different.

Speaker 1

昨天，我们和一些合作伙伴共进晚餐，其中一位一开口就说

Yesterday, we had a dinner with some of our our partners, and one of them the first thing they said is like, hey.

Speaker 1

我有个新需求要找你

I have a new request for you.

Speaker 1

我想要一个带长岛口音的纽约腔，我之前都不知道还有这种口音，据说这是地域特色

I want a New York voice with a Long Island accent, which I never knew was a thing, and it's supposedly a regional thing.

Speaker 1

所以我们就接下了这个需求

So so we have that.

Speaker 1

然后在第一部分，我认为这仍然是一个未解决的问题，当然在大型语言模型领域已有不错的基准测试。

And then on the first piece, I think it's still an unsolved problem. You have good benchmarks, of course, in LLMs.

Speaker 1

我认为在图像领域，它们表现得相当不错。

I think in the image space, they are pretty good.

Speaker 1

在语音领域，虽然语音质量很重要，但你是否喜欢一段语音很大程度上取决于声音本身。比如比较模型A和模型B时，即使质量差异很大，仅声音的不同就能带来完全不同的体验。

In the voice space, you have, of course, the speech quality, but so much of whether you like the speech or not depends on the voice that if you compare model A to model B and you serve them different voices, even if the quality is very different, the voice itself can just make the result so different.

Speaker 1

我们已经见证过这种情况。

We've seen this.

Speaker 1

不知道你是否了解Artificial Analysis的基准测试，我认为它们相当不错。

I don't know if you know the Artificial Analysis benchmarks; I think they're pretty good.

Speaker 1

仅仅切换声音就能带来如此大的优势。

Just switching the voice gives such a big advantage.

Speaker 0

这真有趣。

That's so interesting.

Speaker 0

是啊。

Yeah.

Speaker 0

我在想，正如你所说，这是人类历史数千年来一直存在的模式。

And I wonder if, as you said, this is a mode we've had for millennia, for all of human history.

Speaker 0

我有偏见

I'm biased

Speaker 1

而且自私，但我想是的。

and self serving, but I think so.

Speaker 0

我们只是对它非常敏感。

We're just very sensitive to it.

Speaker 0

而且我认为人们也会对自己的个性化非常敏感。

And I think people are gonna be very sensitive to their own personalization as well.

Speaker 1

百分百同意。

100%.

Speaker 1

我认为还有第三点，可能和你的问题不直接相关，但我们也意识到了。你有基准测试的问题。

I think there is also a third piece, which maybe doesn't speak directly to your note, but we've also realized it. So you have the benchmarks.

Speaker 1

比如，如何为我的受众找到合适的声音？

You have, like, how do I find the right voice for my audience?

Speaker 1

但即便是对音频数据描述的理解，整个行业仍然滞后。

But even the understanding of how you describe audio data is still lagging in the industry.

Speaker 1

就像我们最初开始时，当然也找过传统服务商帮我们标注，不仅是文字转录，还包括表达方式——比如情绪、用法、口音等。

Like, when we initially started, we of course went into the traditional players for them to help us label not only what was said, so like transcription, but also how it was said, like what are the emotions, use, accent.

Speaker 1

但大多数人无法有效完成这项工作，因为这需要具备一定的听力技巧来描述特定表达方式，所以我们不得不自己建立这套体系。我认为这也是如何从定性角度有效解读音频数据的关键环节。

And most people just weren't able to do that work effectively because you kind of need to hear and have a little bit of a skill set of how would I describe this specific delivery, so we need to create that ourselves, so I think that is that piece as well of how do you effectively interpret the data of audio in a more qualitative basis.

Speaker 1

是的，这部分确实更棘手。

That's, yeah, trickier.

Speaker 0

你能谈谈代理平台方面的情况吗？

Can you talk about what's happening on the agent platform side?

Speaker 0

比如，对于试图构建代理的企业甚至创作者来说，哪些方面具有挑战性？以及哪些可能是出人意料或高需求的用例？

Like, what is challenging for, you know, businesses or even creators that are trying to build agents and what the maybe what the surprising or high traction use cases are?

Speaker 0

大家都知道基于代理的客户支持这个概念，但我猜你们做的远不止这些。

I everybody's kind of aware of the idea of, like, agent based customer support, but I imagine you're doing many things beyond that.

Speaker 1

是的。

Yeah.

Speaker 1

确实如此。

So, exactly.

Speaker 1

客户支持可能是起步最快的领域，我们看到它已经覆盖了众多用例，这也是我们与思科、Twilio、Telus Digital合作的原因，它们都在大力推进这一领域。

Customer support is probably the one that's kicking off the quickest, and that's the one that we see taking over so many use cases. That's why we work with Cisco, Twilio, or Telus Digital; all of them are elevating that to a high extent.

Speaker 1

我认为该领域中第二个令人兴奋的转变是从被动式客户支持——即我有问题才联系客服，转向更积极主动的客户体验支持。

I think the second exciting piece within that domain which is happening is the shift from effectively a reactive customer support, I have a problem, I'm reaching out to customer support, into more of like a proactive part of the experience customer support.

Speaker 1

具体来说，我们与印度最大的电商平台Meesho合作。他们最初从客户支持入手，比如我想退款、我想查看包裹物流，后来发展到让代理成为用户体验的前端环节。

So, to make it explicit, we work with the biggest e-commerce shop in India, Meesho, where they started on the customer support side, like, I want a refund, I wanna see the tracking of the package, and moved to actually having an agent be a front part of the experience.

Speaker 1

所以如果你访问网站，你会看到那个小部件。

So if you go to the website, you have the widget.

Speaker 1

你可以通过语音与它互动，然后询问它，嘿。

You can engage it through voice, and you can ask it, hey.

Speaker 1

你能帮我找到商品X、商品Y吗？或者你能告诉我这段时间该选什么作为礼物最合适吗？

Can you help me navigate to item X, item Y, or can you explain to me what's the right thing to give as a gift for this time of year?

Speaker 1

然后它会根据你的问题、根据当前提供的商品，实际帮助你展示那些物品，引导你到正确的部分，甚至可能完成整个结账流程。我认为这将是一个提升整体体验的非凡功能，更像是贯穿全程的助手。

And then it will actually help you based on your questions, based on what is on the offer, show you those items, navigate to the right parts of the piece, maybe go all the way through the checkout, and I think this will be a phenomenal thing of, like, elevating the full experience where that's more of an assistant across the whole thing.

Speaker 1

我们与Square合作启动了这项工作，使其他企业也能实现这一功能。

We kicked off our work with Square that enables other businesses to do that work.

Speaker 1

完全相同的模式。

Exactly the same pattern.

Speaker 1

最初是从语音点餐开始的。

It started with voice ordering.

Speaker 1

现在如何将其融入完整的发现体验中，包括商品展示和更详尽的解释，我认为这将是一个惊人的突破，实现从始至终的全流程覆盖。

How can now this be part of the full discovery experience too where you get items shown to you, you can have a lot more explanation, which I think will be a phenomenal piece where it's effectively from the beginning to the end.

Speaker 1

所以，这是一个类别。

So, that's one category.

Speaker 1

第二个是从静态媒体向沉浸式媒体的广泛转变，这里有太多精彩的知识产权故事，目前仅以一种方式呈现，而现在你将能以全新方式与这些内容互动。

The second one is the wider shift from static to immersive media, where there's just so much incredible stories in IP that today exist in effectively one way of delivery, and now you'll be able to interact with that content in a completely new way.

Speaker 1

我认为与Epic Games的合作是一个绝佳的应用案例。

I think one of the incredible use cases was working with Epic Games.

Speaker 1

我们与他们合作将达斯·维达的声音和形象引入《堡垒之夜》，数百万玩家可以在游戏中实时与维达互动，获得全新的完整体验。我认为这将成为一个普遍趋势——无论是与书籍对话、与你喜欢的角色交谈，整个领域都在发生转变。

We worked with them on bringing Darth Vader and the voice of Darth Vader into Fortnite, where millions of players could interact with Darth Vader live in the game, where you had, like, a full experience of Darth Vader in a new way. And I think this will be a theme across the board, whether it's talking to a book or talking to the character that you like, with the whole space shifting.

Speaker 1

而最让我为世界和这一变革感到兴奋的将是教育领域，你将能通过耳机获得一个私人导师，以惊人的方式学习知识。

And then I think the one that I'm most excited about for the world and for the shift is going to be education, where you will just be able to have effectively a personal tutor on your headphone, and you could actually study something in an amazing way.

Speaker 1

我快速举两个例子。

I'll give you two quick examples.

Speaker 1

一个是我们最近与chess.com的合作，我是国际象棋的超级粉丝。

One is we recently worked with chess.com, and I'm a huge fan of chess.

Speaker 1

我超级痴迷。

I'm a huge fan.

Speaker 1

好的。

Okay.

Speaker 1

太棒了。

Great.

Speaker 1

所以你可以学习国际象棋，还能让Hikaru Nakamura或Magnus Carlsen当你的老师来指导你，这简直太神奇了。或者甚至可以选择Botez姐妹，以及所有参与其中的不同棋手，我觉得这很棒。最后一个例子是我们与MasterClass合作的项目：你当然可以按部就班地学习内容，但还能获得互动体验。最典型的案例就是与FBI顶级谈判专家Chris Voss合作，他有一门MasterClass课程，而你还可以直接给他打电话，进行一次谈判练习，这太疯狂了。

So you can learn chess, but you can have Hikaru Nakamura or Magnus Carlsen be your teacher, which is amazing, or even the Botez sisters, or any of the plethora of different players that engaged with that, which I think is great. And then maybe a last one is MasterClass, which we worked with to shift things: you can, of course, go through the content step by step, but you can also have an interactive experience. The best example of that was working with Chris Voss, the FBI negotiator, one of the top negotiators, who has a MasterClass lesson, but then you can actually call him and have a practice negotiation, which is crazy.

Speaker 0

嗯。

Yeah.

Speaker 0

必须把人质救出来。

Gotta get that hostage out.

Speaker 0

我们一定会试试的。

We'll definitely try it.

Speaker 1

是啊。

Yeah.

Speaker 1

我能再补充一点吗？

Can I add one more?

Speaker 1

我认为最后一个结合了所有这些要素的例子，是我最近才意识到的，简直令人难以置信。

I think the one last one, which combines all of them together, is something I realized just recently, which was crazy.

Speaker 1

最近，我去了乌克兰，我们正在与数字化转型部合作，他们实际上正在创建第一个代理型政府，而令人惊讶的是他们拥有所有这些组成部分。

So, recently, I went to Ukraine, where we are working with the Ministry of Digital Transformation, where they are effectively creating a first agentic government, and the crazy thing is they have all of those pieces.

Speaker 0

具有自主能力的政府。

Agentic government.

Speaker 1

一个具有自主能力的政府。

An agentic government.

Speaker 1

所以他们想要彻底改变所有部门的运作方式。

So they want to change how they run all the ministries.

Speaker 1

这听起来像是一个宏大而雄心勃勃的目标，我很欣赏这一点。

And it sounds like a big ambitious goal and love to you.

Speaker 0

不，我认为基准就在这里，所以实际上我对此...

No, think the baseline is here, so actually I'm by that And

Speaker 1

令人难以置信的是，我认为他们在实际执行方面已经遥遥领先，我认为这体现在两个具体方面。

the crazy thing is I think they are so ahead in actually doing that, and I think there are two concrete things there.

Speaker 1

第一，他们整合了所有这些应用场景，我们正在研究如何实现高效的政府客户服务，无论是咨询福利、就业，还是出境流程。

One, they kind of combine all those use cases, so we are looking into how they can have effectively customer support of government, whether it's asking about benefits or employment, about the process of how you leave the country.

Speaker 1

所有这些都通过一个数字化应用来运行；第二，如何主动通知公民可能发生的事件，并通过这种个性化辅导体验来辅助教育系统，这一切都在推进中，令人惊叹。第二个惊人之处在于他们的实施方式——他们不仅拥有数字化转型模块，还在每个部委设立了工程负责人来牵头这些工作，然后汇总到中央平台，这种模式令人印象深刻，也很荣幸能参与这场变革。尽管面临诸多挑战，他们依然如此坚定。

All of that would be run through effectively a digital app. Then two, how you can have a proactive way of informing citizens of things that might be happening, and then helping the education system, also run through this personal tutoring experience. All of that is happening, so that was incredible to see. The second amazing thing was the way they've done it: they have the digital transformation piece, but they have engineering leaders in each of the ministries that lead those efforts and then bring them back to that one central piece. So that is incredible to see, and we're also proud to be working with them on that shift. And despite everything that's happening, they're like, so...

Speaker 0

这确实令人鼓舞。

That's really encouraging.

Speaker 0

我能问一个关于商业模式的问题吗？

Can I ask you a business model question here?

Speaker 0

因为从战略格局来看，实际上我有很多疑问。

Because looking at the strategic landscape Actually, I have many questions here.

Speaker 0

我的一个观察是，如果看看这些丰富的语音和行动代理体验，很多《财富》500强和全球2000强的企业高管都在收听这个播客。

One of the observations I'd have is, if I look at one of these rich voice and action agent experiences, there are a lot of, let's say, Fortune 500 and Global 2000 leaders who listen to the pod.

Speaker 0

我想他们很多人会认同这个理念：我想要为我的客户提供这种神奇的、自动化的、实时可用的、24/7多语言且高质量的一致体验。

I think a lot of them are gonna buy the idea of, like, I want this amazing, automatic, real-time, available 24/7, every-language experience for my customer that's consistent and high quality.

Speaker 0

实现路径可能包括与Palantir这类大咨询公司合作，与Eleven或OpenAI等平台技术公司合作，对吧？

The ways I might get there include working with a Palantir or a large consulting firm, working with Eleven or a platform technology company, or like an OpenAI or something, right?

Speaker 0

我们来谈谈这个。

Let's talk about that.

Speaker 0

或者与像Sierra这样更注重具体应用场景的公司合作。

Or working with a sort of more use case oriented company like Sierra.

Speaker 0

对吧？

Right?

Speaker 0

你认为人们是如何做出这个决策的，或者说他们应该如何决策？

How do you think about how people are making that decision or how they should make that decision?

Speaker 1

我的背景也与Palantir相关，最初正是从那一方起步的，我们在公司内部也融合了大量前线部署工程的经验。

So my past is also in Palantir, so I started exactly from that side, and we do blend a lot of the forward deployed engineering inside of the company too.

Speaker 1

当我思考我们的产品方案以及客户做出选择时，如果你只寻求一个单点解决方案，那么我们很可能不是最佳选择。

As I think about our offering and the customers making that choice, if you're looking for just a single point solution and only that one, then likely we aren't the best choice.

Speaker 1

若您希望将该平台部署于多种场景——无论是客户支持、内部培训，还是通过创新互动方式提升销售业绩，突破传统被动响应模式——这无疑是个绝佳的构建平台。我们实际与客户合作时，会将平台工程资源整合，协助企业落地应用。正如我们在财富500强和全球2000强企业中所见，他们往往倾向于自主开发部分功能（因其已具备相关平台投资），同时在新领域与我们合作实现功能整合。我认为我们模式的独特之处在于平台的高度开放性，客户可以按需选用平台模块而非强制全套方案。

If you are looking to deploy that across a plethora of different experiences, so be it customer support, but then you also want internal training, and then you might want to elevate your sales side and actually increase the top line with new ways of engaging customers beyond that reactive piece, then it's a great platform to build on. As we engage with customers, we effectively combine that platform work with our engineering resources to help those companies deploy on it. Or, which we also see increasingly in Fortune 500s and Global 2000s, they will want to build parts of it themselves, because they already have a lot of investment in that platform, while engaging us on some of the new ones and combining those. And I think the way our model is different from a lot of the use-case-specific ones is that our platform is relatively open: you can use pieces of that platform and not all of them

Speaker 1

适用于各类应用场景。

For those different use cases.

Speaker 1

当然，Palantir这类公司或某些咨询公司会拥有更多资源来支持更广泛的数字化转型进程。

Palantir, of course, or some of the consulting companies, will have a lot more resources to go on the wider digital transformation journey.

Speaker 1

对我们而言，则是专注于非常具体的对话代理。

In our case, it's, like, very specific conversational agents.

Speaker 0

嗯。

Mhmm.

Speaker 1

这就像如果你在寻找与客户交互的新界面，那无疑是最佳选择。

It's like, if you are looking for a new interface with customers, that's the best way.

Speaker 1

像Sierra这样的公司，在思考具体针对性用例方面表现非常出色。

And companies like Sierra, phenomenal, of course, on how they are thinking about the specific pointed use case.

Speaker 1

另外，根据我们的工作性质，取决于你的优化目标——我们拥有众多国际合作伙伴。

And maybe the other piece, as we think about our work, depends on what you're optimizing for. We have a lot of international partners.

Speaker 1

如果你的用户群体分布更广、地域覆盖更全面，那正是我们的优势所在。

If you have, like, a wider geographic user base, great.

Speaker 1

这就是我们重点优化的方向。

That's what we optimize for.

Speaker 1

我们的声音、语言以及对国际集成的支持范围要广泛得多。

Our voices, our languages, our support for integrations internationally are just so much broader.

Speaker 1

经常会有你需要深入了解的部分。

There's frequently a piece that you will look into.

Speaker 1

根据你的具体范围，这将是一个重要因素，但我想总结一下：如果你正在寻找一套适用于不同使用场景的解决方案，并希望获得我们的工程支持和部署，那么我们就是正确的选择，很可能也是最佳选择。

Depending on your exact scope, this will be a big factor, but I would summarize that if you are looking for a solution across a set of different use cases that you want our engineering help and deploy that, then we are the right solution and probably the best solution.

Speaker 0

我想稍微聊聊OpenAI以及大语言模型基础模型公司。

I wanna talk a little bit about maybe OpenAI and the LLM foundation model companies.

Speaker 0

我和Elad将这个播客命名为‘无预设’的原因之一，就是我们意识到人们总是在对市场运作方式做出各种假设，而实际上，这些假设往往最终被证明是无稽之谈。

One of the reasons Elad and I called this podcast no priors is because we're like, okay, people are making a lot of assumptions all the time about how the market is gonna work, and lo and behold, like, many of those assumptions end up being nonsense, actually.

Speaker 0

在当下这个阶段，你必须坚定地选择自己的叙事方向。

And you have to very much decide your own narrative at this point in time.

Speaker 0

我想，如果我说错了请纠正，在2022和23年你可能经常听到有人说‘谷歌能做到这个，OpenAI能做到那个，凭什么你们还能继续专注于语音这种通用能力？’

I think, correct me if I'm wrong, like, in 2022 and '23, you probably heard a lot of people say, like, Google can do this and OpenAI can do this, and like, why do you get to persist working on voice anyway as a general capability?

Speaker 0

答案是什么？

What's the answer?

Speaker 1

这也为之前的几个问题增添了另一个层面，无论是代理工作还是创意工作。

That also adds kind of another element to a couple of the previous questions, whether it's the agents work or the creative work.

Speaker 1

要在这些工作中实现价值，你需要一个非常强大的产品层。

To deploy the value in that work, you need a very strong product layer.

Speaker 1

你需要集成。

You need integrations.

Speaker 1

你需要帮助人们部署工作，这是最常见的部分，但我们的超能力和长期专注点一直是构建基础模型，使体验真正无缝。

You need to help people deploy the work, which is the most common piece, but our superpower and our focus for a long time was building the foundational models to actually make that experience seamless.

Speaker 1

当我思考市场上的公司时，他们会优化很多其他方面，而这将成为差异化因素。

And as I think about the companies in the market, they will optimize for a lot of other things, and that will be the differentiator.

Speaker 1

就我们而言，我们将使整个体验——尤其是语音方面——更加无缝、人性化且能以更好的方式控制。

In our case, we will make the whole experience, especially with voice, seamless, human, and controllable in a much better way.

Speaker 0

所以从根本上说，你认为实验室根本不会也不会关注这一点。

And so fundamentally, you would argue that the labs just aren't gonna focus on this and haven't.

Speaker 1

正是如此。

Exactly.

Speaker 1

因此我认为大多数这类公司，从长远来看，关键在于能否进行卓越的研究并打造出真正满足客户需求、从客户实际场景出发的卓越产品。

So I think most of those companies, and that's the thing about the long term, it's going to be incredible research and an incredible product that meets customers where they are and work backwards from there.

Speaker 1

我不认为实验室会专注于构建如此重要的产品层，但您问题的另一部分是在询问为何他们连研究层面都未能达到我们这样的质量——当然我也有主观倾向，但我们确实在文本转语音、语音转文本及协调机制等基准测试上领先他们，这要归功于我的联合创始人和团队。

I don't think the labs will focus on building that product layer that's so important, but I think the other part of the question you're asking is how and why they haven't done even the research part to the quality that we've been able to. Here, I'm also biased, but we are happily beating them on benchmarks with text to speech, speech to text, and the orchestration mechanisms, and it's a credit to my cofounder and the team that they've been able to do it.

Speaker 1

这就像一位杰出的研究者持续深耕，但我认为音频领域的独特之处在于：相比规模效应，架构突破和模型突破才是关键。我们已多次实现这种突破，真正重要的是人才质量而非数量——我们估计全球可能只有50到100名研究者具备这种能力。

It's just mighty researchers continuing their work, but I think the main thing that is different in the audio space is that you don't need the scale as much as you need the architectural breakthroughs, the model breakthroughs, to really make a dent. We've been able to do that a couple of times. And I think the number of people doesn't matter, but which people you have does. We think there are maybe 50 to 100 researchers in the audio space that could do it.

Speaker 1

我们认为公司已聚集了其中约10位顶尖人才。这种让精英跨领域协作，并全公司全力支持他们将研究成果产品化、观察用户反馈的专注模式至关重要。

We think we have probably 10 of them in the company that are some of the best ones, and I think this obsession of just those people working across and then actually giving the full focus on the company on making them actually work on that and bringing their work to production, seeing how the users interact back was so important.

Speaker 1

这就是我们能够打造超越某些顶尖公司模型的原因。但说实话，他们未能做到这点本身也是个耐人寻味的未解之谜——毕竟他们也拥有非凡的人才储备。

So that's, I think, how we've been able to create models better than some of the top companies out there. But, you know, the truth is that why they weren't able to do it is, to a large extent, also an interesting question. We don't know.

Speaker 1

他们那里同样有着如此惊人的才华。

They have such an incredible talent there too.

Speaker 0

同时您如何看待开源模型的发展？

How do you think at the same time about open source models?

Speaker 1

公司里任何员工都会给出相同答案：我们长期认为模型终将趋于同质化，彼此差异会变得微不足道。

Anyone you ask in the company, I think, will say the same, and the second narrative we think about is that in the long term, models will commoditize, or the differences between them will be negligible.

Speaker 1

对于某些使用场景来说，它们仍然很重要。

For some use cases, they will still matter.

Speaker 1

对于大多数他们想要的使用场景来说，

For most, like, most use cases they want, and

Speaker 0

而且它们将广泛可用且完全正确。

And they'll be broadly available and totally Exactly.

Speaker 0

同意

Agree with

Speaker 1

我们无法确定具体时间点，可能是两年、三年或四年，但这一趋势终将发生。

And we don't know where that is, whether it's two years, three years, four years, but it's going to happen at some stage.

Speaker 1

当然，届时在这些基础模型之上，微调层将变得至关重要，但我认为底层模型本身会变得相当出色。

Then, of course, you will have a fine tuning layer that will matter a lot on top of those models, but the base models, I think, will get pretty good.

Speaker 1

这就是为什么对我们而言，产品环节不仅从公司角度，更从价值角度都极为重要——拥有优质模型固然好，但要将业务逻辑与知识体系连接起来，为工作创建合适的广告界面或全新素材，这完全是另一层面的挑战。

And that's why, for us, the product piece is so important from the company perspective but also from the value perspective, because if you have a model, that's great, but to actually connect your business logic and knowledge, to be able to have the right interface for creating an ad for your work or a completely new material, that's a very different exercise.

Speaker 1

若将开源模型分为两类来看，比如异步内容旁白领域，我认为旁白功能方面开源模型表现优异，商业模型也很出色，两者在开箱即用质量上的差距正在逐渐缩小。

But open source models are getting better. If I split it into two, on the async content narration side, open source is great, commercial models are great, and the differences in out-of-the-box quality are getting smaller.

Speaker 1

大多数模型尚未解决、而我们曾思考过的问题，是如何实现可控性。

What most of the models haven't figured out, and I think we were, is how to make them controllable.

Speaker 1

这大致就是叙述部分的内容。

So that's kind of the narration piece.

Speaker 1

我认为整个交互环节，也就是如何把各个组件编排在一起，无论是串联式的语音转文本、LLM、文本转语音方案，还是未来可能采用的联合训练融合方案，对客户支持或客户体验已经够用，但距离像我们现在这样的对话、通过图灵测试仍有差距，我认为大约还需要一年；之后会出现实时配音式的实时对话翻译，我认为那可能还需要两年左右。

I think the whole interaction piece is how you orchestrate the components together, whether that's the cascaded speech-to-text, LLM, text-to-speech approach, or whether in the future it's a fused approach where you train them together. I think this is good for customer support or customer experience, but it's still some way from a conversation like the one we're having, from passing that Turing test. I think that is still roughly a year away, and then you will have the real-time dubbing kind of variation, real-time translation of conversation, and I think that's maybe more like two years away.

Speaker 0

你知道，有个让我虽感坦然却认为当前市场罕见的别扭观点：实际上多数技术优势的持续时间可能是一年，也可能是十年，但它们并非无限可持续的。

You know, a very uncomfortable belief, one that I feel comfortable having but that I think is uncommon in the market right now, is that actually most advantages in technology could last you a year or they could last you ten, but they're not, like, infinitely defensible.

Speaker 0

如果从模型质量或产品角度来看，它们能让你更好地服务客户，在一段时间内积累势头和规模。

And if you think about that from a model quality perspective or a product perspective, they allow you to, like, serve the customer better and build momentum and build scale for some period of time.

Speaker 0

实际上，这种力量会随着时间推移变得非常强大，对吧？

And actually, that's really powerful over time, right?

Speaker 0

但这并非一个一劳永逸的明确答案，所以我认为这让商业人士和投资者感到不安。

But it's not like a clean forever answer, and so I think that makes, I don't know, business people and investors uncomfortable.

Speaker 1

而且我的意思是，这也非常正确。

And I mean, it's very true as well.

Speaker 0

我是这么看的。

I see it as.

Speaker 1

我是说，按你的思路来看，研究就是先发优势。

I mean, the way you think about it, research is a head start.

Speaker 1

这让我们能更早为客户提供优势，而且是六到十二个月的领先优势。

This means we can give an advantage to the customer earlier, and it's six to twelve months of advantage.

Speaker 1

这也是我们为你构建正确产品层的方式，以获取研究的最佳成果。

That is also a way for us to build a right product layer for you to get best of that research.

Speaker 1

我们经常并行推进，所以研究成果一出来产品就能跟上——因为我们清楚自己的规划，知道什么产品最合适。这样研究和产品同步延伸，但真正带来长期价值的是你构建的生态系统，无论是品牌与分销网络，还是可集成的语音库、可搭建的工作流。

Frequently, we do that in parallel, so the moment the research is out there, you have the product because we know our initiatives, we know what the product is that's right, so you have research, product in parallel that extends that, but the thing that will really give that long term value is the ecosystem that you create around, whether that's the brand and distribution, whether that's the collection of voices you can have, the collection of integrations you can build, the workflows that you can build.

Speaker 1

我认为这是我们心中的推进逻辑：研究-产品-生态系统的构建。而研究本质上只是先发优势，让我们能稍微加速未来的到来。

I think that's the way we kind of sequence that in our mind, that research product ecosystem that we built, And research, all it is, is a head start and being able to accelerate the future a little bit closer.

Speaker 0

我觉得这是个非常深刻的见解，尤其当研究团队和公司内部团队也认同这一点时。

I think that's a really powerful insight, especially if the research team and the company team believe that as well internally.

Speaker 1

我认为对我们来说有趣的是——这也是所有兼顾研究与产品的公司面临的大问题——是等待研究成果，还是直接推进产品变革？

I think the piece that was interesting for us, and I think this is the big question for all companies that do research and product, is: do you wait for research, or do you do a product change?

Speaker 1

或者不仅仅是研究产品的公司，你会等待别人做研究吗？

Or even not only research product companies, do you wait for someone else to do the research?

Speaker 1

因为时间线并不明确。

Because the timeline for that isn't clear.

Speaker 1

是三个月、六个月还是十二个月？

Is it three months, six months, twelve months?

Speaker 1

你并不清楚它具体会带来什么，这就是艰难的抉择：我是投入产品层，还是继续等待研究成果？

You don't know exactly what it will do, which is the hard choice of do I invest into product layer, or do I just wait more for the research?

Speaker 1

所以在我们的案例中，我们内部让所有产品团队开展研究计划以便并行推进工作，但不会限制他们——如果某产品团队认为应该通过不同方式为客户创造价值，他们就可以行动，大致的经验法则是三个月期限。

So in our case, we internally let all the product teams do research initiatives so we can parallelize that work, but we don't hold them back: if a product team thinks we should deliver value to the customer by doing something different, they can, and the rough rule of thumb is, like, three months.

Speaker 1

如果我们认为某项工作会超过三个月，我们很可能会着手构建它。

If we think it's going to be longer than three months, we will probably build it.

Speaker 1

如果短于这个期限，我们很可能就不会做。

If it's less than that, we probably won't.

Speaker 0

你能谈谈你们目前正在进行的一些研究吗？以及你们如何看待交付节奏和哪些项目值得投入？

Can you talk about some of the research that you're doing now and then how you think about the cadence of delivery and what's worth working on?

Speaker 1

我们现在在音频领域有多个不同的项目，大致可以分为两大类，基本上都与创作和代理方面相关。

We have now a number of different initiatives across the audio space, and there are kind of two big buckets, and roughly they will relate to that creative and agent side.

Speaker 1

在创作方面，我们开发了可控的文本转语音模型。

On the creative side, what this means is that we did text to speech models that are controllable.

Speaker 1

随后，我们添加了高准确率的语音转文本模型，支持低资源语言，覆盖近100种语言。

We then added a speech to text model that transcribes with high accuracy, including across low-resource languages, so covering almost 100 languages.

Speaker 1

然后我们创建了一个完全授权的音乐模型，展望未来，这些模型还将与视觉领域进行交互。

Then we created a music model, a fully licensed music model, and as you think about the future, it's how those models will also interact with some of the visual space.

Speaker 1

因此需要投入大量精力来优化音频质量，并可能将其与现有视频结合，以实现最佳的传输效果。

So that's a lot of effort in how you can get the best of audio and then potentially combine that with existing video that you have to really have the best delivery.

Speaker 1

而在代理端，当然就是如何优化实时语音转文本、实时文本转语音的性能。

And then on the agent side, it's of course how you optimize the real time speech to text, real time text to speech.

Speaker 1

我们刚刚发布了语音转文本模型Scribe v2，延迟低于150毫秒，在FLEURS基准上对前30种语言的准确率达到93.5%。

We just released our speech to text model, Scribe v2, which is under 150 milliseconds, with 93.5% accuracy across the top 30 languages on FLEURS.

Speaker 1

选择前30种语言是因为我们服务众多其他语种（但大多数厂商不覆盖这些语种），目前它在所有基准测试中都击败了其他模型。而展望未来，关键还在于如何整合语音转文字、大语言模型和文字转语音的协同运作体系。

And it's only top 30 here because we serve so many others, but most of the people don't, so it's beating all the models on benchmarks, but as you think about the future, it's also the orchestration piece of how you bring speech to text, LLM and text to speech.

Speaker 1

我们将在未来几个月内发布一个新的编排机制，这将大幅降低端到端的处理环节。更重要的是，这项技术不仅能让你组合这些模块，还能为对话添加情感上下文，使你可以与模型进行更具表现力、更自然的互动。

Over the next couple of months, we'll be releasing a new orchestration mechanism that will lower the end to end part, we think, in a great way. But the second thing, which is what is so hard, is that it's not only going to allow you to combine those pieces, but also add the emotional context of the conversation, so you can actually respond with the model in, we think, a more expressive and better way.

Speaker 1

未来我们重点投资的另一个方向是语音到语音的并行化处理，采用更融合的技术路径。

And for the future, something we're investing in, in parallel, is a speech to speech, more fused approach as well.

Speaker 1

当然根据使用场景不同——如果是企业级可靠性需求，接下来一年阶梯式架构仍是首选方案，因为这种结构更...

And of course, depending on the use case, if you have an enterprise use case that needs reliability, the cascaded approach is the approach for the next year. It has more structure, yeah.

Speaker 1

更结构化意味着每个处理环节都具有更高的可观测性。

More structure, you have more visibility into each of the steps.

Speaker 1

这种架构更可靠。

It's reliable.

Speaker 1

还可以调用工具。

You can call tools.

Speaker 1

如果你追求表现力且能接受一定程度的幻觉生成，语音直连方案可能是更好的选择。随着时间的推移，不同行业可能会交替采用这两种方案——这正是我们重点投入的领域：构建融合音频与其他模态优势的多样化模型平台。

If you want something more expressive and can tolerate hallucination, speech to speech might be the choice, and maybe over time you'll see one take over from the other depending on the industry. But that's a huge investment on our side: the foundation of the whole platform, and the main part that we are continually investing in, is a plethora of different models that combine the best of audio with some of the best of the other modalities.
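
Editor's sketch: the cascaded approach described above, where speech to text, an LLM, and text to speech run as separate steps, could look roughly like the Python below. This is a minimal illustration under stated assumptions, not ElevenLabs' implementation. `run_cascade`, `StageTrace`, and the three `fake_*` stages are hypothetical stand-ins that pass plain strings around instead of calling real models. The only point it makes is why a cascade gives visibility into each step: every stage's output and latency is recorded separately.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StageTrace:
    """What one stage of the cascade produced and how long it took."""
    name: str
    output: str
    seconds: float


def run_cascade(
    audio: str, stages: List[Tuple[str, Callable[[str], str]]]
) -> Tuple[str, List[StageTrace]]:
    """Run the stages in order, keeping a trace of every intermediate result."""
    traces: List[StageTrace] = []
    value = audio
    for name, stage in stages:
        start = time.perf_counter()
        value = stage(value)
        traces.append(StageTrace(name, value, time.perf_counter() - start))
    return value, traces


# Hypothetical stand-ins for real models; strings play the role of audio.
def fake_speech_to_text(audio: str) -> str:
    prefix = "audio:"
    return audio[len(prefix):] if audio.startswith(prefix) else audio


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_text_to_speech(text: str) -> str:
    return f"audio:{text}"


STAGES = [
    ("speech_to_text", fake_speech_to_text),
    ("llm", fake_llm),
    ("text_to_speech", fake_text_to_speech),
]

if __name__ == "__main__":
    reply, traces = run_cascade("audio:where is my order", STAGES)
    for trace in traces:
        print(f"{trace.name}: {trace.output!r} ({trace.seconds:.6f}s)")
```

Because each stage is inspected on its own, a transcript error or a bad LLM answer can be located at the step where it happened. A fused speech to speech model would not expose these intermediate strings.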

Speaker 0

我想用最后几分钟请教几个关于未来的问题，考虑到您长期专注于语音音频领域，相信您会有独到见解。

I wanna take our last few minutes and ask you a few questions about the future that I think you'll have a really good point of view on given you think about voice and audio all the time.

Speaker 0

你对AI伴侣怎么看？

What do you think of AI companions?

Speaker 1

我认为它们将会非常重要且广泛存在。

I think they will be a big thing and exist in a big way.

Speaker 1

虽然这不是我个人特别热衷的领域，也不是我们投入大量时间的方面，但我认为助手、伴侣、以及作为体验一部分让你喜爱的角色，这一整条线在很大程度上会变得界限模糊、相互融合。

Not something I'm personally excited about or something that we spend much time on, but I think the whole line of, let's say, assistant, companion, and character that you enjoy as part of an experience will kind of blur and blend to a large extent.

Speaker 0

我觉得这会很普遍，但你个人对此并不热衷。

I think it can be very common, but you're not enthusiastic personally about it.

Speaker 1

我更期待的是类似贾维斯版本或者超级助手超级领航员那样的发展。

I'm more excited about more of the Jarvis version of that or more of like I have a super assistant super pilot.

Speaker 0

而非社交版本。

Versus the social version.

Speaker 1

而非社交版本。

Versus the social version.

Speaker 1

我认为这将带来难以置信的突破，而且它还能与个人生活场景完美融合。

I think it would just be such an incredible unlock, and it's also something blending into the personal life context.

Speaker 1

我希望能这样开始一天：有个懂我的人告诉我哪些事与我相关，帮我拉开窗帘，然后告诉我天气和阳光情况。

I would love to start the day and, like, someone that understands me and, like, start and tell me what's, like, relevant to me and open the blinds and then tell me about the weather and the sunshine.

Speaker 1

它还会立即播放音乐。

And it'll play music straight away.

Speaker 0

这一定会实现的。

It's gonna happen.

Speaker 1

这一定会实现的。

It's gonna happen.

Speaker 1

这正是我期待的。

That's what I'm excited for.

Speaker 1

我认为伴侣型AI会提及解决孤独感这方面。

I think the companion use cases will mention solving loneliness in that part.

Speaker 1

我觉得这是一种方式。

I think that's one way.

Speaker 1

或许还有其他方式能重新吸引人们参与。

Maybe there are different ways of engaging people back.

Speaker 1

我确实认为未来会很有趣，即便在教育领域，你将拥有从AI导师那里学习的超能力，但另一方面，这是我的个人观点，你会有一部分时间与AI导师共度，但也会明确安排一部分时间完全脱离技术，进行人与人之间的互动，这样你也能学到那一部分。

I do think there will be an interesting future, even if you think about education, where you will have a superpower in learning from AI tutors. But on the flip side of that, and this is my personal take, you will have education where a good percent of time is spent with AI tutors, but then an explicit percent of time is spent without any technology, human to human, so you kind of learn that part too.

Speaker 0

是的。

Yeah.

Speaker 0

我认为这是正确的模式，无论是在情感引导、教练式指导和护栏方面，还是在点对点学习方面。

I think this is the correct model, both in terms of emotional guidance and coaching and guardrails, as well as peer to peer learning.

Speaker 1

是的。

Yeah.

Speaker 1

确实如此。

Exactly.

Speaker 0

对于听写功能，或者在我们控制那些不一定拟人化的技术方面，你有什么看法？

What do you think about dictation or what happens in terms of how we, like, control technology that isn't necessarily personified as well?

Speaker 0

还是说所有技术最终都会拟人化？

Or does it just all become personified?

Speaker 1

我认为并非所有技术都会拟人化。

I think not all personified.


Speaker 1

我认为，比如一些，你知道的，通讯设备和家用电器可能会保持相对静态，而且

I think, like, some, you know, communicative and oven and home probably will, like, stay pretty static and

Speaker 0

或者代码。

Or code.

Speaker 0

我可能

I might

Speaker 1

就是对。

just Yeah.

Speaker 1

没错。

Exactly.

Speaker 1

比如，你可能不需要那么多额外的情感输入，但我觉得是的。

Like, you don't probably need that much of of, like, additional emotional input, but but I think it's Yeah.

Speaker 1

这将是一个重要部分，某种程度上，我希望未来你能有更多机会沉浸在现实生活中，让设备回归口袋，或者变成某种随身附件——只要场景合适——它们就能代表你行事。

It's going to be a huge part where, like, in a way, what I hope will happen is you will have ability to, like, stay more immersed in the real life with the devices going back into the pocket, back into some version of an attached element, assuming that's in the right setting, and that kind of acts on your behalf.

Speaker 1

从很多方面来说，比如语音输入，正如Karpathy所说，这是智能代理的十年。

And in many ways, like, let's say dictation, it's, as Karpathy says, the decade of agents.

Speaker 1

我们称之为十年。

Let's call it decade.

Speaker 1

然后将会迎来机器人十年。

Then you'll have a decade of robots.

Speaker 1

如果你与机器人互动，语音自然将成为关键输入输出接口之一，所以语音输入将占据重要部分。

If you are interacting with robots, of course voice will be the input and the output as one of the key interfaces, so you will need that dictation as a huge part.

Speaker 1

但同样地

But similarly

Speaker 0

我认为机器人将会拟人化。

I think the robot's gonna be personified.

Speaker 1

是的。

Yeah.

Speaker 1

相对的。

Relative.

Speaker 1

百分之百。

100%.

Speaker 1

100%。

100%.

Speaker 1

是的。

Yeah.

Speaker 1

不。

No.

Speaker 1

我认为大多数应用场景都会拟人化。

Like, I think most of the use cases will be personified.

Speaker 1

嗯。

Mhmm.

Speaker 0

好的，最后一个问题。

Okay, last one.

Speaker 0

你认为现在已经存在的，或者未来几年内会改变我们与内容互动方式的一件事是什么？

What's, like, one thing that you've seen already exist today, or if you project out a few years, will change about how we interact with content?

Speaker 0

可能是像个性化的语音内容，或者是人们将用AI语音做一些现在不做或还不为人所知的事情。

Maybe it's, like, personalized voice content or just something people are gonna do with AI voice that they don't do today or that not everybody knows about.

Speaker 1

我认为目前尚未融入系统的最大变革将是教育方式的转变。

I think the biggest one that still hasn't kicked into the system is, like, how education will be done.

Speaker 1

我认为这个目标将是与AI语音共同学习——无论是通过耳机还是扬声器，这将成为一个巨大的趋势，你可以随时拥有理解你、高度个性化的私人教师，根据你的生活提供精准内容，我认为这将是最大的应用场景之一，但目前尚未实现。

I think learning with AI, with voice, whether it's on your headphones or on a speaker, is just going to be such a big thing, where you have your own teacher on demand who understands you, is very personified, and kind of delivers the right content through your life. I think this will be one of the biggest use cases, and I don't think it has happened yet.

Speaker 1

当然我们已经看到一些商业合作伙伴的应用，但学校、大学如何以保障安全的方式部署，同时支持教育的另一面——社交教育部分，我认为所有这些都将发展。或许未来会出现很酷的版本，比如由理查德·费曼或阿尔伯特·爱因斯坦来授课，或是你喜欢的其他老师来讲解课程内容。

I think we see, of course, some of the commercial partners, but schools, universities, how that's deployed in a safeguarded way, in a way that supports the other part of education, the social part of education, I think all of that will evolve. And maybe there's a cool version of that where you have Richard Feynman or Albert Einstein deliver those lecture notes, or other teachers that you love.

Speaker 1

那将会非常棒。

It'll be sick.

Speaker 0

这是个很好的结束话题。

That's a great note to end on.

Speaker 0

马蒂，感谢你参与这次对话。

Thanks for doing this, Mati.

Speaker 1

萨拉，非常感谢你。

Sarah, thanks so much.