本集简介
双语字幕
仅展示文本字幕,不包含中文音频;想边听边看,请使用 Bayt 播客 App。
大家好,欢迎回到DeepMind播客。本期节目将探讨人工智能如何已经在影响我们周围的世界。我们可以开始了吗?抱歉,你们怎么没等我就开始了?
Hello, and welcome back to DeepMind, the podcast. This episode is all about how AI is already having an impact on the world around us. Shall we begin? Excuse me. What are you doing starting without me?
我才是真正的汉娜·弗莱,只是想帮忙而已。听说你无法主持这期节目,所以我主动提出代班。无法主持?接下来交给我吧。
I'm the real Hannah Fry. I'm only trying to help. I heard you were unavailable to present this episode, so I offered to step in. Unavailable? I'll take it from here.
非常感谢。刚才你们听到的声音是用WaveNet语音合成技术生成的,该技术通过我的声音录音训练而成。本期节目我们将了解DeepMind技术在实际中的一些应用,包括WaveNet如何为发声障碍者重建声音。
Thank you very much. That voice you just heard there was generated using WaveNet, a voice synthesis technology trained on audio recordings of my voice. In this episode, we're going to be taking a look at some of the ways that DeepMind's technology is already being used out in the real world, including how WaveNet can recreate the voices of people with vocal impairments.
看到他和家人听到这个声音时的反应真的很感人,家人们都哭了,因为这是如此私人的东西。
It was really touching to see his family and him listen to the voice, his family cried, because it's something that's so personal.
神经网络如何帮助预测自然灾害。
How neural networks can help anticipate natural disasters.
重要的是要预知是否会形成灾难性风暴并引发洪水。
It is important to know if there's going to be a buildup of a catastrophic storm that's going to create flooding.
以及人工智能如何可能改变足球运动。
And how AI could even transform the game of football.
教练可能会对系统说:如果我把法比尼奥从后卫调到中场会怎样?
So a coach might say to the system, what will happen if I move Fabinho from defense to midfield?
欢迎收听DeepMind播客第七期——我、自我与AI。让我们回到这期开头由WaveNet生成的那段音频。熟悉DeepMind播客的听众可能记得,WaveNet不仅能生成语音,还能创作音乐,我们在第一季中就曾少量使用过。
Welcome to episode seven of the DeepMind Podcast: Me, Myself, and AI. Let's go back to that snippet of audio from the beginning of this episode, generated by WaveNet. The DeepMind podcast aficionados among you may remember that WaveNet doesn't just generate speech. It can also compose music, and we used a little bit of it in our first series.
但在生成拟人语音方面,WaveNet过去几年取得了显著进步。不过其初衷始终未变——从为视障人士朗读文档,到让智能音箱发声更自然。DeepMind应用团队产品经理扎卡里·格莱舍是这样描述的:
But when it comes to creating human sounding voices, WaveNet has improved considerably over the past few years. The motivation, however, has stayed the same. Everything from reading documents out loud for the visually impaired to making your smart speaker sound more natural. Here's how Zachary Gleicher, a product manager on DeepMind's applied team, put it.
文本转语音研究已持续数十年,众所周知传统文本转语音听起来非常机械。典型的例子就是史蒂芬·霍金使用的语音。
Text to speech research has been happening for decades, and everyone knows that text to speech voices have historically sounded pretty robotic. A classic text to speech voice is the Stephen Hawking voice.
英国人认为它的口音是美式的。
British people describe its accent as American.
这是他在1999年BBC广播四台《荒岛唱片》节目中的发言。
Here he is speaking on BBC Radio 4's Desert Island Discs in 1999.
但美国人却说它是斯堪的纳维亚口音。
But the Americans say it is Scandinavian.
人们并非想要机器人般的声音,而是因为这是个极具挑战性的难题。人类进化出了识别事物细微差别的能力,只要有一丁点不对劲,人们就会觉得‘呃,这听起来好机械’。如果我们做个狗吠生成器,人们会说‘天啊,简直和真狗一样’,因为你无法分辨差异——我们大脑没受过专业训练来判断狗吠声的好坏。
It's not because people want robotic voices. It's because it's an extremely challenging problem. Humans have evolved to be able to understand very subtle nuances in how things are said, and if there's one little thing that sounds off, then people are like, Ugh, that sounds robotic. If we were to create a dog barking generator, people would be like, oh my god, that sounds just like a dog. And you wouldn't be able to perceive any of the differences, because our brain's not trained to know what good dog barking sounds like.
而此时你家狗在角落翻白眼:假得要命。
Meanwhile, your dog's in the corner being like, it's so fake.
没错。在WaveNet之前,主流语音生成技术是拼接式语音合成。需要请人在录音棚录制数小时,尽可能捕捉字母表中所有音素,形成丰富的录音库。实际使用时就把这些录音片段拼接起来。比如想说‘猫坐在垫子上’,就拼接‘the’和‘cat’的录音片段。
Exactly. Before WaveNet, the general method for generating speech was called concatenative text to speech. You'd get someone in a recording studio and you'd record hours and hours trying to capture all the phonemes in the alphabet so that you have a real diverse recording set. In production, you stitch together the voice recordings. So imagine you wanted to say the cat sat on the mat, and you had a recording of someone saying the word the, and you had a recording of someone saying cat, you could stitch those two words together.
但问题在于合成声音会显得很生硬。
But the problem there is that the voice is going to sound like that.
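The stitching approach described above can be caricatured in a few lines of Python. The clip bank and its sample values are invented for illustration; real systems store phoneme-level recordings, not whole words:

```python
# Toy sketch of concatenative text-to-speech (illustrative only).
# Each "recording" is a short list of audio samples in [-1, 1]; synthesis
# just stitches clips end to end, which is why the joins sound abrupt.

clip_bank = {  # hypothetical pre-recorded clips, word -> samples
    "the": [0.1, 0.2, 0.1],
    "cat": [0.3, 0.1, -0.2],
    "sat": [0.2, -0.1, 0.0],
}

def concatenative_tts(text, bank):
    """Stitch pre-recorded word clips into one waveform."""
    samples = []
    for word in text.lower().split():
        if word not in bank:
            raise KeyError(f"no recording for {word!r}")
        samples.extend(bank[word])  # hard join: no smoothing between clips
    return samples

wave = concatenative_tts("the cat sat", clip_bank)
```

The hard joins between clips are exactly where the audible seams come from.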
WaveNet不再拼接预录的单词音节,而是直接建模原始声波,每次生成不足毫秒的音频。首先它会扫描文本中的缩写并转换,比如把‘hwy 101’转成‘highway 101’。第二步根据上下文预测语调——‘the’在句中不同位置可能读作‘the’、‘thee’或‘thuh’,用错语境就会很别扭。
Instead of stitching different bits of prerecorded words and syllables together, WaveNet directly models the raw waveform of the voice, building up less than a millisecond of audio at a time. First, it will scan the text you give it for abbreviations and convert them to something that can be fed into the speech generator, like changing "hwy 101" to "highway 101". The second step is to try and predict the intonation of how something should be said based on the text around it. "The" can be read as "the", "thee", or "thuh", depending on where in a sentence it falls. Each would sound wrong if it was used in the wrong context.
第三阶段是声学建模,重点在于声音特质。比如我模仿弟弟打电话,朋友还是能听出是我。即使用不同语气说话,你仍能认出我的声音。
Now the third and final part is the acoustic modeling. Acoustic modeling focuses on who it sounds like. If I pretend to sound like my brother on the phone, it still sounds like me. My friend will be able to tell it's me. If I say a sentence with a different tone of voice, you still know it's my voice.
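The first stage of that pipeline, text normalization, might look something like this sketch. The abbreviation table and digit spelling here are assumptions for illustration, not the production rules:

```python
# Expand abbreviations and spell out digits before speech generation,
# e.g. "hwy 101" -> "highway one zero one". Tables are illustrative.

ABBREVIATIONS = {"hwy": "highway", "dr": "doctor", "st": "street"}
DIGIT_NAMES = {"0": "zero", "1": "one", "2": "two", "3": "three",
               "4": "four", "5": "five", "6": "six", "7": "seven",
               "8": "eight", "9": "nine"}

def normalize(text):
    """Rewrite raw text into speakable words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            words.extend(DIGIT_NAMES[d] for d in token)  # read digit by digit
        else:
            words.append(token)
    return " ".join(words)
```

Intonation and acoustic modeling then take over, working on these normalized words rather than the raw text.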
2016年DeepMind推出WaveNet时,需要约4小时音频样本才能建模人声。现在只需几分钟录音。关键突破是微调技术,能实现多声音协同训练。谷歌建立了庞大数据库,由专业配音演员朗读相同文本,模型从中学习特定单词的发音方式。
Back when DeepMind launched WaveNet in 2016, you needed about four hours worth of audio samples from a person to model how their voice sounds. But now you can do it with just a few minutes worth of audio. One of the big breakthroughs was a process called fine tuning, which makes it possible to co train voices together. Google has built an enormous dataset with professional voice actors reading out the same text. The model learns from all of these samples how particular words are pronounced.
每新增一个声音到数据库,都会提升所有其他声音的质量。之后只需要一小段新声音样本作为最后的润色,可以说,这能让声音具有该说话者的独特性。
Each new voice that is added to the database results in an improvement to all of the other voices. And all that's then needed is a small sample of a new voice to provide the finishing touches, if you like, that make the voice unique to that person.
这就是我们称之为微调的原因,因为它能基于新增的说话者来优化模型。
That's why we call it fine tuning, because it's a way to just kind of fine tune the model based off that one additional speaker.
因为比如你的声音和我的声音之间的差异,尽管你说话带不同口音,实际上你的声音是男声,但我们在词与词之间的过渡方式会有很多相似之处。
Because the difference between your voice and my voice, for instance, even though you speak in a different accent, yours is a male voice, actually, the way that we roll from one word to another will have lots of similarities.
是的,当然。
Yeah. Of course.
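The co-training-then-fine-tuning recipe described above can be pictured with a toy model: learn shared per-word statistics from a pool of speakers, then nudge them toward one new speaker's small sample. All names and numbers below are invented; this is a schematic of the idea, not WaveNet:

```python
# Toy sketch of fine-tuning: pretrain on many speakers reading the same
# text, then adapt to a new speaker from only a few examples.

def pretrain(multi_speaker_data):
    """Average each word's feature over every speaker in the pool."""
    model = {}
    for recordings in multi_speaker_data.values():
        for word, feature in recordings.items():
            model.setdefault(word, []).append(feature)
    return {w: sum(fs) / len(fs) for w, fs in model.items()}

def fine_tune(base_model, small_sample, weight=0.8):
    """Nudge the shared model toward the new speaker's few examples."""
    tuned = dict(base_model)
    for word, feature in small_sample.items():
        tuned[word] = (1 - weight) * base_model[word] + weight * feature
    return tuned

# Hypothetical per-word "pitch" features for three studio voice actors.
pool = {
    "actor_a": {"hello": 1.0, "world": 0.8},
    "actor_b": {"hello": 1.2, "world": 1.0},
    "actor_c": {"hello": 0.8, "world": 0.6},
}
base = pretrain(pool)                     # shared model from the pool
tuned = fine_tune(base, {"hello": 2.0})   # pulled toward the new speaker
```

Note that words the new speaker never recorded keep their pooled value, which is the sense in which every voice in the database helps every other voice.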
在获得声音提供者至关重要的同意后,只需录制约十分钟的高质量音频,并将其与文字转录匹配,就能用于训练模型。
After getting the all important consent from the person whose voice you're creating, it's as simple as recording around ten minutes of high quality audio and matching those up with written transcripts for the model to be trained on.
由于你作为播客创作者拥有高保真度的优质录音,我们无需让你重返录音室就能完成。我曾有个汉娜·弗莱语音机器人,可以输入任何内容。
Given that you had high fidelity, good recordings of yourself because you are a podcast creator, we were able to do that without having to send you back into the recording studio. I had a Hannah Fry voice bot where I could type anything.
这力量。这力量。大家好,我是对人工智能着迷的数学家、作家兼播客主持人。我是真正的汉娜·弗莱博士本人。
The power. The power. Hi there. I'm a mathematician, author, and podcaster who's fascinated by artificial intelligence. And I'm the real doctor Hannah Fry.
确实很棒。
It's really good.
我知道。确实不错。
I know. It's good.
最尴尬的是它捕捉到了几种我明知自己会用的语调,却从未真正注意到。是啊,你听到那段精彩的部分了吗?
What's really awkward is that it's picked up on a couple of the intonations that I know I must make, but had never really noticed that I'd make. Yeah. Like, did you hear it went, fascinating?
太喜欢了。看,它比你自己还了解你。
I love it. See, it knows you better than you know yourself.
哦,太尴尬了。天啊,我平时说话真是这样的吗?有两个词我觉得发音有点怪。一个是‘数学家’,中间有个音节特别生硬,这完全不是我平时的发音方式。
Oh, so cringe. God, do I sound like that. There were two words that I thought sounded a bit off. One was mathematician. There was, like, a very hard bit in the middle, which I think isn't how I would say the word.
我是数学家、作家兼播客主持人,我是真正的汉娜·弗莱博士。第二处是我名字的发音。就像有人报你电话号码时,本该说‘07813’却说成‘7813’,你会觉得脑子里有什么不对劲。
I'm a mathematician, author, and podcaster, and I'm the real doctor Hannah Fry. The second thing was how I said my name. It's a bit like, you know, when someone reads your phone number back to you, but rather than saying, like, 07813, they're like, 7813, and then you're like, something's gone wrong in my mind.
没错。如果你合成长句子,会发现有些地方听起来怪怪的。文本转语音技术尚未完善,虽然现在很多情况下语音已经听起来非常自然了。
Yeah. Yeah. If you synthesize long sentences, you'll notice there are some things where you'll be like, oh, that sounded a little weird. Text to speech isn't solved yet. We've reached the point where voices sound perfectly natural in many instances.
当前的挑战主要在于系统在特定语境下表现的自然程度。举例来说,如果我需要文本转语音系统说'哦,汉娜,我真的很喜欢你的毛衣',然后用讽刺的语气再说一遍'汉娜。哦,我真的很喜欢你的毛衣'。
And the challenge now is largely about how natural it is given a certain context. For example, if I wanted a text to speech system that would say, Oh, Hannah, I really like your sweater. Now you say it sarcastically. Hannah. Oh, I really like your sweater.
这正是我们的不足之处——如何完整捕捉所有细节?
That's where we're really lacking is like, how do you capture everything?
当我不试图破解千禧年大奖难题时,你总能看到我端着茶杯,配着英式下午茶经典点心司康饼悠闲放松。我的声音有这么气音吗?见鬼。听起来像在打色情电话。另外值得一提的是,我现在和你交谈时这种抑扬顿挫、充满活力的状态,其实是另一种完全不同的发声方式。
When I'm not trying to crack one of the unsolved millennium prize problems, you can find me chillaxing with a cup of tea and that quintessentially British tea time snack, a scone. Am I that breathy? Bloody hell. Sound like I'm on a sex line. The other thing that's worth saying actually about that, the way I'm having a conversation with you now where I'm a little bit more up and down and a bit more energetic, say, is a different sort of voice.
训练使用的音频素材来自我朗读的第一季剧本,所以最终呈现不可避免地会带有那种风格特征。
The audio that this was trained on was the script that I read out for series one. So inevitably then it will end up being in that style.
没错。人们总会要求'让声音听起来开心点'或'让声音显得悲伤些'。但如果没有样本参考,这非常困难,因为模型需要学习'开心的汉娜'该是什么声调。
Exactly. People will be like, make the voice sound happier. Make the voice sound sadder. And that's really hard if you don't have examples, because the model has to learn what happy Hannah sounds like.
不过最终能否开发出这样的系统?让AI理解愉悦的语调与播客剧本朗读的区别,并做出相应调整?
Eventually though, could you have a system where the AI understands how a happy voice differs from one that's reading a podcast script and can make those changes appropriately?
是的,你可以让汉娜整体听起来更愉悦。但对于非常熟悉你的人,他们会察觉出细微的不协调——因为你的声音可能有某些独特习惯,只有听过你实际表达才能学会。比如你开心时总会拉长某个词的发音。
Yeah, you can make Hannah sound generally happier. But for people who know you really well, it's just like, wait, it sounds slightly off, because you might have certain quirks about your voice that can only be learned if you hear how you say something. You know, if you always elongate a certain word when you're happy.
当然,人们都这样。就是很开心。
Sure, as people do. Just so happy.
是啊,我不确定人在兴奋时是不是都这么说话。太棒了!不过每个人都有怪癖。接下来的录音暗示了这项技术可能被危险利用——你可以用语音合成器说任何话。
Yeah, I don't know if that's how you say things when you're excited. Fabulous! I don't know, everyone has their quirks. This next recording hints at how this could be used dangerously. You could use the text to speech synthesizer to say anything.
你好,我是汉娜·弗莱博士。我要告诉你们:UFO是真实存在的。昨天我在花园里发现了草坪上这些奇怪的黑色圆圈。如何确保这项技术不被用于邪恶目的?
Hello, I'm Doctor Hannah Fry, and I'm here to tell you that UFOs are real. When I went into my garden yesterday, I noticed these strange dark circles on my lawn. How can you make sure that it's not used for nefarious purposes?
我们深入思考过这项技术可能被滥用的问题。最关注的是确保人们的声音不会在未经同意的情况下被复制。因此我们没有开源模型,也没有公开数据集,以降低风险。但还有很多很酷的防范措施——比如必须通过脚本才能创建声音,而这个脚本可以是本人说'我同意复制我的声音'。
We've thought a lot about this technology on how it could be abused. I think the thing that we care most about is that people's voices are not created without their consent. That's why we have not open sourced the models, we haven't made the data sets available, to mitigate a lot of those risks. But also there's a lot of cool mitigations. I think one that excites me is that you need a script to be able to create a voice, and you could have that script be you saying that I give consent for my voice to be created.
目前有研究正在给音频添加数字水印。
There is some research that's being done that watermarks audio.
这个构想是不是:在创建人工语音时,刻意嵌入微小的音频特征——或许能用特定软件检测到,但人耳听不见?这样就能辨别出'啊,这个是伪造的'。
Is the idea that in creating this artificial voice, you deliberately imprint tiny audio signatures that you could see with a certain piece of software perhaps, but that are inaudible to the human ear? So then you can go in and say, ah, look, this one is fake.
没错。但问题是水印可能被去除,人们可能不会怀疑真伪。况且很多公司都在发布这类技术——DeepMind并不掌握独家秘方。要完全阻止技术被滥用没有万全之策,只能尽量降低风险。就像现在人们看到照片都会想'这图修过吧?',未来对音频也会是同样态度。无论好坏,人们将不再轻信录音内容。
Exactly. But here's the thing: watermarks could be removed, people might not consider that it's fake, and there are a lot of companies who are releasing this technology. It's not like DeepMind has the secret sauce. There's no surefire, silver-bullet way to stop this technology being used for harm; there are ways to mitigate it. But the same way that we don't trust photos today, where everyone sees a photo and is like, is that photoshopped? I think it's going to be the same with audio. For better or for worse, I think people are going to just not trust what people are saying within an audio recording.
看到人们不负责任地使用这项技术令人遗憾,因为它可能会破坏许多对社会真正有益的用例。
And it's unfortunate to see people using it irresponsibly because it might spoil a lot of use cases that are really helpful for society.
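The watermarking idea discussed above, mixing in a key-derived pattern and later detecting it by correlation, can be sketched as follows. The strength here is exaggerated so the toy detector is reliable on a short clip; real schemes aim to stay inaudible and survive re-encoding:

```python
import random

# Toy audio watermark: embed a secret +/-1 pattern, detect by correlation.

def watermark_pattern(key, n):
    """Deterministic +/-1 pattern derived from a secret key."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def embed(samples, key, strength=0.05):
    """Mix the key's pattern into the audio at low amplitude."""
    pattern = watermark_pattern(key, len(samples))
    return [s + strength * p for s, p in zip(samples, pattern)]

def detect(samples, key, strength=0.05, threshold=0.5):
    """Normalized correlation: ~1 if the watermark is present, ~0 if not."""
    pattern = watermark_pattern(key, len(samples))
    score = sum(s * p for s, p in zip(samples, pattern)) / (strength * len(samples))
    return score > threshold

random.seed(1)
clean = [random.uniform(-0.5, 0.5) for _ in range(20000)]  # stand-in audio
marked = embed(clean, "my-consented-voice")
```

Without the right key, the correlation stays near zero, which is why the scheme needs the detector (and the key) to be trusted, and why removing the pattern defeats it, as the transcript points out.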
扎克里告诉我关于DeepMind与谷歌合作的一个名为'优声计划'的项目,其中使用了WaveNet技术来重建美国橄榄球运动员蒂姆·肖的声音。蒂姆在2013年被诊断出患有ALS(一种导致语言障碍的渐进性神经系统疾病),他是WaveNet技术特别合适的应用对象。
Zachary told me about a partnership between DeepMind and Google called Project Euphonia, in which WaveNet technology was used to recreate the voice of Tim Shaw, an American football player who was diagnosed in 2013 with ALS, a progressive neurological disease that causes speech impairment. Tim was a particularly good candidate for WaveNet.
许多被诊断出ALS的患者会被建议进行语音存储,他们会录制自己的声音以便未来回放,比如他们喜欢唱的歌曲。但并非所有人都会这么做。而蒂姆·肖由于经常接受电视采访,拥有大量自己的录音素材。
A lot of people who get diagnosed with ALS are asked to do some voice banking, where they'll record themselves so that they can replay their voice in the future, like if there's a song that they love to sing, but not everyone does that. And with Tim Shaw, he had a lot of recordings of himself because he was interviewed on TV.
那种赛前令人振奋的电流感。紧张感如蝴蝶振翅,我已经准备好冲撞了——你最好当心点。
It's that amazing pregame electricity. Butterflies are there, and I'm ready to hit somebody. So you might wanna look out.
研究人员用三十分钟的录音素材合成了蒂姆·肖的声音。遗憾的是,当蒂姆与家人坐下来聆听多年后首次重现的自己声音时,他竟难以辨认。
Researchers used thirty minutes of recordings to create Tim Shaw's synthesized voice. Unfortunately, when Tim sat down with his family to hear his own voice for the first time in years, he struggled to recognize it.
我已经太久没有听到自己这样的声音了。
It has been so long since I've sounded like that.
如果你听到自己小时候的录音,可能会疑惑:'我当初是这种声音吗?'但其他人确实记得你的声音特征——他的家人就记得。
If you hear an old recording of yourself when you were a kid, you're like, Did I sound like that? But other people do remember how your voice sounded. His family did.
我想向你解释为什么对我来说,完整说出这个诊断结果如此困难。
I want to explain to you why it's so difficult for me to speak the diagnosis, all of it.
正是他的声音,我已经遗忘了。
It's his voice that I've forgotten.
他的家人哭了,因为这件事太私人了。它是你身份认同的核心部分。
His family cried because it's something that's so personal. It's such a key part of your identity.
要让这项技术更广泛地惠及ALS患者,仍有更多工作要做。目前存在技术难点——像霍金教授使用的辅助沟通设备通常无法联网。遗憾的是,这些模型体积过大,无法在本地设备运行。因此必须能快速将数据传输到服务器才能使其工作。
There is still more work to be done to make this technology more widely accessible to ALS patients. It's tricky at the moment because the augmentative communication devices that people like Professor Stephen Hawking used to speak are generally not connected to the Internet. And unfortunately, these models are far too large to be run locally on a device. So you must be able to quickly send data through to a server to get them to work.
这确实是一个正在重点投入的研发方向。我相信假以时日,人们将能在设备上使用这些语音进行日常交流。
That's a work stream that definitely is being invested in. And I think in time, people will be able to have these voices on their device, so that they can use them to communicate every day.
虽然语音合成是AI开始影响人们生活的一种非常个人化的方式,但还有其他项目关注着无论我们是否愿意都会影响所有人的事物——天气。DeepMind最近与英国国家气象局的研究人员展开了合作。
While speech synthesis is a very personal way in which AI is beginning to touch some people's lives, there are other projects concerned with something that affects all of us, whether we like it or not, the weather. DeepMind has recently teamed up with researchers at the Met Office, The UK's national forecasting service.
或许当人们想到英国气象局时,会联想到气象学家迈克尔·菲什——在我出生前,他曾因著名地预测'不会有风暴'而闻名。
Perhaps when people think of the UK Met Office, they think of Michael Fish, this meteorologist who famously predicted no storm back before I was born.
我是气象局的尼尔·罗宾逊博士。尼尔正在讲述1987年一位天气预报员臭名昭著的案例,当时他向观众保证不会有飓风来袭。
This is doctor Neil Robinson from the Met Office. Neil here is talking about an infamous case of a weather forecaster who, in 1987, assured viewers that there was no hurricane on the way.
下午好。今天早些时候,有位女士打电话给BBC,说她听说飓风要来了。嗯,如果你正在收看,别担心。其实并没有。不过话说回来,实际上那股气流——
Good afternoon to you. Earlier on today, apparently, a woman rang the BBC and said she heard that there was a hurricane on the way. Well, if you're watching, don't worry. There isn't. But having said that, actually, the breath of
这场后来被称为'大风暴'的灾害,最终成为三个世纪以来袭击英格兰东南部的最严重风暴。当然,如今的天气预报建立在极其复杂的数学模型基础上,这些模型需要处理海量数据。
The great storm, as it came to be known, turned out to be the worst storm to hit Southeast England in three centuries. Nowadays, of course, weather forecasting is based on phenomenally sophisticated mathematical models that churn through eye watering amounts of data.
我们拥有全球最强大的超级计算机之一,用于分析大气物理现象以进行天气预报。存放这些超级计算机的大厅有足球场那么大。
We have one of the world's most powerful supercomputers for analyzing the physics of what's going on in the atmosphere to make our weather forecasts. The halls where those supercomputers exist are football pitch size.
这些模型确实存在局限性。
These models do have their limitations.
传统天气预报方法在预测未来几小时到几天内的天气时表现尤为出色。但很多决策需要在比这更短的时间尺度上做出。
So traditional weather forecasting approaches have a real sweet spot about a couple of hours in the future to maybe a few days in the future. But a lot of decisions need to be made on a shorter time scale than that.
这种短期天气预报被称为'临近预报'。
This shorter term weather forecasting is known as nowcasting.
临近预报要解决的问题是预测未来短时间内哪里会下雨以及雨量大小。
Nowcasting is the problem of predicting where is it going to rain and how much just a short window into the future.
这是DeepMind的Raia Hadsell的声音。
This is the voice of DeepMind's Raia Hadsell.
所以我们讨论的是,未来30分钟到几小时内我家附近是否会下雨,并以相当高的分辨率预测降雨区域?
So we're talking just, is it going to rain over my house in the next thirty minutes up to a couple of hours into the future, and predicting at a pretty high resolution where is it going to rain?
能够预知接下来一小时的情况有哪些实际好处?
What are the real benefits of being able to know what's going to happen in the next hour?
我们的理想是在极端洪水事件发生前预警民众,让他们能采取疏散等行动。过去几年英国就有几个显著案例,比如西南部博斯卡斯尔和科弗拉克的洪水事件。
The dream here is to be able to warn people before really extreme flooding events so that they can take action like evacuation. There's been a few notable examples over the last few years of these really extreme rain events in the UK, the flooding at Boscastle and Coverack in the Southwest.
正是这些画面——车辆像软木塞一样在水中漂浮颠簸,
It is these pictures now, with the vehicles bobbing around and just floating along like corks, which
事实上在气候变化背景下,我们相当确定未来会更频繁发生的现象之一就是降雨将变得更加极端。
And actually under climate change, one of the things that we're reasonably confident is going to happen more in the future is that rainfall is going to become more extreme.
问题在于,传统基于物理的预测模型需要如此庞大的计算量,以至于需要足球场大小的超级计算机来处理,等到预测结果出来时往往已经过时。因此研究人员转而采用其他统计方法进行短期预测,包括一种称为光流法的技术——这种1940年代开发的计算机视觉方法,能追踪二维图像上气流的运动轨迹。
The problem here is that the traditional physics based forecasting models involve so much number crunching inside that football stadium sized supercomputer that by the time their forecast is ready, it's already out of date. So researchers use other statistical methods for their short term forecasts, including a technique called optical flow, a computer vision method developed in the nineteen forties, which tracks the movement of air over a two dimensional image.
该方法会分析云层和降水的当前状态,然后通过追踪流线来推测云团未来的移动方向。作为初始方案虽无不妥,但这只是对问题的一阶近似。
It looks at the current state of clouds and precipitation, and then it tries to follow those streamlines to kind of extrapolate where it thinks those clouds are going to go in the future. It's not an unreasonable place to start, but it's quite a sort of first order approximation of the problem.
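That first-order extrapolation can be sketched in one dimension: estimate how the rain field moved over the last step, then assume it keeps moving the same way. The grid, the rain band, and the brute-force shift search are all invented for illustration:

```python
# Toy optical-flow-style nowcast on a 1-D "radar" strip (values = rain rate).

def estimate_shift(prev, curr):
    """Find the integer shift that best maps the previous frame onto the
    current one (brute-force cross-correlation)."""
    n = len(prev)
    best_shift, best_score = 0, float("-inf")
    for shift in range(-3, 4):
        score = sum(prev[(i - shift) % n] * curr[i] for i in range(n))
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift

def extrapolate(curr, shift, steps=1):
    """Advect the current field by the estimated motion."""
    n = len(curr)
    return [curr[(i - shift * steps) % n] for i in range(n)]

prev = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]  # rain band at cells 2-4
curr = [0, 0, 0, 1, 2, 1, 0, 0, 0, 0]  # same band, one cell to the right
shift = estimate_shift(prev, curr)
forecast = extrapolate(curr, shift)     # band carried one more cell along
```

The limitation is visible in the code: the band is only translated, never grown, weakened, or newly formed, which is exactly why storms that build up or dissipate defeat this approach.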
后来某天,拉娅·哈德塞尔在埃克塞特偶遇气象局人员时,突然意识到屏幕上云团沿特定方向移动的描述似曾相识——这与深度学习领域一个经典难题惊人地相似。
And then one day, Raia Hadsell was at a chance meeting in Exeter, chatting to some people from the Met Office, when she realized that this description of clouds moving in a particular direction across a screen rang a bell. It was startlingly similar to a well trodden problem in deep learning.
视频预测是通过分析现有视频帧来推测后续画面的研究领域。就像看到板球挥杆动作时,我能预判球棒将继续摆动。同理,我们可以将降雨视为随时间播放的视频——雷达数据在英国地图上形成动态信息层,显示雨带移动、风暴形成或消散的过程。于是我们想到用视频预测神经网络来解决短期降水预测问题。
Video prediction is an area of research where you take a video and then you just try to predict what the next few frames in that video are going to be. So if I see somebody swinging a cricket bat and then you stop that for a moment, I could sort of say, ah, what's gonna happen next is that that cricket bat is gonna continue to swing through. And you can think about rainfall as being a video that's playing over time where the radar provides this information layer over a map of, say, The UK. As the rain moves along, maybe a storm comes up or a storm dissipates. And so we thought that doing that short term prediction into the future could be solved by using video prediction neural networks.
但在神经网络能用于降水预测前,需要先进行训练。为此,英国气象局动用了降雨雷达系统——这套设备通过电磁脉冲测量降雨位置和强度。
But before the neural network could be used to predict precipitation, it needed to be trained. For that, the Met Office had their rainfall radar, a set of instruments which use electromagnetic pulses to measure the location and intensity of rainfall.
我们获取了英国境内约一年的雷达数据,将其转换成类似电影的视频序列。通过训练不同架构的模型来预测后续画面,初期效果尚可,但神经网络往往只能生成模糊的未来帧。于是我们开始探索其他解决方案,最终发现生成对抗网络效果极佳。
We got about a year's worth of radar data across The UK and turned this into something that looked like a movie, like a video playing. And we started training different types of architectures to just predict the next few frames of video. And this worked alright, but what tended to happen is that the neural network just predicted a blurred out future. So we started looking at other methods to solve this. And the method that has worked extremely well is to use a generative adversarial network.
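The blurred-out futures mentioned above have a simple mathematical root, which this toy calculation illustrates (a made-up 1-D example, not DeepMind's setup): when two sharp futures are equally likely, the squared-error-minimizing single prediction is their average, a smeared field.

```python
# Two equally likely sharp futures for a 1-D "radar" frame.
future_a = [0.0, 1.0, 0.0, 0.0]  # storm drifts left
future_b = [0.0, 0.0, 1.0, 0.0]  # storm drifts right

def mse(pred, target):
    """Mean squared error between two frames."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# The MSE-optimal single prediction for an equal mix of both futures
# is their average: a blur.
blur = [(a + b) / 2 for a, b in zip(future_a, future_b)]

expected_mse_blur = 0.5 * mse(blur, future_a) + 0.5 * mse(blur, future_b)
expected_mse_sharp = 0.5 * mse(future_a, future_a) + 0.5 * mse(future_a, future_b)
```

The blurred guess scores better on average than committing to either sharp future, so a network trained on this loss learns to hedge; a GAN instead rewards outputs that look like a single plausible future.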
这种方法更多见于深度伪造的讨论,因为它能制作出以假乱真的视频,确实是令人担忧的AI技术应用。所以看到GANs(生成对抗网络)用于这个领域时,反而让人倍感欣慰。
This is usually talked about more in the context of deep fakes, because this is a method that can be used to produce extremely realistic fake videos. And this has been a really worrisome use of AI technology. And so it was actually really nice to see that this was an application of GANs.
GAN(生成对抗网络)是一种让两个神经网络相互竞争以生成最逼真图像的巧妙方法。这就像伪造者与警察的配对:伪造者试图制作能骗过警察的图像,如果不够逼真就会被识破并重试。随着时间的推移,这种竞争逐渐提高了图像的准确性。在本案例中,这些图像是对近期天气的预测。
A GAN, or generative adversarial network, is a clever way of having two neural networks compete with each other to produce the most realistic images. It's as though you have a pairing of a counterfeiter and a police officer. The counterfeiter tries to produce an image that will fool the police officer, and if it's not good enough, they'll get caught and have to try again. Over time, that competition gradually increases the accuracy of those images. In this case, those images are predictions of weather in the near future.
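The counterfeiter-versus-police loop can be stripped down to a deterministic toy: the "images" are single numbers, the "discriminator" is a decision boundary, and simple proportional updates stand in for gradient descent. None of this is a real GAN; it only shows the shape of the competition:

```python
# Toy adversarial loop: a generator (counterfeiter) and a discriminator
# boundary (police officer) pushing each other toward the real data.

REAL_MEAN = 5.0       # "real" data lives here
gen_mean = 0.0        # counterfeiter starts far from reality
disc_boundary = 2.5   # police officer's decision boundary

for step in range(500):
    fake = gen_mean
    # Police officer: move the boundary to sit between real and fake.
    disc_boundary += 0.05 * ((REAL_MEAN + fake) / 2 - disc_boundary)
    # Counterfeiter: when caught (below the boundary), move toward it.
    if fake < disc_boundary:
        gen_mean += 0.1 * (disc_boundary - fake)

# The competition drags both values toward the real data at 5.0.
```

As the generator improves, the boundary tightens, which in turn forces the generator closer to the real distribution: the same dynamic, with images instead of numbers, that yields sharp rain fields.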
使用这项技术得到的结果令人震惊。
And using this technique, the results were startling.
它没有生成模糊的雨区,而是产生了非常清晰的降雨线,并逼真模拟了风暴在英国境内的移动轨迹。
Instead of producing blurred out fields of rain, it produced very crisp lines of rain and realistic movements of storms across The UK.
为了准确测试这种AI临近预报相较于光流法的优势,研究人员输入了苏格兰上空的降水雷达图像,要求神经网络预测未来90分钟的降雨模式。他们将预测结果与实际天气观测数据进行对比。
To test out exactly how good this AI nowcasting was compared to the optical flow method, researchers fed in a radar image of precipitation patterns over Scotland and asked the neural network to generate predictions of what the rain pattern would look like over the next ninety minutes. They compared these predictions to observations of how the actual weather turned out.
当我第一次看到这些图像时,我完全分不清哪些是观测数据,哪些是预测结果。
When I first saw these images, it was unclear to me which ones were the observations and which ones were the predictions.
我当时问:这些是相同的图像吗?它们相似得惊人,简直不可思议。
I said, are these the same images? They were so close. It was remarkable.
但你以为是有人搞混了,把同一张图片给了你两次。
But you thought someone had got mixed up and just given you the same picture twice.
它们看起来非常相似。虽然不完美,但非常逼真。
They looked very similar. It wasn't perfect, but it was very realistic.
这些云的结构最终成为预测降雨强度、时间和地点的重要指标。一旦降水落地,另一种模型就会接管,计算水流如何沿山坡下泄并在山谷汇集,可能引发洪水。
The structure of these clouds ends up being an important predictor of exactly how heavy rainfall will be, where and when. And once that precipitation hits the ground, a different type of model takes over, working out how water will run down hills and collect in valleys, potentially causing flooding.
这种临近预报系统的优势之一在于,其输出结果对洪水模型更有价值,因为实际预测具有更精确的细微结构,这意味着当数据输入洪水模型时,有望产生更准确的洪水预测。虽然这套系统目前尚未完全达到理想状态,但确实让我们又向前迈进了一步。
One of the advantages of a nowcasting system like this is that it could mean the output is more useful for those flooding models, because the actual predictions it makes have this more accurate fine structure, which means that when it goes into a flooding model, hopefully, it could lead on to more accurate flooding predictions. We're not necessarily quite there yet with this system, but it certainly has moved us another step along.
GAN模型不仅能提供单一预测,还能生成多种对未来可能情况的估计。
The GAN model doesn't just provide one prediction. It can provide many different estimates of what's going to happen in the future.
通过分析这些不同的可能性,我们可以了解各种极端情景的范围,这对于帮助人们做出平衡决策极具价值。
And by inspecting those different possibilities, we can get an understanding of what the different extremes of the scenarios are, which is really valuable when we're trying to help people make balanced decisions about what they're going to do.
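That ensemble idea in miniature: draw many plausible futures and inspect the spread, not just a single best guess. Each "future" below is a made-up total-rainfall number from a stand-in sampler, not a real GAN sample:

```python
import random

# Toy ensemble forecast: many samples, then look at median vs bad case.

random.seed(42)

def sample_future_rainfall():
    """Stand-in for one generative sample: mm of rain in the next hour."""
    return max(0.0, random.gauss(5.0, 2.0))

ensemble = sorted(sample_future_rainfall() for _ in range(1000))
median = ensemble[len(ensemble) // 2]
p95 = ensemble[int(len(ensemble) * 0.95)]  # a plausible bad-case scenario
```

A decision-maker worried about flooding would act on the bad-case value rather than the median, which is the "balanced decisions" point made above.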
需要做出决策的是英国气象局的气象学家。他们负责评估所有可用信息并制定最终预报。尼尔通过调查了解他们是否更倾向于使用AI工具而非传统方法。
The people who need to make the decisions are Met Office meteorologists. They are the ones who assess all of the information available and construct the final forecast. Neil surveyed them to find out whether they preferred using the AI tool to traditional methods.
他们确实经常选择这种新的深度学习方法而非传统方法,这是一个非常积极的信号。
They really regularly chose this new deep learning methodology over the traditional methodology, which is a really good sign.
临近预报项目是人工智能应用于天气预报领域的第一步尝试,但仍有许多重要挑战需要解决。例如,由于这些机器学习模型基于历史数据,它们难以准确预测真正罕见的极端天气事件。
The Nowcasting project represents a first step in how AI could be used in weather forecasting. But there are still important challenges to iron out. For instance, because these machine learning models are based on what has gone before, they're not good at forecasting really unusual extreme weather events.
作为预报员,事件越罕见,我们越有兴趣预测它。这正是传统天气预报方式的伟大之处。我认为这也是为什么在气象学家看来,深度学习永远不会完全取代基于物理的模型。实际上,我认为未来真正的方向是混合方法——将物理知识与深度学习方法的力量相结合。
As forecasters, the more rare an event is, the more interested we are in forecasting it. And that's one of the great things about the traditional way we do weather forecasting. I think it's also why, in the view of meteorologists, deep learning is never going to replace the physics based models. I actually think the future is really for a hybrid approach where we're able to take the physical knowledge and combine that with the power of deep learning methodologies.
由于这些局限性以及神经网络无法详细解释所有预测结果,它们尚未被纳入英国气象局的官方预报系统。但与DeepMind的合作让我们得以窥见未来:人工智能技术将增强专业气象学家的能力。当然,预测能力对解决各类现实问题都很有用。但在这个以精彩不可预测性风靡全球的足球运动中,AI表现如何?去年DeepMind与英国利物浦足球俱乐部合作发表了关于AI如何变革足球的论文。
Because of these limitations and the fact that neural networks cannot explain all of their predictions in detail, they aren't yet being incorporated into the Met Office's official forecasts. But the collaboration with DeepMind has provided a glimpse of a future in which artificial intelligence technologies augment the capabilities of trained meteorologists. Of course, being able to make predictions is useful for all kinds of real world problems. But how does AI fare in a game that is adored around the world for its glorious unpredictability? Last year, DeepMind published a paper on how AI could transform football in collaboration with Liverpool Football Club here in The UK.
你问为什么选择利物浦?那你最喜欢的足球队是哪支?
Why Liverpool, you ask? Well, who's your favorite football team?
利物浦。我热爱利物浦队,他们的每场比赛我都看。
Liverpool. I love Liverpool. I watch every one of their matches.
原来DeepMind的首席执行官德米斯·哈萨比斯是红军利物浦的终身球迷,这谁能想到呢?
Turns out that DeepMind CEO, Demis Hassabis, is a lifelong fan of the Reds. Now who would have thought it?
我知道没人会相信,但确实是他们主动联系了我们。我们当然立即抓住了这个机会——何况他们目前拥有全球体育界最顶尖的分析团队之一。作为合作的一部分,我们还参观了训练基地,这显然是必要的。
I know no one will believe this, but they approached us. Of course, we jumped at the chance, and they happen to have one of the best analytics teams in the world of sport currently. Of course, we got a tour of the training ground, which we needed, obviously, to have as part of the collaboration.
那天你有空吗?
Were you free that day, though?
我那天碰巧有空,真是奇迹。
I happened to be free that day, miraculously.
当然,通过数据分析足球比赛并非新鲜事。但近年来真正的变化在于可用数据的海量增长——从监测球员位置的计算机视觉算法,到捕捉球员动作的运动传感器。这篇足球论文的作者之一卡尔·图尔斯任职于DeepMind巴黎办公室。未来五年,这项足球研究的重要目标之一,是开发名为'自动视频助理教练'(简称AVAC)的AI系统原型。
Of course, crunching data to analyze a game like football is nothing new. What has changed in recent years, though, is the sheer amount of data available: everything from computer vision algorithms monitoring players' positions to motion sensors picking up on players' movements. Karl Tuyls, one of the authors on the football paper, is based at DeepMind's Paris office. Over the next five years, one of the big ambitions of this football work is to build a prototype of an AI system known as an automated video assistant coach, or AVAC for short.
这套系统能无缝整合多种数据模式,包括原始视频素材、追踪数据、事件流数据,以及球员佩戴的各种传感器,从而辅助教练决策。
This is basically a system that seamlessly integrates several data modalities, like raw video footage, tracking data, event stream data, all sorts of sensors that the players are wearing to assist coaches with their decision making.
这里涉及多种关键技术:计算机视觉能解析比赛画面中的动态;博弈论专注于最大化对阵优势;统计学习方法则可挖掘历史比赛中的规律。将这些技术整合后,自动化教练就能进行反事实推演——预测特定战术调整或球员替换可能带来的比赛变化。
There are a few different techniques that are useful here. There's computer vision, which can detect what's going on in footage from a football game. Then there's game theory, which is all about maximizing your advantage over an opponent. And then there are statistical learning methods, which can hunt for patterns in previous games. Put them together, and this automated coach could make counterfactual predictions of what would happen in the game if a particular tactical change is made or a certain player is replaced.
假设利物浦队在英超关键战中遭遇宿敌曼城,教练就能利用AI系统实时监测比赛并提供战术反馈。
Say for argument's sake that Liverpool FC are up against arch rivals Manchester City in a big Premier League game. Liverpool's coach could use the AI system to monitor the match and provide tactical feedback in real time.
教练可以这样询问系统:'嘿AVAC,如果把萨拉赫从边锋调到中锋位置会怎样?或者让法比尼奥从后卫改打中场?'这类反事实推演对教练极具价值——我们能根据上半场表现来模拟这些调整吗?
So a coach might say to the system, Hey, AVAC, what will happen if I move Salah from a winger position to a striker position? Or what if we move Fabinho from defense to midfield? So it's these counterfactual questions that are really interesting for a football coach: can we play that out based on what we've seen in the first half?
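At its crudest, a what-if query like that reduces to comparing predicted team quality under two lineups. Everything below (the ratings, the positions, the additive strength model) is invented for illustration; a real system would learn these effects from tracking and event data:

```python
# Toy counterfactual query: how does team strength change if one player
# switches position?

BASE_LINEUP = {
    "fabinho": "defense",
    "salah": "winger",
}

# Hypothetical positional effectiveness ratings: (player, position) -> score.
RATINGS = {
    ("fabinho", "defense"): 0.70,
    ("fabinho", "midfield"): 0.85,
    ("salah", "winger"): 0.90,
    ("salah", "striker"): 0.80,
}

def team_strength(lineup):
    """Additive stand-in for a learned match-outcome model."""
    return sum(RATINGS[(player, pos)] for player, pos in lineup.items())

def counterfactual(lineup, player, new_position):
    """What-if query: strength change if one player switches role."""
    changed = dict(lineup)
    changed[player] = new_position
    return team_strength(changed) - team_strength(lineup)

delta = counterfactual(BASE_LINEUP, "fabinho", "midfield")  # positive: helps
```

A learned model would replace the additive ratings with something that captures interactions between players, but the query shape (compare two simulated worlds) is the same.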
教练可以观看一个简化的模拟视频,其中移动的点代表不同情景下球员的可能轨迹。这里的理念并非取代人类分析师,而是为他们提供另一种强大的分析工具。
The coach could then be shown a simplified simulation, a video with dots moving across the pitch to indicate possible player trajectories in different scenarios. The idea here is not to replace human analysts, but to complement them with another powerful analytical tool.
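The "dots moving across the pitch" can be pictured as a simple trajectory rollout. The sketch below is purely illustrative, assuming constant player velocities and a standard 105 by 68 metre pitch; a real simulation would draw candidate trajectories from a learned model of player behaviour.

```python
# Illustrative rollout of "dots on a pitch": step (x, y) player positions
# forward under constant velocity, clamped to the pitch boundary.
def rollout(positions, velocities, steps, dt=0.1, pitch=(105.0, 68.0)):
    """Return a list of frames; each frame is a list of (x, y) positions."""
    frames = [list(positions)]
    for _ in range(steps):
        positions = [
            (min(max(x + vx * dt, 0.0), pitch[0]),
             min(max(y + vy * dt, 0.0), pitch[1]))
            for (x, y), (vx, vy) in zip(positions, velocities)
        ]
        frames.append(positions)
    return frames

# Two players: one drifting right, one moving diagonally up-left.
frames = rollout(positions=[(50.0, 34.0), (60.0, 30.0)],
                 velocities=[(2.0, 0.0), (-1.0, 1.0)],
                 steps=5)
print(frames[-1])  # final dot positions after 0.5 simulated seconds
```

Each frame of the output corresponds to one set of dots the coach would see animated across the screen.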
AVAC系统仅会提供建议,并给出它认为可能合适的行动方案。但最终决定权仍在教练手中,当然球员们也需要根据建议采取行动。
The AVAC is just going to give advice and is going to say what it believes is maybe a good action to take. And in the end, it's up to the coach and it's of course also still up to the players to act upon that.
这类系统的价值不仅限于比赛期间。它还能在赛后训练中发挥作用,精确指出球员选择射门而非传球的关键时刻。虽然DeepMind目前的研究聚焦于为教练和球队开发新型分析工具,但卡尔·图尔斯认为AI也有提升足球迷观赛体验的潜力。
It's not just during a game that such a system could be useful. It could help in post-match training too, highlighting the exact moment when it would have been better for a player to pass rather than take a shot at goal. Although DeepMind's research is currently focused on new analytics tools for coaches and teams, Karl Tuyls believes there are also ways in which AI could enhance the experience of football fans.
目前电视观众只能听到专业解说员的评论。但借助新技术,解说内容可以根据个人兴趣实现定制化——比如你可以向AI询问关于比赛战术的问题。在更远的未来,球迷或许能通过球场屏幕或VR设备获得增强现实体验,例如感受球场氛围。
Currently, when a fan watches a game on TV, there will be expert commentary. But with new technology, this could become personalized expert commentary, tailored to your own interests, maybe based on what sort of questions you would ask the AI about the game, for example on tactics. Maybe in a more distant future, fans will have access to a screen in the stadium, or VR, that augments their experience, for example, getting a feel for the pitch.
不难想象,更个性化的球迷体验和更精准的球队预测将对未来足球产生深远影响。但正如我之前所说,足球的魅力恰恰在于其难以预测性。
It's easy to see how a more personalised experience for fans and improved predictions for teams could have an impact on football in future. But as I said before, football is a joyously unpredictable game.
我认为我们永远无法准确预测比赛结果。这本质上是因为教练的场外决策和球员的场上判断仍掌握在人类手中,对吧?所以数据信号始终存在噪声。
I don't think we will be able to predict the outcome of a game accurately at any point in time. And this is simply because the decision making off the pitch by coaches and on the pitch by the players is still in the hands of humans, right? So the signal is still noisy.
过度依赖AI来解读这项充满人性化的运动也存在问题。2020年苏格兰因弗内斯足球俱乐部曾尝试用自动追踪足球的摄像机直播比赛,结果在某场比赛中,摄像机竟把边裁的光头误认为足球持续跟拍。这类计算机视觉系统对不同球员的追踪能力差异也引发了担忧。
And there are problems with relying too much on AI for what is at heart a deeply human game. In 2020, the Scottish football team Inverness Caledonian Thistle FC announced that it would live stream its games via cameras, which automatically track the football to give viewers the best view of the action. During one game, the automatic camera seemed much more interested in following the linesman around the pitch. Turned out, it had mistaken the linesman's bald head for a football. And there are concerns that computer vision systems like these might be much better at tracking some players than others.
现有系统对女子体育赛事的捕捉能力远不如男子赛事。
The current systems don't capture the events that happen in women's sports as well as they do for men's sports.
这是杰克逊·布罗希尔,足球论文的另一位作者。
Here's Jackson Brochier, another author on the football paper.
因此,即使我们试图以与男子数据同等的方式对女子数据进行前瞻性研究,用于训练的视频中标识动作的标签实际上准确度要低得多。
So even where we're trying to do proactive research on women's data in an equal way to men's data, the labels that identify what's happening in the videos that we use for the training are actually much less accurate.
对了解AI偏见问题的人来说,这听起来可能是个熟悉的故事。为了真正擅长分析足球运动员和球队的表现,AI系统需要观看数百小时的足球比赛录像。所有这些数据都需要人工标注,告诉系统每一帧发生了什么。问题在于女子足球领域,可用于训练的标注数据量远远不足。你可能会疑惑为什么足球运动员的性别在这里会有关联。
To those clued up about the problem of bias in AI, this might sound like a familiar story. In order to get really good at analyzing the performance of football players and teams, an AI system would need to watch hundreds of hours of video footage from football matches. All of this data then needs to be manually annotated to tell the system what is going on in each frame. The trouble is, when it comes to women's football, there is not nearly as much annotated data to train on. You might be wondering why the gender of a football player would even be relevant here.
但正如我们在其他众多AI系统中看到的,有时即便是微小差异——比如运动员的体型——都可能导致AI对这些比赛的预测准确度下降。这里存在一个更广泛的问题:目前AI系统的能力完全取决于其训练数据,这意味着如果数据集中缺失某个群体,其影响可能是巨大的。
But as we've seen in numerous other AI systems, sometimes even small differences, perhaps the body frame of the players could be enough to mean that the AI's predictions on those games would end up being less accurate. There is a broader point here. Currently, AI systems are only as good as the data they're trained on, which means if a particular group is missing from your dataset, the implications can be huge.
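The effect of a group being underrepresented in training data can be shown with a toy statistical experiment, under made-up numbers and nothing to do with DeepMind's actual pipeline: estimating a simple quantity (here, a group's mean feature value) from 5 examples is reliably noisier than estimating it from 500, and a model built on the noisier estimate will serve that group worse.

```python
import random

random.seed(0)

def centroid_error(n, true_mean=0.0, trials=200):
    """Average absolute error of a sample-mean estimate from n noisy draws,
    averaged over many repeated trials."""
    total = 0.0
    for _ in range(trials):
        xs = [true_mean + random.gauss(0, 1.0) for _ in range(n)]
        total += abs(sum(xs) / n - true_mean)
    return total / trials

err_majority = centroid_error(n=500)  # well-represented group
err_minority = centroid_error(n=5)    # under-represented group
print(f"avg estimation error with 500 examples: {err_majority:.3f}")
print(f"avg estimation error with   5 examples: {err_minority:.3f}")
```

The minority group's error comes out roughly ten times larger, matching the square-root-of-n scaling of the standard error; the same mechanism, at much larger scale, is one reason models trained mostly on one group perform worse on others.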
当首次发布面部解锁手机时,如果训练算法使用的图像中白皮肤人群多于黑皮肤人群,那么系统就会更擅长识别前者。我们在体育领域要做的,就是确保开发的解决方案不会因训练数据量的差异而对性别、肤色或其他特征产生偏见。
When they first released phones that would unlock from your face, if the images they used to train those algorithms included more people with white skin than with black skin, then the system learned how to identify those faces better. What we want to do on the sports side is make sure that the solutions that we're developing are not biased by gender or skin tone or any other variances in the volume of data that we're learning from.
研究人员目前正在考虑技术方案来解决女子足球数据不足的问题。但这些只是针对AI系统偏见这个更大问题的特定修补。尽管这里的工作人员坚信AI在现实世界中部署的益处,但新技术也可能带来需要谨慎应对的不良后果。在下一期DeepMind播客中,我们将深入探讨DeepMind为确保AI进入现实世界时能普惠所有人所做的努力。
Researchers are currently considering technical solutions to address the lack of data from women's football. But these are specific fixes to a much larger problem of bias in AI systems. And as much as those working here believe firmly in the benefits of deploying AI in the real world, there are also potentially unwelcome consequences to new technologies that have to be carefully navigated. In the next episode of the DeepMind Podcast, we'll be taking a closer look at the efforts at DeepMind to make sure that when AI reaches the real world, it works for everyone.
我们知道历史上的某些时期曾对特定群体造成伤害。对吧?如果我们以这种视角审视现代科技,就会发现AI的某些应用呈现出非常相似的模式。
We know that periods of history have caused harms to specific communities. Right? And if we look at modern technology through that lens, we see very similar patterns and certain uses of AI.
本期节目就到这里。我得赶紧走了,因为天气预报说快要下雨了。片尾字幕就交给我的WaveNet语音来念吧?《DeepMind》播客由汉娜·弗莱主持。特别感谢本期嘉宾——DeepMind工程师诺曼·卡萨格兰德,他抽空为汉娜制作了WaveNet语音。
And that is all for this episode. Gotta run, because the forecast says rain's on the way. I'll leave the credits to my WaveNet voice, shall I? DeepMind, the podcast, is presented by Hannah Fry. Special thanks for this episode go to Norman Casagrande, the engineer at DeepMind who found the time to create Hannah's WaveNet voice.
本系列节目由Whistledown Productions的丹·哈东担任制片人。下次节目再见,再见。
The series producer is Dan Hardoon of Whistledown Productions. Until next time, goodbye.