AI Voice Actors Sound More Human Than Ever

英语世界, 2023 Issue 2

A new wave of startups is using deep learning to build synthetic voice actors for digital assistants, video game characters, and corporate videos.

The company blog post drips with the enthusiasm of a ’90s US infomercial. WellSaid Labs describes what clients can expect from its “eight new digital voice actors!” Tobin is “energetic and insightful.” Paige is “poised and expressive.” Ava is “polished, self-assured, and professional.”

Each one is based on a real voice actor, whose likeness (with consent) has been preserved using AI. Companies can now license these voices to say whatever they need. They simply feed some text into the voice engine, and out will spool a crisp audio clip of a natural-sounding performance.

WellSaid Labs, a Seattle-based startup that spun out of the research nonprofit Allen Institute for Artificial Intelligence, is the latest firm offering AI voices to clients. For now, it specializes in voices for corporate e-learning videos. Other startups make voices for digital assistants, call center operators, and even video game characters.

Not too long ago, such deepfake voices had something of a lousy reputation for their use in scam calls and internet trickery. But their improving quality has since piqued the interest of a growing number of companies. Recent breakthroughs in deep learning have made it possible to replicate many of the subtleties of human speech. These voices pause and breathe in all the right places. They can change their style or emotion. You can spot the trick if they speak for too long, but in short audio clips, some have become indistinguishable from humans.

AI voices are also cheap, scalable, and easy to work with. Unlike a recording of a human voice actor, synthetic voices can also update their script in real time, opening up new opportunities to personalize advertising.

How to fake a voice

Synthetic voices have been around for a while. But the old ones, including the voices of the original Siri and Alexa, simply glued together words and sounds to achieve a clunky, robotic effect. Getting them to sound any more natural was a laborious manual task.
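
For contrast, here is a minimal sketch of that older concatenative approach, assuming a tiny, made-up clip library: pre-recorded word clips are simply looked up and stitched end to end with a fixed pause in between, which is part of why the result sounds clunky at word boundaries. Nothing here reflects how Siri or Alexa were actually implemented.

```python
import numpy as np

SAMPLE_RATE = 16_000  # hypothetical sample rate for the toy clip library

# Hypothetical library of pre-recorded word clips (1-D arrays of audio samples).
# In a real concatenative system these would be studio recordings of each word.
clip_library = {
    "hello": np.random.default_rng(0).uniform(-1, 1, SAMPLE_RATE // 2),
    "world": np.random.default_rng(1).uniform(-1, 1, SAMPLE_RATE // 2),
}

def concatenative_tts(text: str) -> np.ndarray:
    """Glue pre-recorded word clips together with a fixed pause between them."""
    pause = np.zeros(SAMPLE_RATE // 10)  # a rigid 100 ms gap, one source of the robotic feel
    pieces = []
    for word in text.lower().split():
        pieces.append(clip_library[word])  # no smoothing across word boundaries
        pieces.append(pause)
    return np.concatenate(pieces)

audio = concatenative_tts("hello world")
print(f"{audio.size} samples, about {audio.size / SAMPLE_RATE:.2f} seconds")
```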

Deep learning changed that. Voice developers no longer needed to dictate the exact pacing, pronunciation, or intonation of the generated speech. Instead, they could feed a few hours of audio into an algorithm and have the algorithm learn those patterns on its own.
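
As a hedged illustration of that shift, the sketch below shows the general shape of such a training loop: paired text and audio features go in, and the model adjusts its own parameters to match the recordings, so pacing and intonation are learned from data rather than hand-specified. The toy dataset, model, and loss are stand-ins, not WellSaid Labs’ (or anyone’s) actual system.

```python
import torch
from torch import nn

# Stand-in dataset: pairs of (text features, target acoustic features).
# In practice these would come from hours of transcribed studio recordings.
text_feats = torch.randn(64, 32)    # 64 toy utterances, each as a 32-dim text encoding
audio_feats = torch.randn(64, 80)   # 80-dim acoustic targets (e.g. spectrogram frames)

model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 80))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):
    optimizer.zero_grad()
    predicted = model(text_feats)            # the model proposes acoustic output from text
    loss = loss_fn(predicted, audio_feats)   # compared against the real recordings
    loss.backward()                          # the algorithm adjusts itself from the data
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```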

Over the years, researchers have used this basic idea to build voice engines that are more and more sophisticated. The one WellSaid Labs constructed, for example, uses two primary deep-learning models. The first predicts, from a passage of text, the broad strokes of what a speaker will sound like, including accent, pitch, and timbre. The second fills in the details, including breaths and the way the voice resonates in its environment.
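
That division of labor can be sketched as a two-stage pipeline: a first model maps text to a coarse, frame-by-frame acoustic plan (pitch, energy, and timbre-like features), and a second, vocoder-style model expands that plan into a full waveform with the fine detail. The code below is a toy stand-in with made-up shapes and random numbers; it mirrors only the structure described above, not WellSaid Labs’ actual models.

```python
import numpy as np

rng = np.random.default_rng(42)

def acoustic_model(text: str) -> np.ndarray:
    """Stage 1 (stand-in): predict the broad strokes of the performance --
    a frame-by-frame sketch of pitch, energy, and timbre -- from a passage of text."""
    n_frames = 20 * len(text.split())           # rough pacing: 20 frames per word
    return rng.standard_normal((n_frames, 80))  # 80 coarse acoustic features per frame

def vocoder(acoustic_plan: np.ndarray, sample_rate: int = 16_000) -> np.ndarray:
    """Stage 2 (stand-in): fill in the details -- breaths, room resonance --
    by expanding each coarse frame into raw audio samples."""
    samples_per_frame = sample_rate // 100       # 10 ms of audio per frame
    fine_detail = rng.standard_normal((acoustic_plan.shape[0], samples_per_frame))
    return (fine_detail * acoustic_plan.mean(axis=1, keepdims=True)).ravel()

plan = acoustic_model("Tobin is energetic and insightful.")
waveform = vocoder(plan)
print(plan.shape, waveform.shape)  # coarse plan vs. full-resolution waveform
```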

Making a convincing synthetic voice takes more than just pressing a button, however. Part of what makes a human voice so human is its inconsistency, expressiveness, and ability to deliver the same lines in completely different styles, depending on the context.

Capturing these nuances involves finding the right voice actors to supply the appropriate training data and fine-tune the deep-learning models. WellSaid says the process requires at least an hour or two of audio and a few weeks of labor to develop a realistic-sounding synthetic replica.

AI voices have grown particularly popular among brands looking to maintain a consistent sound in millions of interactions with customers. With the ubiquity of smart speakers today, and the rise of automated customer service agents as well as digital assistants embedded in cars and smart devices, brands may need to produce upwards of a hundred hours of audio a month. But they also no longer want to use the generic voices offered by traditional text-to-speech technology, a trend that accelerated during the pandemic as more and more customers skipped in-store interactions to engage with companies virtually.

“If I’m Pizza Hut, I certainly can’t sound like Domino’s, and I certainly can’t sound like Papa John’s,” says Rupal Patel, a professor at Northeastern University and the founder and CEO of VocaliD, which promises to build custom voices that match a company’s brand identity. “These brands have thought about their colors. They’ve thought about their fonts. Now they’ve got to start thinking about the way their voice sounds as well.”

Whereas companies used to have to hire different voice actors for different markets (the Northeast versus Southern US, or France versus Mexico), some voice AI firms can manipulate the accent or switch the language of a single voice in different ways. This opens up the possibility of adapting ads on streaming platforms depending on who is listening, changing not just the characteristics of the voice but also the words being spoken. A beer ad could tell a listener to stop by a different pub depending on whether it’s playing in New York or Toronto, for example. Resemble.ai, which designs voices for ads and smart assistants, says it’s already working with clients to launch such personalized audio ads on Spotify and Pandora.
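
A hedged sketch of what that personalization logic might look like: the ad copy is assembled per listener’s market and only then handed to a voice engine to render. The synthesize function and the pub names below are hypothetical placeholders, not Resemble.ai’s real API.

```python
# Hypothetical per-listener ad personalization: the words change with the market,
# then any text-to-speech engine renders the final audio.
AD_TEMPLATE = "Grab a cold one tonight at {pub}, just around the corner."

LOCAL_PUBS = {  # made-up venues, for illustration only
    "New York": "the Hudson Yards Taproom",
    "Toronto": "the Queen Street Alehouse",
}

def build_ad_copy(city: str) -> str:
    pub = LOCAL_PUBS.get(city, "your nearest pub")  # fall back for unlisted markets
    return AD_TEMPLATE.format(pub=pub)

def synthesize(text: str) -> bytes:
    """Placeholder for a real voice-engine call (a vendor TTS API would go here)."""
    return text.encode("utf-8")  # stands in for rendered audio bytes

for city in ("New York", "Toronto", "Austin"):
    audio = synthesize(build_ad_copy(city))
    print(city, "->", len(audio), "bytes of placeholder 'audio'")
```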

But there are limitations to how far AI can go. It’s still difficult to maintain the realism of a voice over the long stretches of time that might be required for an audiobook or podcast. And there’s little ability to control an AI voice’s performance in the same way a director can guide a human performer.

A human touch

In other words, human voice actors aren’t going away just yet. Expressive, creative, and long-form projects are still best done by humans. And for every synthetic voice made by these companies, a voice actor also needs to supply the original training data.

For VocaliD’s Patel, the point of AI voices is ultimately not to replicate human performance or to automate away existing voice-over work. Instead, the promise is that they could open up entirely new possibilities. What if in the future, she says, synthetic voices could be used to rapidly adapt online educational materials to different audiences? “If you’re trying to reach, let’s say, an inner-city group of kids, wouldn’t it be great if that voice actually sounded like it was from their community?” ■
