I have yet to hear an example of a good-quality voice deepfake. Every example I've heard so far has been completely unconvincing, absolute trash compared to human impersonators. Speaking of which, human voice impersonation has been possible, and quite convincing, forever.
Audio deepfakes, when they finally arrive for real, will simply reduce the effort required for voice impersonation. They do not make possible new feats of impersonation that were never possible before. They can still have an effect on society for sure, but I believe the concern is out of proportion with the effect.
If you look at the latest research in neural-network-based text-to-speech, you'll find plenty of examples of realistic synthetic voices. However, due to the nature of the available training data (audiobooks), these models 1) are restricted to reading text in a narrative style and 2) aren't generally trained on the speech of recognizable public figures. There was a model a while back (MelNet) that was trained on TED talks and produced fairly convincing voices, some of which you may recognize (Bill Gates, Stephen Wolfram, and some others; see https://sjvasquez.github.io/blog/melnet/). Definitely not 100% convincing, but I wouldn't call it absolute trash.
The effort will be reduced to typing sentences and selecting emotions from a dropdown menu. That opens up a whole new can of worms for a society that already has trust issues.
Making people doubt untrustworthy things by educating them through a proper process is good. Exposing an unsuspecting, uneducated population to manipulated content designed to deceive them for nefarious purposes is not an ideal process if the goal is for the general public to have better judgement.
This is a global issue. For examples, look into the societal concerns surrounding 'deep fakes' and 'fake news'; they answer your first two questions.
Pitting people against each other in a social game of who can be fooled the easiest sounds like a more effective approach than reading a dry textbook at them. There's shame in being seen as gullible. This happens already in casual conversation, and people already have plenty of opportunities to believe false information from all over the internet, in books, in tabloid newspapers, or told to them by their peers or parents.
I don't believe the concerns about deep fakes and fake news. I think they're no worse than what we've always had. Those fears are exaggerated by arrogant hand-wringers who somehow assume their own beliefs are exempt from manipulation and that it's only intellectually inferior "other" people who are susceptible. They're so fixated on popular political ideas that they forget they should be on a crusade against religion, which is the elephant in the room when it comes to people being misled.
I wouldn’t call the voices unconvincing. They used to have a better demo, though, where it would impersonate celebrities like Trump. It was very interesting to try out.
I am using Descript for a project of mine. You can definitely still hear that the voices are AI-generated, but I'm sure it's not gonna take long. Couple this with SSML (Speech Synthesis Markup Language) and it's going to be quite convincing.
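For anyone who hasn't seen it, SSML is a W3C markup standard for scripting pacing, pitch, and emphasis in synthesized speech; services like Amazon Polly and Google Cloud TTS accept it. A purely illustrative fragment (the text is made up, and exact element support varies by service):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
  <p>
    <s>I never said that.</s>
    <!-- slow down and lower the pitch for a more deliberate delivery -->
    <s><prosody rate="slow" pitch="-10%">And you <emphasis level="strong">know</emphasis> it.</prosody></s>
    <break time="500ms"/>
    <s><prosody volume="soft">Don't call me again.</prosody></s>
  </p>
</speak>
```

That level of control over delivery is exactly why "typing sentences and selecting emotions" is a plausible near-term workflow.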
> They can still have an effect on society for sure, but I believe the concern is out of proportion with the effect.
I believe the contrary: that high-quality, convincing audio (and video) deepfakes will be cataclysmic for society. Voice audio is one of the last digital artifacts that people will believe with their senses.
When Donald Trump’s “grab them by the p—“ audio clip came out, he was at least forced to acknowledge it.
Once high-quality audio deepfake tools become available, amateurs will be able to flood the web with fake audio clips of politicians saying things they never said, and truly unscrupulous politicians will gain technological cover for any damning audio clips of theirs that get leaked.
Once we cannot trust our senses, we will only be able to trust our institutions, and then our tribes. But trust in institutions is already decaying rapidly, so in the end only ideological tribes will be left.
Automating the sending of fake emails allowed any idiot to do it.
There is a difference in scale and effect between every idiot in the world being able to fake as many impersonations of people as they want (and coordinate that impersonation with hundreds of others who share a goal) and Rich Little calling up somebody's granny and claiming to be Jimmy Stewart.
The Jordan Peterson deepfake was quite good. I made a few sound bites to send to some friends as a gag, but they were already aware it was in the wild. The compression is probably the worst part, but it was fun while it was up. https://www.notjordanpeterson.com/
There’s a whole other avenue that’s not covered here: theft of voice actors’ voices. It was covered by a voice actor on YouTube as a warning to other actors. The first video is only two minutes or so; the second is a bit over ten but goes into more detail.
Polish modders of the game Gothic (a very popular RPG in Poland) are using "AI" to reproduce the voices of the Polish voice actors who did the original dub of the game. The quality isn't there yet and the voice clarity is all over the place, but it's already a much better option than using a generic "Microsoft Sam" kind of voice. And when the technology matures, it will allow modders to have in-game characters sound as they "should sound like", instead of being silent or resorting to amateur dubbing as they do now.
It may be a grey zone now, but I find it likely that judges will decide that copyright does cover the sound of someone's voice and other aspects (intonation, accent), as these are vital aspects of the performance and often the reason a voice actor gets hired.
I'd go much further than that. Unique features of someone's voice and appearance (face, walk) should not only be protected by copyright but also by privacy laws. It would be crazy if producers were allowed to make arbitrary videos with anyone's appearance and voice, putting words in your mouth you'd never endorse.
Luckily, this may be one of the cases where politics reacts swiftly, if necessary with new laws, since politicians would likely be among the first victims of the new trend.
They are free to make money during the term of the copyright, but that term should be reduced by a lot. And I don't think the definition of copyright should be extended to include someone's voice, style, appearance, recipes, or others that I'm leaving out.
Is it an extension? (I know technically it is). Voice could be captured as accent/pitch/tone... I guess it’s a problem as there would be too much natural overlap, so not really a novel expression.
Maybe voice would be better suited to trademark protection? After all, it isn't really about accent/pitch/tone, but about being sure that if something sounds like it's voiced by $VoiceActor, it's actually voiced by $VoiceActor (and $VoiceActor gets their cut of royalties/recognition).
There's probably a trademark / misrepresentation / self image rights angle that could be argued, same as with say faces. That wouldn't seem unfair to me. It should require a lot more than just pitch similarity imo.
I've had quite an odd idea this past year as video conferencing has taken off massively both for public figures and private individuals.
...what if you had a malicious video conferencing service that was free to all?
Because it's free, it gains mass adoption; the service can then target a person it is interested in and record all of their audio and video footage.
Then, using this captured data, train a real-time puppet.
You could then intercept any call that person is on and 'take over' and push your own agenda, or call people up and do the same, pretending to be that person.
Something like that would add value to ongoing OTP-style ways of verifying that a person is who they say they are.
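To make that last point concrete: one standard form of such verification is a time-based one-time password (TOTP, RFC 6238, built on HOTP from RFC 4226). A minimal sketch in Python using only the standard library; the secret below is the RFC test key, purely illustrative:

```python
import hashlib
import hmac
import struct
import time

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP (RFC 4226): HMAC-SHA1 over a big-endian counter, dynamically truncated."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                                  # low nibble picks the window
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(secret: bytes, step: int = 30, digits: int = 6) -> str:
    """TOTP (RFC 6238): HOTP with the counter derived from the current time."""
    return hotp(secret, int(time.time()) // step, digits)

# RFC 4226 test vector: counter 0 with this key yields "755224"
print(hotp(b"12345678901234567890", 0))
```

The point is that the shared secret is exchanged out of band (in person, not over the channel an attacker may control), so a real-time voice puppet cannot produce a valid code on demand.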
I have a feeling the text-to-speech market is suppressed. From an end user's perspective, using text-to-speech services such as Google's or Amazon's to turn an eBook into an audiobook costs more than buying the audiobook through a subscription service.
Deepfake audio sounds better than most freely available text-to-speech models.