I have yet to hear an example of a good-quality voice deepfake. Every example I've heard so far has been completely unconvincing, absolute trash compared to human impersonators. Speaking of which, human voice impersonation has been possible, and quite convincing, forever.
Audio deepfakes, when they finally arrive for real, will simply reduce the effort required for voice impersonation. They do not make possible new feats of impersonation that were never possible before. They can still have an effect on society for sure, but I believe the concern is out of proportion with the effect.
If you look at the latest research in neural-network-based text-to-speech, you'll find plenty of examples of realistic synthetic voices. However, due to the nature of the available training data (audiobooks), these models 1) are restricted to reading text in a narrative style and 2) aren't generally trained on the speech of recognizable public figures. There was a model a while back (MelNet) that was trained on TED talks and produced fairly convincing voices, some of which you may recognize (Bill Gates, Stephen Wolfram, and some others; see https://sjvasquez.github.io/blog/melnet/). Definitely not 100% convincing, but I wouldn't call it absolute trash.
The effort will be reduced to typing sentences and selecting emotions from a dropdown menu. That opens up a whole new can of worms for a society that already has trust issues.
Making people doubt untrustworthy things by educating them through a proper process is good. Exposing an unsuspecting, uneducated population to manipulated content designed to deceive them for nefarious purposes is not an ideal process if the goal is for the general public to have better judgement.
This is a global issue. For examples, look into the societal concerns surrounding 'deep fakes' and 'fake news'; they answer your first two questions.
Pitting people against each other in a social game of who can be fooled the easiest sounds like a more effective approach than reading a dry textbook at them. There's shame in being seen as gullible. This happens already in casual conversation, and people already have plenty of opportunities to believe false information from all over the internet, in books, in tabloid newspapers, or told to them by their peers or parents.
I don't believe the concerns about deep fakes and fake news. I think they're no worse than what we've always had. Those fears are exaggerated by arrogant hand-wringers who somehow assume their own beliefs are exempt from manipulation and that it's only intellectually inferior "other" people who are susceptible. They're so fixated on popular political ideas that they forget they should be on a crusade against religion, which is the elephant in the room when it comes to people being misled.
I wouldn’t call the voices unconvincing. They used to have a better demo, though, where it would impersonate celebrities like Trump. It was very interesting to try out.
I am using Descript for a project of mine. You can definitely still hear that the voices are AI-generated, but I'm sure it's not gonna take long. Couple this with SSML (Speech Synthesis Markup Language) and it's going to be quite convincing.
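For anyone who hasn't seen it, SSML is a W3C markup standard for scripting pacing, pitch, and emphasis in synthesized speech; services like Amazon Polly and Google Cloud TTS accept it. A purely illustrative fragment (the text is made up, and exact element support varies by service):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
  <p>
    <s>I never said that.</s>
    <!-- slow down and lower the pitch for a more deliberate delivery -->
    <s><prosody rate="slow" pitch="-10%">And you <emphasis level="strong">know</emphasis> it.</prosody></s>
    <break time="500ms"/>
    <s><prosody volume="soft">Don't call me again.</prosody></s>
  </p>
</speak>
```

That level of control over delivery is exactly why "typing sentences and selecting emotions" is a plausible near-term workflow.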
> They can still have an effect on society for sure, but I believe the concern is out of proportion with the effect.
I believe the contrary: that high-quality, convincing audio (and video) deepfakes will be cataclysmic for society. Voice audio is one of the last digital artifacts that people will believe with their senses.
When Donald Trump’s “grab them by the p—“ audio clip came out, he was at least forced to acknowledge it.
Once high-quality audio deepfake tools become available, amateurs will be able to flood the web with fake audio clips of politicians saying things they never said, and truly unscrupulous politicians will gain technological cover for any damning audio clips of theirs that get leaked.
Once we cannot trust our senses, we will only be able to trust our institutions, and then our tribes. But trust in institutions is already decaying rapidly, so in the end only ideological tribes will be left.
Automating the sending of fake emails allowed any idiot to do it.
There is a difference in scale and effect between every idiot in the world being able to fake as many impersonations of people as they want (and coordinate that impersonation with hundreds of others who share a goal) and Rich Little calling up somebody's granny and claiming to be Jimmy Stewart.
The Jordan Peterson deepfake was quite good. I made a few sound bites to send to some friends as a gag, but they were already aware it was in the wild. The compression is probably the worst part, but it was fun while it was up. https://www.notjordanpeterson.com/
There’s a whole other avenue that’s not covered here: theft of voice actors’ voices. It was covered by a voice actor on YouTube as a warning to other actors. The first video is only two minutes or so; the second is a bit over ten but goes into more detail.
Polish modders of the game Gothic (a very popular RPG in Poland) are using "AI" to reproduce the voices of the Polish voice actors who did the original dub of the game. The quality isn't there yet and the voice clarity is all over the place, but it's already a much better option than using a generic "Microsoft Sam" kind of voice. And when the technology matures, it will allow modders to have in-game characters sound as they "should sound like", instead of being silent or resorting to amateur dubbing as they do now.
It may be a grey zone now, but I find it likely that judges will decide that copyright does cover the sound of someone's voice and other aspects (intonation, accent), as these are vital aspects of the performance and often the reason a voice actor gets hired.
I'd go much further than that. Unique features of someone's voice and appearance (face, walk) should not only be protected by copyright but also by privacy laws. It would be crazy if producers were allowed to make arbitrary videos with anyone's appearance and voice, putting words in your mouth you'd never endorse.
Luckily, this may be one of the cases where politics reacts swiftly, if necessary with new laws, since politicians would likely be among the first victims of the new trend.
They are free to make money during the term of the copyright, but that term should be reduced by a lot. And I don't think the definition of copyright should be extended to include someone's voice, style, appearance, recipes, or others that I'm leaving out.
Is it an extension? (I know technically it is). Voice could be captured as accent/pitch/tone... I guess it’s a problem as there would be too much natural overlap, so not really a novel expression.
Maybe voice would be better suited to trademark protection? After all, it isn't really about accent/pitch/tone, but about being sure that if something sounds like it's voiced by $VoiceActor, it's actually voiced by $VoiceActor (and $VoiceActor gets their cut of royalties/recognition).
There's probably a trademark / misrepresentation / self image rights angle that could be argued, same as with say faces. That wouldn't seem unfair to me. It should require a lot more than just pitch similarity imo.
I've had quite an odd idea this past year as video conferencing has taken off massively both for public figures and private individuals.
...what if you had a malicious video conferencing service that was free to all?
Because it's free, it gains mass adoption; the service can then target a person it is interested in and record all of their audio and video footage.
Then, using this captured data, train a real-time puppet.
You could then intercept any call that person is on and 'take over' and push your own agenda, or call people up and do the same, pretending to be that person.
Something like that would add value to ongoing OTP-style ways of verifying that a person is who they say they are.
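To make that last point concrete: one standard form of such verification is a time-based one-time password (TOTP, RFC 6238, built on HOTP from RFC 4226). A minimal sketch in Python using only the standard library; the secret below is the RFC test key, purely illustrative:

```python
import hashlib
import hmac
import struct
import time

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP (RFC 4226): HMAC-SHA1 over a big-endian counter, dynamically truncated."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                                  # low nibble picks the window
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(secret: bytes, step: int = 30, digits: int = 6) -> str:
    """TOTP (RFC 6238): HOTP with the counter derived from the current time."""
    return hotp(secret, int(time.time()) // step, digits)

# RFC 4226 test vector: counter 0 with this key yields "755224"
print(hotp(b"12345678901234567890", 0))
```

The point is that the shared secret is exchanged out of band (in person, not over the channel an attacker may control), so a real-time voice puppet cannot produce a valid code on demand.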
I have a feeling the text-to-speech market is suppressed. From an end user's perspective, using text-to-speech services such as Google's or Amazon's to turn an eBook into an audiobook costs more than buying the audiobook through a subscription service.
Deepfake audio sounds better than most freely available text-to-speech models.