"Because of this, at this time we will not be releasing our research, model or datasets publicly."
It seems like people are designing research toward this very conclusion, manufacturing their own controversy. It rings hollow, though, especially presented this way on Medium, in the first-person-plural voice of a corporation.
Ethical discussions in machine learning technology presentations are becoming trite and self-congratulatory ("we've made an AI so good it merits discussion of the ethical implications"), especially when any discussion of actual applications is missing.
> Because of this, at this time we will not be releasing our research, model or datasets publicly.
Has OpenAI's handling of GPT-2 inadvertently provided political cover for commercial organizations that would love to claim they engage with the ML research community but would actually prefer to contribute nothing beyond Medium articles?
So I've seen several of these ML speech-synthesis projects now; when am I going to be able to use one of them for a screen reader? I'd like to listen to wiki articles with a synthesizer built on modern methods, not Microsoft Sam.
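(For what it's worth, the basic plumbing for this already exists; what lags is voice quality. A minimal sketch using the `wikipedia` and `pyttsx3` Python packages, where pyttsx3 just wraps the operating system's built-in voice rather than a neural model, and the article title is only an example:)

    # Rough sketch: read a Wikipedia summary aloud with an offline TTS engine.
    # pyttsx3 wraps the OS speech engine (SAPI5 / NSSpeechSynthesizer / espeak),
    # so the voice is the system default, not a Tacotron-class neural voice.
    import wikipedia
    import pyttsx3

    text = wikipedia.summary("Speech synthesis", sentences=5)

    engine = pyttsx3.init()
    engine.setProperty("rate", 180)  # speaking rate, in words per minute
    engine.say(text)
    engine.runAndWait()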
> have produced the most realistic AI simulation of a voice we’ve heard to date.
No, you didn't. Please don't lie.
There are a number of projects replicating Google's Tacotron 2 research from December 2017, which achieved human parity in text-to-speech as measured by MOS score. Google then successfully deployed its Tacotron 2 model in a service called Duplex.
Following up on this research, a number of open source and commercial projects have built on Google's Tacotron 2 human-parity TTS work.
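(For reference, MOS, the Mean Opinion Score, is just the average of listener naturalness ratings on a 1-5 scale, usually reported with a confidence interval. A minimal sketch; the ratings below are made up purely for illustration:)

    # Mean Opinion Score: average of 1-5 listener ratings, with a
    # normal-approximation 95% confidence interval.
    import numpy as np

    def mean_opinion_score(ratings, z=1.96):
        r = np.asarray(ratings, dtype=float)
        mos = r.mean()
        half_width = z * r.std(ddof=1) / np.sqrt(len(r))
        return mos, half_width

    # made-up ratings for one synthesized utterance
    mos, ci = mean_opinion_score([4, 5, 4, 4, 5, 3, 4, 5])
    print(f"MOS = {mos:.2f} +/- {ci:.2f}")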
> he didn’t actually endorse our work like this, it’s a clip from the video the team created featuring their work. Video and more after the jump!
It's absolutely irresponsible, if not illegal, to clone a person's voice without consent. Using Joe Rogan's likeness for your publicity stunt without his consent is unethical in itself. It's Joe Rogan's legal right to control his own likeness.
Furthermore, this exposes Joe Rogan to a number of safety risks, including the possibility of identity fraud.
Finally, at this time, this technology only reaches human parity when tested on phrases and sentences similar to those in the training set. Google's Tacotron 2 models showed a significant decrease in performance when reading 37 news headlines. The authors noted in their evaluation:
> This result points to a challenge for end-to-end approaches – they require training on data that cover intended usage.
There's a telltale discontinuity in the rise of the voice on the word "chimps" in the sentence "and these chimps have been working out hard". I wonder if future generations of kids will have to be trained to spot such things.
We've already got a world where anything slightly embarrassing or regrettable you do is likely to be recorded and uploaded to YouTube. Maybe once there are tools that can perfectly fake a video and its corresponding audio, we'll be free again.
Remember in Fahrenheit 451 where Guy Montag's partner is glued to the Wall? A screen where she participates in her favorite shows and the audience and cast members talk directly to her?
Read in a modern context, I don't think Guy was merely burning books in service of an authoritarian government. I think he kept the books because he was becoming an outsider. He didn't want to participate in the world of the Wall or consent to the expectations and norms of his society. And the firemen were there to ensure everyone participated.
Impressive (really), but it raises a more philosophical question (as in practical ethics): do we really want voice-bots to blend in perfectly, or would it be better for them to carry distinctive marks (like a rather monotonous delivery)?
[Edit: This is not so much about DeepFakes, as discussed in the article, but more about a general level of implementation.]
I suppose we'd want bots to be distinguishable (by tone, etc.). Where's the practical value in not being able to discern an algorithmic speaker, e.g., on the phone? There's probably some value in being able to do so, regarding liabilities and so on. (A contract arises from an agreement of intents. We may not be sure whether such an agreement has actually been reached, or whether we were just witnessing a behavioral pattern triggered by a Markov chain. We may also question the nature of the intent, or whose intent it actually is.)
I would argue that while yes, I want to be "in on it" as far as knowing the realness of the speaker, I also want AIs that do sound natural. Even the Google thing that handles phone reservations or whatever: so long as I'm aware I'm listening to an AI, we're good.
If I want a Dan Rather news reader application that parses text and says it to me (and obviously, Dan Rather is okay with it), I see no issue with that. I don't want to be distracted by the artificial tone and attempting to parse it on my end.
So, on one hand, I see people hyping half-baked AI through the roof: cherry-picking good examples, refusing to study and discuss its limitations, and even outright dismissing the idea that AI failures are, in fact, failures rather than some kind of "different way of thinking".
On the other hand, I see the same crowd engaging in ridiculous alarmism that's not grounded in reality. They place technologies in far-fetched scenarios, completely ignoring that the same scenarios can already be enacted without AI. The conclusion is always that the technology needs to be kept out of the hands of the public.
Someone is drinking too much of their own Kool-Aid. But regardless of how much they believe in what they're posting, this behavior is disgusting and unethical.
---
>Here are some examples of what might happen if the technology got into the wrong hands
Since when do we start a discussion with the assumption that a piece of software will be restricted in distribution? Software tends to get into the hands of everyone who wants it.
>Spam callers impersonating your mother or spouse to obtain personal information
News flash: this is already happening without AI. All you need is a bad phone connection and someone who sounds vaguely like the person being impersonated.
Moreover, it's already trivial to change the pitch of your voice in real time. With some simple audio engineering, you can alter timbre as well (e.g. filtering, equalization). If that's such a big deal, why is no one using this already? It's way, way, way easier than collecting lots of voice samples and training a model.
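(To illustrate how low the bar is, here's a rough offline sketch of that kind of manipulation with librosa and scipy; real-time versions exist as off-the-shelf voice changers, and "voice.wav" is just a placeholder file name:)

    # Sketch: shift pitch and crudely alter timbre of a recorded voice.
    import librosa
    import soundfile as sf
    from scipy.signal import butter, lfilter

    y, sr = librosa.load("voice.wav", sr=None)

    # Shift pitch up by 3 semitones.
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=3)

    # Crude timbre change: 4th-order low-pass filter to darken the voice.
    b, a = butter(4, 3000 / (sr / 2), btype="low")
    darkened = lfilter(b, a, shifted)

    sf.write("altered.wav", darkened, sr)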
>Impersonating someone for the purposes of bullying or harassment
Why would someone need to impersonate someone else for bullying or harassment? Bullying or harassment seems to work pretty well as is.
>Gaining entrance to high security clearance areas by impersonating a government official
If someone can get access to a place simply by using voice coming from computer speakers, it's clearly not a "high security clearance area".
>An ‘audio deepfake’ of a politician being used to manipulate election results or cause a social uprising
Media organizations already do this every day, in plain sight, via selective editing.
i look forward to seeing this applied to audio books. it could bring their price down drastically, maybe even to zero. i might just buy the text version and have a program generate the voice as it reads the book.
there are tools that do that now, but i find i can't listen to the current quality of computer-generated voices for more than a few minutes.
with an almost human-like voice, i don't think i'll care about the occasional glitch that reminds me it's a generated voice, as long as it sounds fine otherwise.
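(the plumbing for that exists today; a rough sketch rendering a plain-text book to audio with pyttsx3, an offline wrapper around the OS voice rather than a neural one, where "book.txt" is a placeholder:)

    # sketch: render a plain-text book to a wav file with the system voice.
    import pyttsx3

    engine = pyttsx3.init()
    engine.setProperty("rate", 180)  # words per minute

    with open("book.txt", encoding="utf-8") as f:
        text = f.read()

    engine.save_to_file(text, "book.wav")  # write audio to file, not speakers
    engine.runAndWait()

swapping in a better voice model later would only change the rendering step, not the workflow.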
Sounds good in terms of lack of pauses between words and his general voice, but you can tell something is off due to the cadence and at times it seems like words trail off into breathlessness.
Hot damn, as someone who has listened to a few JRE episodes, it is... quite good. What really struck me was its ability to recreate words that (I'm guessing) were never said by him in the past. Impressive!