Cool stuff. I wonder how long it will take until speech synthesis is at the point where it can be used to create dialogue for video games. Obviously it would (at least initially) replace just the less important stuff spoken by background characters or less important NPCs, but even that could be a big improvement.
Just imagine how much more immersive a game like Skyrim would be if the writers could just write the lines and then run it through a speech synthesizer to get finished dialogue. No need to hire multiple actors and get them to a studio to record their lines. It would be so much easier and faster to create a massive amount of unique dialogue and you wouldn't have to listen to the same "arrow to the knee" line spoken by the same couple of actors over and over again everywhere you go.
It would improve user created mods as well since just about anyone could then just create new characters with completely custom voices and dialogue.
Screw Skyrim. Imagine what indie developers could do! It would be particularly amazing if this voice could be generated on the fly. But this is still a ways away.
That's silly... 16-bit audio really doesn't have enough dynamic range for many uses. Listen to a bit of classical music with very loud and very quiet passages and you'll hear quantization hiss in the quiet parts.
16 bits allows a dynamic range of 96 dB, while sounds you might come across in everyday life can briefly peak over 130 dB.
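The 96 dB figure falls straight out of the usual rule of thumb of roughly 6 dB per bit; a quick sketch in Python (the function name is mine, not from any audio library):

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of linear PCM: 20 * log10(2^bits)."""
    return 20 * math.log10(2 ** bits)

print(round(dynamic_range_db(16), 1))  # ~96.3 dB
print(round(dynamic_range_db(24), 1))  # ~144.5 dB
```

So 24-bit recording comfortably covers that 130 dB peak, while 16 bits falls a bit short in theory.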
While the actual analog sounds may indeed have a higher signal-to-noise ratio than 96 dB, most ADCs don't, i.e. for most digital recordings 16 bits should be more than enough.
Since audio is usually sampled at around 44 kHz, which is slow by signal-processing standards, most ADCs can exceed 16 bits of effective resolution by oversampling and averaging the results.
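The averaging trick can be sketched with a toy simulation (the numbers here are made up for illustration: a coarse 8-bit quantizer, ~1 LSB of dither noise, and 4096 reads per output sample):

```python
import random
import statistics

random.seed(0)
true_value = 0.4321        # the analog level we are trying to measure, in [0, 1)
step = 1 / 2 ** 8          # quantizer step of a coarse 8-bit converter

def read_once() -> float:
    """One quantized reading; analog noise of ~1 LSB acts as dither."""
    noisy = true_value + random.uniform(-step, step)
    return round(noisy / step) * step

# A single reading can be off by more than a full quantizer step,
# but averaging many dithered readings lands well between the steps.
single = read_once()
averaged = statistics.mean(read_once() for _ in range(4096))

err_single = abs(single - true_value)
err_avg = abs(averaged - true_value)
```

The dither is essential: averaging 4096 identical, noise-free readings would just return the same quantized value every time.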
You asked about NREC earlier but I failed to set up email alerts correctly (and that post is locked now) so I'm replying here if that's ok.
The hiring process takes around a month or two, depending on how schedules line up. There would be a basic phone/skype screen, then an in-depth technical screen on your areas of expertise, and finally an on-site interview. We try to get back to you within a week of each stage.
We have a bunch of employees who are on H-1Bs right now. Let me know if you have any other questions, and sorry for the late reply!
A perfect illustration of how Chromium's monopoly is killing the open web. The sample audio doesn't play in Firefox, because who cares about anything other than Chrome, right?
To be fair, this looks like a firefox bug rather than google inventing their own 'standards'.
It's a bog-standard HTML5 audio tag with a WAV file. One file is 8 kHz and the other 24 kHz, both standard frequencies. One is Microsoft PCM format, while the other is IEEE float. I suspect the latter is the issue - even though it's been around for decades, I bet it isn't a well-tested codepath in Firefox.
It makes sense they use floats for machine learning outputs. Unless someone specifically thought to quantize the data to a specific bit depth, whatever wav file writing library google used probably thought it was being helpful by using the 'right' encoding.
I'm a Firefox fan myself, but this is on Firefox and nobody else. It should just work on Firefox; I wouldn't actually expect someone to do "testing" on a research article.
Maybe. On the other hand, the audio tag is pretty standard and the format is standard too, and if you look at the W3C support page [1] it looks like Firefox supports WAV files (and it does, just not all of them). And they might just have tested whether the first file plays (or even the second). I would say it's a Firefox "bug" instead; there's no good reason not to support this WAV format.
And I say that as a Firefox user that never betrayed the fox for that shiny metallic look.
Especially since firefox probably ought to be using the underlying OS libraries for media decoding (and on windows, mac, and linux all the major libraries support this format).
Firefox's "we need to invent it ourselves so we aren't at the whim of the platform" approach has bitten them here.
Also, a patch to support this format is probably only 10 lines of code... simply a for loop over every sample, converting to 16-bit PCM.
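The conversion really is about that simple. A sketch in Python rather than Firefox's C++, assuming the samples arrive as IEEE floats in [-1.0, 1.0] (the helper names and the 24 kHz default are my own, not anything from Firefox or Google):

```python
import struct
import wave

def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert IEEE-float samples in [-1.0, 1.0] to little-endian 16-bit PCM."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))             # clamp out-of-range floats
        out += struct.pack("<h", int(s * 32767))
    return bytes(out)

def write_wav16(path: str, samples: list[float], rate: int = 24000) -> None:
    """Write a mono 16-bit PCM WAV that any browser should play."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)       # 2 bytes per sample = 16-bit
        w.setframerate(rate)
        w.writeframes(float_to_pcm16(samples))
```

The clamp matters because float WAVs can legally carry samples slightly outside [-1.0, 1.0], which would overflow a naive scale-and-cast.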
They might not have tested it at all. It's just a blog post after all. Plenty of blog posts get published with broken links or duplicate paragraphs and basically anything else that isn't immediately obvious at a glance. Being published on a Google domain doesn't guarantee that they have a sophisticated testing process to catch these kinds of errors even in Chrome.
Sure, it's silly that Firefox doesn't support 24/32-bit audio. But it's still Googlers assuming, as usual, that the entire world uses Google products and nothing else. If you can write audio code this impressive, then you can also downsample a waveform.
More like all the internal extensions at Google only work on Chrome (most importantly the U2F stuff for authentication, which might get better support elsewhere in the future). That means it's a huge hassle to use a separate browser to even view an internal webpage (as this would be before publishing).
Right or wrong, this isn't really about the Googlers who wrote this page or did the research, but about the IT policy at Google that supports only Chrome. The policy itself isn't necessarily unreasonable either: what would you do given a finite headcount supporting 100k users across the world in all sorts of time zones? I'll also note that IT support at Google is probably the best, smoothest and most on top of their game I've experienced, having worked at a lot of companies including the biggest ones, so they're doing something right.
I wouldn't be surprised if this kind of negative feedback reaches the team and they correct this page, and start systematically testing such media interop issues going forward. But these kinds of blind spots are hard to catch, especially when you're using standards that Firefox claims to support (I'll note that Safari doesn't have any issues with the audio).
Am I reading this correctly, that there's no explicit semantic representation going on at any stage, it's purely audio input frequencies -> ML -> audio output frequencies? If so, that's ... so much for Jerry Fodor and his Language of Thought Hypothesis, eh? (Yeah, mostly joking but still..)
If I'm understanding this right, the actual translation goes from audio frequencies -> ML -> audio frequencies, but the training process for the ML algorithm relies on text transcripts of the speech.
There seem to be lots of questionable translations in their source data - missing emphasis, wrong words, and sometimes stuttering or mistaken vocalisations.
If they could fix that, I think results could be much better.