Cool stuff. I wonder how long it will take until speech synthesis is at the point where it can be used to create dialogue for video games. Obviously it would (at least initially) replace just the less important stuff spoken by background characters or less important NPCs, but even that could be a big improvement.
Just imagine how much more immersive a game like Skyrim would be if the writers could just write the lines and then run it through a speech synthesizer to get finished dialogue. No need to hire multiple actors and get them to a studio to record their lines. It would be so much easier and faster to create a massive amount of unique dialogue and you wouldn't have to listen to the same "arrow to the knee" line spoken by the same couple of actors over and over again everywhere you go.
It would improve user created mods as well since just about anyone could then just create new characters with completely custom voices and dialogue.
Screw Skyrim. Imagine what indie developers could do! It would be particularly amazing if this voice could be generated on the fly. But this is still a ways away.
That's silly... 16-bit audio really doesn't have enough dynamic range for many uses. Listen to a bit of classical music with very loud and very quiet passages and you'll hear quantization hiss in the quiet parts.
16 bits allows a dynamic range of 96 dB, while sounds you might come across in everyday life can briefly peak over 130 dB.
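The 96 dB figure falls straight out of the usual rule of thumb of roughly 6 dB per bit; a quick sketch in Python (the function name is mine, not from any audio library):

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of linear PCM: 20 * log10(2^bits)."""
    return 20 * math.log10(2 ** bits)

print(round(dynamic_range_db(16), 1))  # ~96.3 dB
print(round(dynamic_range_db(24), 1))  # ~144.5 dB
```

So 24-bit recording comfortably covers that 130 dB peak, while 16 bits falls a bit short in theory.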
While the actual analog sounds may indeed have a higher signal-to-noise ratio than 96 dB, most ADCs don't, i.e. for most digital recordings 16 bits should be more than enough.
Since audio is usually sampled at around 44 kHz, which is slow by signal-processing standards, most ADCs can exceed 16 bits of effective resolution by oversampling and averaging the results.
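The averaging trick can be sketched with a toy simulation (the numbers here are made up for illustration: a coarse 8-bit quantizer, ~1 LSB of dither noise, and 4096 reads per output sample):

```python
import random
import statistics

random.seed(0)
true_value = 0.4321        # the analog level we are trying to measure, in [0, 1)
step = 1 / 2 ** 8          # quantizer step of a coarse 8-bit converter

def read_once() -> float:
    """One quantized reading; analog noise of ~1 LSB acts as dither."""
    noisy = true_value + random.uniform(-step, step)
    return round(noisy / step) * step

# A single reading can be off by more than a full quantizer step,
# but averaging many dithered readings lands well between the steps.
single = read_once()
averaged = statistics.mean(read_once() for _ in range(4096))

err_single = abs(single - true_value)
err_avg = abs(averaged - true_value)
```

The dither is essential: averaging 4096 identical, noise-free readings would just return the same quantized value every time.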
You asked about NREC earlier but I failed to set up email alerts correctly (and that post is locked now) so I'm replying here if that's ok.
The hiring process takes around a month or two, depending on how schedules line up. There would be a basic phone/skype screen, then an in-depth technical screen on your areas of expertise, and finally an on-site interview. We try to get back to you within a week of each stage.
We have a bunch of employees who are on H-1Bs right now. Let me know if you have any other questions, and sorry for the late reply!
A perfect illustration of how Chromium's monopoly is killing the open web. The sample audio doesn't play in Firefox, because who cares about anything other than Chrome, right?
To be fair, this looks like a firefox bug rather than google inventing their own 'standards'.
It's a bog-standard HTML5 audio tag with a WAV file. One file is 8 kHz and the other 24 kHz, both standard frequencies. One is Microsoft PCM format, while the other is IEEE float. I suspect the latter is the issue - even though it's been around for decades, I bet it isn't a well-tested codepath in Firefox.
It makes sense they use floats for machine learning outputs. Unless someone specifically thought to quantize the data to a specific bit depth, whatever wav file writing library google used probably thought it was being helpful by using the 'right' encoding.
I'm a Firefox fan myself, but this is on Firefox and nobody else. It should just work on Firefox; I wouldn't actually expect someone to do "testing" on a research article.
Maybe. On the other hand, the audio tag is pretty standard and the format is standard too, and if you look at the W3C support page [1] it looks like Firefox supports WAV files (and it does, just not all of them). And they might just have tested whether the first file plays (or even the second). I would say it's a Firefox "bug" instead; there's no good reason not to support this WAV format.
And I say that as a Firefox user that never betrayed the fox for that shiny metallic look.
Especially since firefox probably ought to be using the underlying OS libraries for media decoding (and on windows, mac, and linux all the major libraries support this format).
Firefox's "we need to invent it ourselves so we aren't at the whim of the platform" approach has bitten them here.
Also, a patch to support this format is probably only 10 lines of code... simply a for loop over every sample, converting to 16-bit PCM.
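The conversion really is about that simple. A sketch in Python rather than Firefox's C++, assuming the samples arrive as IEEE floats in [-1.0, 1.0] (the helper names and the 24 kHz default are my own, not anything from Firefox or Google):

```python
import struct
import wave

def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert IEEE-float samples in [-1.0, 1.0] to little-endian 16-bit PCM."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))             # clamp out-of-range floats
        out += struct.pack("<h", int(s * 32767))
    return bytes(out)

def write_wav16(path: str, samples: list[float], rate: int = 24000) -> None:
    """Write a mono 16-bit PCM WAV that any browser should play."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)       # 2 bytes per sample = 16-bit
        w.setframerate(rate)
        w.writeframes(float_to_pcm16(samples))
```

The clamp matters because float WAVs can legally carry samples slightly outside [-1.0, 1.0], which would overflow a naive scale-and-cast.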
They might not have tested it at all. It's just a blog post after all. Plenty of blog posts get published with broken links or duplicate paragraphs and basically anything else that isn't immediately obvious at a glance. Being published on a Google domain doesn't guarantee that they have a sophisticated testing process to catch these kinds of errors even in Chrome.
Sure, it's silly that Firefox doesn't support 24/32-bit audio. But it's still Googlers assuming, as usual, that the entire world uses Google products and nothing else. If you can write audio code this impressive, then you can also downsample a waveform.
More like all the internal extensions at Google only work on Chrome (most importantly the U2F stuff for authentication, which might get better support elsewhere in the future). That means it's a huge hassle to use a separate browser to even view an internal webpage (as this would be before publishing).
Right or wrong, this isn't really about the Googlers who wrote this page or did the research, but about the IT policy at Google that supports only Chrome. The policy itself isn't necessarily unreasonable either: what would you do given a finite headcount supporting 100k users across the world in all sorts of time zones? I'll also note that IT support at Google is probably the best, smoothest and most on top of their game I've experienced, having worked at a lot of companies including the biggest ones, so they're doing something right.
I wouldn't be surprised if this kind of negative feedback reaches the team and they correct this page, and start systematically testing such media interop issues going forward. But these kinds of blind spots are hard to catch, especially when you're using standards that Firefox claims to support (I'll note that Safari doesn't have any issues with the audio).
Am I reading this correctly, that there's no explicit semantic representation going on at any stage, it's purely audio input frequencies -> ML -> audio output frequencies? If so, that's ... so much for Jerry Fodor and his Language of Thought Hypothesis, eh? (Yeah, mostly joking but still..)
If I'm understanding this right, the actual translation goes from audio frequencies -> ML -> audio frequencies, but the training process for the ML algorithm relies on text transcripts of the speech.
There seem to be lots of questionable translations in their source data - missing emphasis, wrong words, and sometimes stuttering or mistaken vocalisations.
If they could fix that, I think results could be much better.