All I see is """Sorry, this content isn't available right now
The link you followed may have expired, or the page may only be visible to an audience you're not in.
Go back to the previous page · Go to News Feed · Visit our Help Center"""
So by "open sourced" I assume this means there are absolutely no Facebook dependencies, i.e. the voice never passes through a Facebook server? Sorry, have to ask, as my trust level is low. Otherwise, awesome!
The framework should be generalizable, but the models they are making available are only for English. Actually adapting this for any other language would be a huge amount of additional work.
All modern models get to ~human level when tested on individual phrases or sentences.
None yet get to human level when the source is many paragraphs long, because the human benefits from context and 'getting used to the accent', which ML has so far not achieved.
wav2letter outperforms, largely because it seems to make better use of more training data. Facebook's original paper shows it coming out ahead when trained on their very large internal dataset, and they include a reproduction on a smaller open-source dataset with worse overall accuracy where wav2letter still leads.
I'd love a tutorial that shows a normal guy like me how to use this tool with the pre-trained models to transcribe my audio files. I'm not finding anything of that kind included there.
Sure, the average HN reader will tell you that they don't see the point, etc. But Amazon and Google have sold hundreds of millions of those little voice devices.
Didn't they "get caught" having their apps listen all the time, back when everyone was wondering how they kept getting ads for things they had only ever talked about?
There's ongoing speculation about whether they're actually doing this, but it could well be an active area of research. I'd imagine efficient on-device transcription would help in this regard.
I wish recaptcha would let me disable audio captchas - I'm pretty sure all the spammers solve them that way.
The volume of real users that want to sign up to my site, and are blind, and don't have a Google account, and clear their cookies frequently enough to get an extra recaptcha challenge, and can't just call support to make an account for them, is probably zero.
Yet the number of spammers that come in that way must be in the millions by now.
I’m thankful they don’t. Recaptcha already usually makes me close the tab instead; without the audio captcha it would make me close it 100% of the time. I’m not spending 20 minutes hunting for dumb images.
I'm not a spammer, but I always use the audio option. Typing one word is way less onerous than clicking through multiple screens hunting for level crossings, fire hydrants, etc.
If I may insert a relevant plug: we (MERL) just put out a paper last week with a SOTA 7.0% WER on LibriSpeech test-other (vs. wav2letter@anywhere's 7.5%) and 590 ms theoretical latency, using a joint CTC-Transformer with a parallel time-delayed LSTM and triggered attention.
Check it out: https://arxiv.org/abs/2001.02674
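For context, here is a minimal PyTorch sketch of how the CTC half of a joint CTC/attention objective is typically wired up. All shapes and the vocabulary size are illustrative stand-ins, and this is not the MERL system; it omits the triggered-attention and time-delayed-LSTM components entirely.

```python
# Minimal sketch of the CTC part of a joint CTC/attention ASR objective,
# using PyTorch's built-in CTCLoss. Sizes are illustrative only.
import torch
import torch.nn as nn

vocab_size = 32          # e.g. characters + blank (index 0)
T, N, U = 200, 4, 20     # encoder time steps, batch size, target length

# Stand-in for encoder output: per-frame log-probabilities over the vocabulary.
log_probs = torch.randn(T, N, vocab_size).log_softmax(dim=-1)

targets = torch.randint(1, vocab_size, (N, U))        # label indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), U, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss_ctc = ctc(log_probs, targets, input_lengths, target_lengths)

# In a joint model this is interpolated with an attention decoder's
# cross-entropy loss: loss = lambda_ * loss_ctc + (1 - lambda_) * loss_att
print(loss_ctc.item())
```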
I'm about to start as a professor in CS education, and am hoping we're getting close to the point where I can easily transcribe interviews and high-quality dialogue audio using open-sourced models running on machines in my lab. I'm tired of paying $1/minute for human transcription that's not great anyway, and would love to undertake research that would require processing a lot more audio than is affordable on those terms.
I haven't kept up with developments over the last two years--anyone have a sense of whether this is close to being a reality?
(I've taken a bunch of Stanford's graduate AI courses on NLP and speech recognition; I can read documentation and deploy/configure models but don't have much appetite for getting into the weeds.)
Earlier this year the Media Lab did an absolutely ginormous automated transcription project. Off the top of my head, it was ~2.8 billion words. 13.1% error rate (vs. ~7% error rate for Google's proprietary solution).
They did release it as a dataset. I have a copy of it. It's massive. I'd recommend reading the paper; it has a link to where you can download it, and aside from that, it's also fascinating.
Including the audio? I downloaded the transcripts from their bucket, but couldn’t find any information on how to obtain the corresponding recordings. At Interspeech 2019, the author basically told me that they couldn’t share it.
That’s per minute of recorded audio, and that’s a pretty standard transcription rate. Using anything less than a professional service will show in the quality of the output, and even many services don’t produce high-quality transcripts, especially those that use temp (often undergraduate/graduate student) labor.
Yes, in the paper we discuss our benchmarks on Intel CPUs. But, as we mention in the final section, we also made the system work efficiently on iOS and Android; we just haven't open-sourced those in this release. That will be part of our future work.
For reference, our system runs at 0.1 RTF on iPhone 10 using the Accelerate framework at FP16 precision. INT8 should be better, but we haven't benchmarked it yet!
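To put that figure in perspective: real-time factor is processing time divided by audio duration, so RTF 0.1 means a 3-second utterance needs roughly 0.3 s of compute. A tiny back-of-the-envelope sketch (my own illustrative numbers, not from the paper):

```python
# Real-time factor (RTF) = processing time / audio duration.
def processing_time(audio_seconds: float, rtf: float) -> float:
    return audio_seconds * rtf

# At RTF 0.1, a 3-second utterance takes ~0.3 s of compute,
# so the latency added on top of any streaming lookahead stays small.
print(processing_time(3.0, 0.1))   # 0.3

# On a hypothetical CPU that is 5x slower, RTF scales to ~0.5: still
# faster than real time for a single stream, but with far less headroom.
print(processing_time(3.0, 0.5))   # 1.5
```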
The real-time factor for a single audio stream looks like 0.1 (based on eyeballing the graph), so it should be possible to achieve acceptable speeds even with a slower CPU (maybe not a Pi). The memory requirements for the intermediate results are likely to be substantial, though. They say they have "carefully optimized memory use", but don't give any figures.
They benchmarked several configurations, but I couldn't match up which configuration of models produced which results. I was trying to figure out whether, if you drop the CPU power so that throughput just keeps up with one person talking at a normal pace, latency would necessarily get crazy high.
They say they're coming out with Android and iOS versions soon, so maybe take a look after that point to see how they've tweaked the models and if the error rates are a lot higher.
FWIW, we already have working versions on Android and iOS but didn't have time to open-source them with the current release. This is certainly part of our future work.
Given that this uses a beam search decoder to find the most likely word pattern, is it possible small perturbations in audio could cause it to improperly decode certain word strings? Sort of like the audio equivalent of adversarial attacks, but on ASR?
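It's known to be possible in principle. Here is a hedged sketch of the targeted-attack idea against a toy CTC-style model (not wav2letter itself), roughly in the spirit of Carlini and Wagner's audio adversarial examples: optimize a small perturbation so the model's decoder prefers a transcript the attacker chooses.

```python
# Sketch of a targeted adversarial perturbation against a toy CTC model.
# The "model" is a stand-in; shapes and budgets are illustrative only.
import torch
import torch.nn as nn

vocab_size, T, feat = 29, 100, 80

# Toy acoustic model: per-frame scores over characters.
model = nn.Sequential(nn.Linear(feat, 128), nn.ReLU(), nn.Linear(128, vocab_size))

features = torch.randn(T, feat)                 # stand-in for audio features
target = torch.randint(1, vocab_size, (1, 10))  # transcript the attacker wants
ctc = nn.CTCLoss(blank=0)

delta = torch.zeros_like(features, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
eps = 0.05                                      # perturbation budget

for _ in range(100):
    log_probs = model(features + delta).log_softmax(-1).unsqueeze(1)  # (T, 1, C)
    loss = ctc(log_probs, target,
               torch.tensor([T]), torch.tensor([target.shape[1]]))
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                 # keep the perturbation small

# A beam-search decoder run on (features + delta) may now prefer the
# attacker's transcript even though the input is barely changed.
```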
Facebook Research actually has another toolkit called wav2vec that's based on the same principle as word2vec (self-supervised discriminative pretraining).
word2vec learns good feature representations for words by looking at a context window and adjusting weights toward the words that fit within that window, while simultaneously doing the opposite for randomly sampled negative words.
wav2vec does something similar, learning good feature representations of audio by distinguishing a "real" sample frame from a randomly sampled frame. I can't remember whether wav2vec operates on the raw waveform, the Fourier transform or MFCCs, but the underlying principle is the same.
These learned feature vectors have been shown to dramatically improve downstream tasks.
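A rough sketch of that contrastive idea, with made-up sizes and a simplified objective (not wav2vec's exact loss): score the true future frame higher than randomly sampled ones and train with an InfoNCE-style cross-entropy.

```python
# Contrastive pretraining sketch: distinguish the real next frame from negatives.
import torch
import torch.nn.functional as F

batch, dim, n_negatives = 8, 256, 10

context = torch.randn(batch, dim)                   # encoder output at time t
positive = torch.randn(batch, dim)                  # true frame at time t+k
negatives = torch.randn(batch, n_negatives, dim)    # randomly sampled frames

# Similarity of the context with the positive and with each negative.
pos_score = (context * positive).sum(-1, keepdim=True)                 # (batch, 1)
neg_score = torch.bmm(negatives, context.unsqueeze(-1)).squeeze(-1)    # (batch, n_neg)

logits = torch.cat([pos_score, neg_score], dim=1)
labels = torch.zeros(batch, dtype=torch.long)       # the positive sits at index 0

# Cross-entropy over (positive vs. negatives) = InfoNCE-style contrastive loss.
loss = F.cross_entropy(logits, labels)
print(loss.item())
```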
Do the pretrained models work decently on landline phone quality recordings? I can see massive value for this if it can transcribe corporate call center audio.
They can’t, because they are trained on more-or-less high-quality recordings of people reading books out loud. Phone conversations are very different, not just in audio quality but in the way people speak.
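If you want a quick sanity check before committing, one cheap approach is to band-limit some of your own recordings to telephone bandwidth (8 kHz sampling, roughly a 300-3400 Hz passband) and run the pretrained model on that. A sketch using torchaudio follows; the file names are placeholders, and this only simulates the channel, not the very different conversational speaking style.

```python
# Simulate narrowband "landline" audio from a clean 16 kHz recording.
import torchaudio
from torchaudio import transforms, functional as AF

waveform, sr = torchaudio.load("clean_16khz.wav")   # placeholder input file

# Downsample to 8 kHz and back to the 16 kHz most ASR models expect,
# so only the narrowband content survives.
narrow = transforms.Resample(sr, 8000)(waveform)
narrow = transforms.Resample(8000, 16000)(narrow)

# Crude telephone-style band-pass.
narrow = AF.highpass_biquad(narrow, 16000, cutoff_freq=300.0)
narrow = AF.lowpass_biquad(narrow, 16000, cutoff_freq=3400.0)

torchaudio.save("telephone_sim_16khz.wav", narrow, 16000)
```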
Edit: found a link that works: https://github.com/facebookresearch/wav2letter