Online speech recognition with wav2letter anywhere

tingletech · on Jan 14, 2020

All I see is """Sorry, this content isn't available right now The link you followed may have expired, or the page may only be visible to an audience you're not in. Go back to the previous page · Go to News Feed · Visit our Help Center"""

Edit: found a link that works https://github.com/facebookresearch/wav2letter

vineelkpratap · on Jan 14, 2020

Here is the direct link - https://ai.facebook.com/blog/online-speech-recognition-with-...

rgj · on Jan 14, 2020

Same error (from NL)

vineelkpratap · on Jan 14, 2020

Fixed now !

dvduval · on Jan 13, 2020

So by open sourced I assume this means there are absolutely no Facebook dependencies where the voice is passing through a Facebook server? Sorry, have to ask, as my trust level is low. Otherwise, awesome!

notduncansmith · on Jan 13, 2020

The repository (https://github.com/facebookresearch/wav2letter) claims to come with pre-trained models for automated speech recognition.

snissn · on Jan 13, 2020

that's cool! I wonder how it works against podcasts

throwawayhhakdl · on Jan 13, 2020

> Trained models: ...

I think I remember a similar thing happening with previous wav2letter releases.

I would love for a simple tutorial on just using a pretrained model but that feels unlikely to ever happen

snippyhollow · on Jan 13, 2020

The models are here: https://github.com/facebookresearch/wav2letter/tree/master/r... https://github.com/facebookresearch/wav2letter/tree/master/r...

vineelkpratap · on Jan 13, 2020

Yes, nothing goes through Facebook servers. The model will be run locally on the machine.

_pd19 · on Jan 14, 2020

Not only does it mean that, it means that you can look at the source code yourself rather than asking the question on HN

gliese1337 · on Jan 13, 2020

Online speech recognition for English.

The framework should be generalizable, but the models they are making available are only for English. Actually adapting this for any other language would be a huge amount of additional work.

londons_explore · on Jan 13, 2020

> would be a huge amount of additional work.

A small amount of additional work, and a huge amount of money to pay for dataset collection and compute for training.

gliese1337 · on Jan 14, 2020

The dataset collection is going to be work for someone.

moron4hire · on Jan 13, 2020

This is one of the reasons why we're using Azure Cognitive Services. https://docs.microsoft.com/en-us/azure/cognitive-services/sp...

That, and (at least for English) the results are the most accurate I've ever seen.

Jnr · on Jan 13, 2020

How does this compare to Mozilla's DeepSpeech?

And does anyone know when Mozilla will release the updated Common Voice dataset from https://voice.mozilla.org ?

zachruss92 · on Jan 13, 2020

I've been following DeepSpeech for a while. They have a WER IN THE 7% range and wave2letters SOTA model is at around 5%.

I haven't used wav2letter, but I can run DeepSpeech on my (low powered) laptop with faster than real-time transcription with just the CPU.

londons_explore · on Jan 13, 2020

Word error rate depends heavily on dataset.

All modern models get to ~human level when tested on individual phrases or sentences.

None yet get to human level when the source is many paragraphs long, because the human benefits from context and 'getting used to the accent', which ML has so far not achieved.

Nimitz14 · on Jan 13, 2020

It is misleading to use librispeech WER as a general guide to real world WER. Don't do that.

dodobirdlord · on Jan 13, 2020

wav2letter outperforms, in large part because it seems to make better use of more training data. Facebook’s original paper shows that wav2letter outperforms when trained with their very large internal dataset, and they included a reproduction on a smaller open source dataset with worse overall accuracy but again wav2letter outperforming.

jwineinger · on Jan 13, 2020

I'd love a tutorial that shows a normal guy like me how to use this tool with the pre-trianed models to transcribe my audio files. Not finding anything of that kind included there.

vineelkpratap · on Jan 13, 2020

Check out the tutorial here - https://github.com/facebookresearch/wav2letter/wiki/Inferenc...

gok · on Jan 13, 2020

The preprint: https://research.fb.com/wp-content/uploads/2020/01/Scaling-u...

Interestingly, the baselines are all systems that model grapheme forms instead of acoustic (phonemes) directly.

faitswulff · on Jan 13, 2020

Speaking as a Facebook user, I'm a bit confused - where do they use speech recognition? Or is this just purely research oriented?

melling · on Jan 13, 2020

Voice user interfaces are becoming more common. Ignoring this technology is a bad idea. FB has VR devices and their portal device:

https://www.theverge.com/2019/9/18/20870866/facebook-portal-...

Sure, the average HN reader will tell you that they don't see the point, etc. But Amazon and Google have sold hundreds of millions of those little voice devices.

valvar · on Jan 13, 2020

Subtitles for videos is my guess.

CosmicShadow · on Jan 14, 2020

Didn't they "get caught" for having their apps listening all the time and everyone was wondering how they kept getting ads for things they only ever talked about?

yayitswei · on Jan 14, 2020

They could use microphone audio for ad targeting.

There's ongoing speculation around whether they're currently doing this, but it could be an ongoing area of research. I'd imagine efficient on-device transcription would help in this regard.

https://newatlas.com/computers/facebook-not-secretly-listeni...

https://www.wired.com/story/facebooks-listening-smartphone-m...

jauer · on Jan 14, 2020

Portal: https://portal.facebook.com/

yalok · on Jan 13, 2020

better ads targeting based on user-generated audio/video?

isoos · on Jan 13, 2020

I'd be really interested in the accuracy of this tool to solve Google audio captchas. I'm assuming the price of solving captchas will go further down.

londons_explore · on Jan 13, 2020

I wish recaptcha would let me disable audio captchas - I'm pretty sure all the spammers solve them that way.

The volume of real users that want to sign up to my site, and are blind, and don't have a Google account, and clear their cookies frequently enough to get an extra recaptcha challenge, and can't just call support to make an account for them, is probably zero.

Yet the number of spammers that come in that way must be in the millions by now.

Semaphor · on Jan 14, 2020

I’m thankful they don’t. Recaptcha already makes me usually close the tab instead, without audio captcha it would make me close it 100% of the time. I’m not spending 20 minutes hunting for dumb images.

cookie_monsta · on Jan 14, 2020

I'm not a spammer but I always use the audio option. Typing one word is way less onerous than clicking through multiple screens hunting for level crossings, fire hydrants, etc

jonathanleroux · on Jan 14, 2020

If I may insert a relevant plug: we (MERL) just put out a paper last week with SOTA 7.0 % WER on LibriSpeech test-other (vs wav2letter@anywhere's 7.5%) with 590 ms theoretical latency using joint CTC-Transformer with parallel time-delayed LSTM and triggered attention. Check it out: https://arxiv.org/abs/2001.02674

jonathanleroux · on Jan 15, 2020

Correction: no PTDLSTM here but time-restricted self-attention (duh). PTDLSTM was our previous encoder setup published at ASRU.

cproctor · on Jan 13, 2020

I'm about to start as a professor in CS education, and am hoping we're getting close to the point where I can easily transcribe interviews and high-quality dialogue audio using open-sourced models running on machines in my lab. I'm tired of paying $1/minute for human transcription that's not great anyway, and would love to undertake research that would require processing a lot more audio than is affordable on those terms.

I haven't kept up with developments over the last two years--anyone have a sense of whether this is close to being a reality?

(I've taken a bunch of Stanford's graduate AI courses on NLP and speech recognition; I can read documentation and deploy/configure models but don't have much appetite for getting into the weeds.)

kick · on Jan 14, 2020

Earlier this year the Media Lab did an absolutely ginormous automated transcription project. Off the top of my head, it was ~2.8 billion words. 13.1% error rate (vs. ~7% error rate for Google's proprietary solution).

Found the paper on it:

https://arxiv.org/pdf/1907.07073.pdf

woodson · on Jan 14, 2020

Sadly they don’t (well, can’t) release the audio+transcripts as dataset, as they clearly don’t own the rights.

kick · on Jan 14, 2020

They did release it as a dataset. I have a copy of it. It's massive. I'd recommend reading the paper, it has a link to a place where you can download it, and aside from that, it's also fascinating.

woodson · on Jan 14, 2020

Including the audio? I downloaded the transcripts from their bucket, but couldn’t find any information on how to obtain the corresponding recordings. At Interspeech 2019, the author basically told me that they couldn’t share it.

woodson · on Jan 15, 2020

If you have access to the corresponding audio recordings, would you be able to share them?

bgee · on Jan 14, 2020

Just curious: $1/min sounds like quite a bit of money, are you paying for some professional to do this?

If so have you compared with using Mechanical Turk?

flurie · on Jan 14, 2020

That’s minute of recorded audio, and that’s a pretty standard transcription rate. Using anything less than a professional service will show in the quality of the output, and even many services don’t produce high quality transcripts, especially those that use temp (often undergraduate/graduate student) labor.

bgee · on Jan 17, 2020

Minutes of recorded audio make a lot of sense, thanks!

ColanR · on Jan 13, 2020

So what's the efficiency of this model? Can I use it instead of pocketsphinx on a raspberry pi?

sp332 · on Jan 13, 2020

According to https://research.fb.com/wp-content/uploads/2020/01/Scaling-u... the benchmarks were run on "Intel Skylake CPUs with 18 physical cores and 64GB of RAM."

vineelkpratap · on Jan 13, 2020

Yes, in the paper we discuss our benchmarks on Intel CPUs. But, as we mention in the final section, we also made the system work efficiently on IOS and Android, but we haven't open-sourced them in this release. This will be in our future work.

For reference, our system runs at 0.1 RTF on iPhone 10 using Accelerate framework under FP16 precision. INT8 should be better but haven't benchmarked yet!

yorwba · on Jan 13, 2020

The real-time factor for a single audio stream looks like 0.1 (based on eyeballing the graph), so it should be possible to achieve acceptable speeds even with a slower CPU (maybe not a Pi). The memory requirements for the intermediate results are likely to be substantial, though. They say they have "carefully optimized memory use", but don't give any figures.

sp332 · on Jan 13, 2020

They benchmarked several configurations, but I couldn't match up which configuration of models produced which results. I was trying to figure out if you drop the CPU power so throughput just handles one person talking at a normal pace, whether latency would necessarily get crazy high.

fermenflo · on Jan 13, 2020

So no, lol

sp332 · on Jan 13, 2020

They say they're coming out with Android and iOS versions soon, so maybe take a look after that point to see how they've tweaked the models and if the error rates are a lot higher.

vineelkpratap · on Jan 13, 2020

FWIW, we have already working versions in Android, iOS but didn't have time to open-source it with the current release. This is certainly in our future work.

rexreed · on Jan 13, 2020

Given that this uses a beam search decoder to find the most likely word pattern, is it possible small perturbations in audio could cause it to improperly decode certain word strings? Sort of like the audio equivalent of adversarial attacks, but on ASR?

yellow_lead · on Jan 13, 2020

The name must be a nod to Word2Vec[1]. A cool naming scheme IMO.

[1] https://en.m.wikipedia.org/wiki/Word2vec

nmfisher · on Jan 14, 2020

Facebook Research actually have another toolkit called wav2vec that's based on the same principle as word2vec (self-supervised discriminative pretraining).

word2vec learns good feature representations for words by looking at a context window and adjusting weights for the ngram that fits within that window, simultaneously doing so in the opposite direction for randomly sampled ngrams.

wav2vec does something similar, learning good feature representations of audio by distinguishing a "real" sample frame from a randomly sampled frame. I can't remember whether wav2vec operates on the raw waveform, the Fourier transform or MFCCs, but the underlying principle is the same.

These learned feature vectors have been shown to dramatically improve downstream tasks.

CrazyStat · on Jan 14, 2020

___2___ has been a common naming scheme for conversion utilities for a long time.

starpilot · on Jan 14, 2020

Do the pretrained models work decently on landline phone quality recordings? I can see massive value for this if it can transcribe corporate call center audio.

woodson · on Jan 14, 2020

They can’t, because they are trained on more-or-less high-quality recordings of people reading books out loud. Phone conversations are very different, not just in audio quality but in the way people speak.

z3t4 · on Jan 20, 2020

For any project like this, please post exactly the sound configuration used for the model. eg. the rate (Hz), channels, and format.

amluto · on Jan 14, 2020

I wonder if this would be a good engine to plug in to rhasspy.

phkahler · on Jan 13, 2020

Which OSS license?

trakout · on Jan 13, 2020