
It's always been a puzzle to me that published WER is so low, and yet when I use dictation (which is a lot--I use it for almost all text messages) I always have to make numerous corrections.

This sentence is crucial and suggests a way to understand how WER can be apparently lower than human levels, yet ASR is still obviously imperfect:

> When comparing models to humans, it’s important to check the nature of the mistakes and not just look at the WER as a conclusive number. In my own experience, human transcribers tend to make fewer and less drastic semantic errors than speech recognizers.

This suggests that humans make mistakes specifically on semantically unimportant words, while computers make mistakes more uniformly. That is, humans are able to allocate resources to correct word identification for the important words, with fewer resources going to the less important ones. So maybe the way to improve speech recognition is not to focus on WER, but on WER weighted by word importance, or to train speech recognition systems end-to-end on some downstream language task so that the DNN or whatever learns to recognize the words that matter for that task.
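
Roughly what I'm imagining, as a toy sketch in Python (the stopword list and weights are invented purely for illustration; a real system might derive them from IDF or from the downstream task, and a production metric would use a proper Levenshtein alignment rather than difflib):

    from difflib import SequenceMatcher

    STOPWORDS = {"a", "an", "the", "to", "of", "and", "is", "it", "uh", "um"}

    def weight(word):
        # Invented importance scheme: function/filler words count 0.2, everything else 1.0.
        return 0.2 if word.lower() in STOPWORDS else 1.0

    def weighted_wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        errors = 0.0
        for op, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=hyp).get_opcodes():
            if op == "equal":
                continue
            # Substitutions and deletions are charged by the importance of the
            # reference words that were missed; insertions by the extra words.
            errors += sum(weight(w) for w in ref[i1:i2])
            if op == "insert":
                errors += sum(weight(w) for w in hyp[j1:j2])
        return errors / (sum(weight(w) for w in ref) or 1.0)

    # "alice" -> "alex" hurts the score much more than dropping "the".
    print(weighted_wer("send the report to alice", "send report to alex"))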



The low WER numbers you've probably seen are for conversational telephone speech with constrained subjects. ASR is much harder when the audio source is farther away from the microphone and when the topic isn't constrained.


Very true, which is why all the home assistants (Google Home, Amazon Echo, etc.) use array mics and beamforming: they get a much cleaner speech signal from far-field audio, and better WER as a result.
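
For anyone curious what beamforming actually buys you, here's a minimal delay-and-sum sketch in numpy. It assumes a uniform linear array and a known direction of arrival; the shipping devices use adaptive beamformers plus echo cancellation on top of this, so treat it as the textbook baseline only:

    import numpy as np

    def delay_and_sum(mics, fs, spacing=0.05, angle_deg=30.0, c=343.0):
        """mics: (n_mics, n_samples) array; uniform linear array, plane-wave source."""
        n_mics, n_samples = mics.shape
        # Per-mic arrival-time offset for a plane wave from angle_deg off broadside.
        delays = np.arange(n_mics) * spacing * np.sin(np.deg2rad(angle_deg)) / c
        freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
        out = np.zeros(freqs.shape, dtype=complex)
        for m in range(n_mics):
            # Undo each mic's delay with a phase shift, then average (the "sum").
            out += np.fft.rfft(mics[m]) * np.exp(2j * np.pi * freqs * delays[m])
        return np.fft.irfft(out / n_mics, n=n_samples)

    # Toy usage: 4 mics at 16 kHz; real input would be the recorded channels.
    fs = 16000
    channels = np.random.randn(4, fs)
    enhanced = delay_and_sum(channels, fs)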


Exactly. This is also why Google sponsors the CHiME challenge, the existence of which is more proof that ASR is pretty far from solved.

http://spandh.dcs.shef.ac.uk/chime_challenge/


Good stuff. I've looked into cheap array mics with Linux drivers a few times in the past, but not much is available.

On speech separation, Mitsubishi Research has done some pretty impressive work: http://www.merl.com/demos/deep-clustering. I haven't seen equivalents of that in open-source ASR.


This is the most useful hardware info I've seen recently:

https://medium.com/snips-ai/benchmarking-microphone-arrays-r...

https://developer.amazon.com/alexa-voice-service/dev-kits/co...

Amazon's 7-mic hardware has its own OEM program.


Totally. You may want to take a look at the papers from CHiME4 for more along those lines:

http://spandh.dcs.shef.ac.uk/chime_challenge/chime2016/resul...

I'm really fascinated by the whole idea of blind source separation and the fact that speech signals are "sparse" in frequency space.

We've had a similar experience looking for hardware / open source beamforming. There's a package called beamformit, but I think it's pretty old.
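
On the sparsity point: a toy way to see why it matters is the ideal binary mask. If you keep each time-frequency bin for whichever source dominates it, a two-source mixture separates surprisingly cleanly. The sketch below uses synthetic signals and an oracle mask; real separators (deep clustering and friends) have to predict the mask from the mixture alone:

    import numpy as np
    from scipy.signal import stft, istft

    fs = 16000
    t = np.arange(2 * fs) / fs
    # Synthetic stand-ins for two talkers (real speech would make the point better).
    s1 = np.sin(2 * np.pi * 220 * t) * (1.0 + 0.5 * np.sin(2 * np.pi * 3 * t))
    s2 = 0.5 * np.sign(np.sin(2 * np.pi * 150 * t))
    mix = s1 + s2

    _, _, S1 = stft(s1, fs, nperseg=512)
    _, _, S2 = stft(s2, fs, nperseg=512)
    _, _, M = stft(mix, fs, nperseg=512)

    # Ideal binary mask: assign each time-frequency bin to whichever source is
    # louder there. Because speech is sparse in this domain, most bins are
    # dominated by one talker, so a hard 0/1 mask already separates well.
    mask = np.abs(S1) > np.abs(S2)
    _, est1 = istft(M * mask, fs, nperseg=512)
    _, est2 = istft(M * ~mask, fs, nperseg=512)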


This is, to me, one of the major problems with many algorithmic solutions: an x% improvement in precision, F-measure, or any other score in no way means the results are actually better.

I've repeatedly seen improvements to traditional measures that make the subjective result worse.

It's incredibly hard to measure and solve (if anyone has good ideas, please let me know). I check a lot of sample data manually when we make changes; doing that, with targeting of the important cases, is really the only way I know to do it.


If you've got a dictation system on a phone, wouldn't a very good metric be the corrections people make after dictating?

I guess a problem would be if people become so used to errors that they send messages without corrections. I have some friends who do this: they send garbled messages that I have to read out loud to understand. But there will always be a subset of people who want to get it right.
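
If you can log both the raw ASR output and the message the user eventually sent, the metric more or less writes itself: treat the sent message as the reference. A rough sketch (the logging plumbing is hand-waved):

    from difflib import SequenceMatcher

    def correction_rate(asr_output, sent_message):
        # Treat the message the user actually sent as the reference transcript.
        ref, hyp = sent_message.split(), asr_output.split()
        matched = sum(size for _, _, size in
                      SequenceMatcher(a=ref, b=hyp).get_matching_blocks())
        # Fraction of the final message's words the user had to fix or add.
        return 1.0 - matched / max(len(ref), 1)

    print(correction_rate("lets meat at too", "let's meet at two"))  # 0.75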


Yes, exactly, the raw number of word errors is a very simplistic way to judge the accuracy of a transcription. Which words were incorrect and to what degree the failures changed the meaning are ultimately far more important. And while the test described in the article is a useful way to compare progress over time, it is clearly not nearly broad enough to cover the full range of scenarios humans will rightly expect automated speech recognition that "works" to be able to handle.


I agree. I think a big part of the reason that people use WER is that it's relatively unambiguous and easy to measure.
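
Right, it's basically just word-level edit distance divided by the number of reference words, e.g.:

    # Standard WER: word-level Levenshtein distance (substitutions + deletions
    # + insertions) divided by the number of reference words.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167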



