It's always been a puzzle to me that published WER is so low, and yet when I use dictation (which is a lot--I use it for almost all text messages) I always have to make numerous corrections.
This sentence is crucial and suggests a way to understand how WER can be apparently lower than human levels, yet ASR is still obviously imperfect:
> When comparing models to humans, it’s important to check the nature of the mistakes and not just look at the WER as a conclusive number. In my own experience, human transcribers tend to make fewer and less drastic semantic errors than speech recognizers.
This suggests that humans make mistakes mostly on semantically unimportant words, while computers make mistakes more uniformly. That is, humans are able to allocate effort to getting the important words right, with fewer resources going to the less important ones. So maybe the way to improve speech recognition is not to focus on raw WER, but on WER weighted by word importance, or to train speech recognition systems end-to-end on some downstream language task so that the DNN or whatever learns to recognize the words that matter for the task.
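For what it's worth, the weighting idea is easy to prototype: run the usual Levenshtein alignment for WER, but charge each error the "importance" of the word involved instead of a flat cost of 1. A rough sketch (the stopword-based importance function is just a stand-in, not anything from the article):

```python
def weighted_wer(ref, hyp, weight):
    """Levenshtein-style alignment where each error costs the 'importance'
    of the word involved, rather than a flat 1.

    ref, hyp: lists of words; weight(word) -> float (assumed importance function).
    Returns total weighted error divided by total reference weight.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum weighted cost of aligning ref[:i] with hyp[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + weight(ref[i - 1])           # deletion
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + weight(hyp[j - 1])           # insertion
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if ref[i - 1] == hyp[j - 1] else weight(ref[i - 1])
            dp[i][j] = min(dp[i - 1][j - 1] + sub,              # match / substitution
                           dp[i - 1][j] + weight(ref[i - 1]),   # deletion
                           dp[i][j - 1] + weight(hyp[j - 1]))   # insertion
    total = sum(weight(w) for w in ref) or 1.0
    return dp[n][m] / total

# Toy importance function: content words matter more than function words.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "it", "i"}

def importance(w):
    return 0.2 if w.lower() in STOPWORDS else 1.0

print(weighted_wer("send the report to alice".split(),
                   "send a report to alex".split(), importance))
```

On that example the plain WER is 0.4 (two errors out of five words), but the weighted score is lower because one of the errors is on a throwaway function word.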
The low WER numbers you've probably seen are for conversational telephone speech with constrained subjects. ASR is much harder when the audio source is farther away from the microphone and when the topic isn't constrained.
Very true, which is why all the home assistants (Google Home, Amazon Echo, etc.) use array mics and beamforming: they get a much cleaner speech signal from far-field audio, and better WER as a result.
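The simplest version of this is delay-and-sum beamforming: estimate how much earlier or later the target sound reaches each mic, time-shift the channels so they line up, and average, so the target adds coherently while off-axis noise partially cancels. A rough numpy sketch for a linear array, with the geometry and sign conventions simplified:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, angle_deg, fs, c=343.0):
    """Delay-and-sum beamformer for a linear mic array.

    signals: (n_mics, n_samples) array of time-domain signals.
    mic_positions: positions along the array axis in metres.
    angle_deg: steering angle relative to broadside, in degrees.
    fs: sample rate in Hz; c: speed of sound in m/s.
    """
    n_mics, n_samples = signals.shape
    # Arrival-time offset at each mic for a far-field plane wave from angle_deg.
    delays = np.asarray(mic_positions) * np.sin(np.deg2rad(angle_deg)) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for sig, d in zip(signals, delays):
        # Apply a fractional time shift in the frequency domain, then sum.
        spectrum = np.fft.rfft(sig) * np.exp(-2j * np.pi * freqs * d)
        out += np.fft.irfft(spectrum, n=n_samples)
    return out / n_mics
```

Real devices do considerably more than this (adaptive beamforming, echo cancellation, wake-word-triggered steering), but this is the core idea.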
Good stuff. I've looked into cheap array mics with Linux drivers a few times in the past, but not much is available.
Speech separation: Mitsubishi Electric Research Labs (MERL) has done some pretty impressive work on that (http://www.merl.com/demos/deep-clustering). I haven't seen equivalents of that in open-source ASR.
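For context, deep clustering trains a network that maps every time-frequency bin of the mixture spectrogram to an embedding, and at inference those embeddings are clustered (e.g. with k-means) to form per-source masks. A minimal sketch of just that clustering/masking step, assuming the embeddings already come from a trained network (which is the hard part and is not shown here):

```python
import numpy as np
from sklearn.cluster import KMeans

def separate_sources(embeddings, mixture_spec, n_sources=2):
    """Cluster per-TF-bin embeddings (deep-clustering style) into source masks.

    embeddings: (T*F, D) array produced by a trained embedding network (assumed).
    mixture_spec: (T, F) magnitude spectrogram of the mixture.
    Returns a list of n_sources masked spectrograms.
    """
    T, F = mixture_spec.shape
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(embeddings)
    masks = [(labels == k).reshape(T, F).astype(float) for k in range(n_sources)]
    return [mask * mixture_spec for mask in masks]
```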
This is, to me, one of the major problems with many algorithmic solutions to problems. An x% increase in precision, F-measure, or any other score in no way means that the results are better.
I've repeatedly seen improvements to traditional measures that make the subjective result worse.
It's incredibly hard to measure and solve (if anyone has good ideas, please let me know). I check a lot of sample data manually when we make changes; doing that, with targeted attention to the important cases, is really the only way I've found to do it.
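One low-tech thing that helps is making the manual check cheap and repeatable: diff the outputs before and after a change and pull a fixed-size sample for side-by-side review, putting the cases you care about most first. A hypothetical sketch (the data layout here is made up):

```python
import random

def sample_for_review(before, after, important_ids, n=50, seed=0):
    """Pick examples whose transcript changed between two model versions,
    putting 'important' cases first, for manual side-by-side review.

    before, after: dicts mapping example id -> transcript text.
    important_ids: set of ids we especially care about.
    """
    random.seed(seed)
    changed = [i for i in before if before[i] != after.get(i)]
    priority = [i for i in changed if i in important_ids]
    rest = [i for i in changed if i not in important_ids]
    random.shuffle(rest)
    return (priority + rest)[:n]
```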
If you've got a dictation system on a phone, wouldn't a very good metric be the corrections people make after dictating?
I guess a problem would be if people become so used to errors that they send messages without corrections. I have some friends who do this: they send garbled messages that I have to read out loud to understand. But there will always be a subset of people who want to get it right.
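That signal is essentially free if you log both the raw dictation and the message the user actually sends; a word-level edit distance between the two gives a per-message "correction rate". A rough sketch of what that could look like:

```python
def word_edit_distance(a, b):
    """Standard word-level Levenshtein distance between two strings."""
    a, b = a.split(), b.split()
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            cur = min(dp[j] + 1,           # delete a word from the dictation
                      dp[j - 1] + 1,       # insert a word the user typed
                      prev + (wa != wb))   # substitution (or match if equal)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def correction_rate(raw_dictation, sent_message):
    """Fraction of dictated words the user changed before sending."""
    n = max(len(raw_dictation.split()), 1)
    return word_edit_distance(raw_dictation, sent_message) / n

print(correction_rate("meet me at the gate bridge", "meet me at the golden gate bridge"))
```

As you say, it undercounts for people who send garbled messages uncorrected, but averaged over many users it would still track real-world accuracy better than a lab WER number.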
Yes, exactly, the raw number of word errors is a very simplistic way to judge the accuracy of a transcription. Which words were incorrect and to what degree the failures changed the meaning are ultimately far more important. And while the test described in the article is a useful way to compare progress over time, it is clearly not nearly broad enough to cover the full range of scenarios humans will rightly expect automated speech recognition that "works" to be able to handle.