It's always been a puzzle to me that published WER is so low, and yet when I use dictation (which is a lot--I use it for almost all text messages) I always have to make numerous corrections.
This sentence is crucial and suggests a way to understand how WER can be apparently lower than human levels, yet ASR is still obviously imperfect:
> When comparing models to humans, it’s important to check the nature of the mistakes and not just look at the WER as a conclusive number. In my own experience, human transcribers tend to make fewer and less drastic semantic errors than speech recognizers.
This suggests that humans make mistakes mostly on semantically unimportant words, while computers make mistakes more uniformly. That is, humans are able to allocate effort to getting the important words right, with fewer resources going to the less important ones. So maybe the way to improve speech recognition is not to focus on raw WER, but on WER weighted by word importance, or to train speech recognition systems end-to-end on some downstream language task so that the DNN or whatever learns to recognize the words that matter for the task.
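For what it's worth, the weighting idea is easy to prototype: run the usual Levenshtein alignment for WER, but charge each error the "importance" of the word involved instead of a flat cost of 1. A rough sketch (the stopword-based importance function is just a stand-in, not anything from the article):

```python
def weighted_wer(ref, hyp, weight):
    """Levenshtein-style alignment where each error costs the 'importance'
    of the word involved, rather than a flat 1.

    ref, hyp: lists of words; weight(word) -> float (assumed importance function).
    Returns total weighted error divided by total reference weight.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum weighted cost of aligning ref[:i] with hyp[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + weight(ref[i - 1])           # deletion
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + weight(hyp[j - 1])           # insertion
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if ref[i - 1] == hyp[j - 1] else weight(ref[i - 1])
            dp[i][j] = min(dp[i - 1][j - 1] + sub,              # match / substitution
                           dp[i - 1][j] + weight(ref[i - 1]),   # deletion
                           dp[i][j - 1] + weight(hyp[j - 1]))   # insertion
    total = sum(weight(w) for w in ref) or 1.0
    return dp[n][m] / total

# Toy importance function: content words matter more than function words.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "it", "i"}

def importance(w):
    return 0.2 if w.lower() in STOPWORDS else 1.0

print(weighted_wer("send the report to alice".split(),
                   "send a report to alex".split(), importance))
```

On that example the plain WER is 0.4 (two errors out of five words), but the weighted score is lower because one of the errors is on a throwaway function word.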
The low WER numbers you've probably seen are for conversational telephone speech with constrained subjects. ASR is much harder when the audio source is farther away from the microphone and when the topic isn't constrained.
Very true, which is why all the home assistants (Google Home, Amazon Echo, etc.) use array mics and beamforming: they get a much cleaner speech signal from far-field audio, and better WER as a result.
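The simplest version of this is delay-and-sum beamforming: estimate how much earlier or later the target sound reaches each mic, time-shift the channels so they line up, and average, so the target adds coherently while off-axis noise partially cancels. A rough numpy sketch for a linear array, with the geometry and sign conventions simplified:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, angle_deg, fs, c=343.0):
    """Delay-and-sum beamformer for a linear mic array.

    signals: (n_mics, n_samples) array of time-domain signals.
    mic_positions: positions along the array axis in metres.
    angle_deg: steering angle relative to broadside, in degrees.
    fs: sample rate in Hz; c: speed of sound in m/s.
    """
    n_mics, n_samples = signals.shape
    # Arrival-time offset at each mic for a far-field plane wave from angle_deg.
    delays = np.asarray(mic_positions) * np.sin(np.deg2rad(angle_deg)) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for sig, d in zip(signals, delays):
        # Apply a fractional time shift in the frequency domain, then sum.
        spectrum = np.fft.rfft(sig) * np.exp(-2j * np.pi * freqs * d)
        out += np.fft.irfft(spectrum, n=n_samples)
    return out / n_mics
```

Real devices do considerably more than this (adaptive beamforming, echo cancellation, wake-word-triggered steering), but this is the core idea.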
Good stuff. I've looked into cheap array mics with Linux drivers a few times in the past, but not much is available.
Speech separation: Mitsubishi Electric Research Labs (MERL) has done some pretty impressive work on that (http://www.merl.com/demos/deep-clustering). I haven't seen equivalents of that in open-source ASR.
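For context, deep clustering trains a network that maps every time-frequency bin of the mixture spectrogram to an embedding, and at inference those embeddings are clustered (e.g. with k-means) to form per-source masks. A minimal sketch of just that clustering/masking step, assuming the embeddings already come from a trained network (which is the hard part and is not shown here):

```python
import numpy as np
from sklearn.cluster import KMeans

def separate_sources(embeddings, mixture_spec, n_sources=2):
    """Cluster per-TF-bin embeddings (deep-clustering style) into source masks.

    embeddings: (T*F, D) array produced by a trained embedding network (assumed).
    mixture_spec: (T, F) magnitude spectrogram of the mixture.
    Returns a list of n_sources masked spectrograms.
    """
    T, F = mixture_spec.shape
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(embeddings)
    masks = [(labels == k).reshape(T, F).astype(float) for k in range(n_sources)]
    return [mask * mixture_spec for mask in masks]
```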
This is, to me, one of the major problems with many algorithmic solutions to problems. An x% increase in precision, F-measure, or any other score in no way means that the results are better.
I've repeatedly seen improvements to traditional measures that make the subjective result worse.
It's incredibly hard to measure and solve (if anyone has good ideas, please let me know). I check a lot of sample data manually when we make changes; doing that, with targeted attention to the important cases, is really the only way I've found to do it.
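One low-tech thing that helps is making the manual check cheap and repeatable: diff the outputs before and after a change and pull a fixed-size sample for side-by-side review, putting the cases you care about most first. A hypothetical sketch (the data layout here is made up):

```python
import random

def sample_for_review(before, after, important_ids, n=50, seed=0):
    """Pick examples whose transcript changed between two model versions,
    putting 'important' cases first, for manual side-by-side review.

    before, after: dicts mapping example id -> transcript text.
    important_ids: set of ids we especially care about.
    """
    random.seed(seed)
    changed = [i for i in before if before[i] != after.get(i)]
    priority = [i for i in changed if i in important_ids]
    rest = [i for i in changed if i not in important_ids]
    random.shuffle(rest)
    return (priority + rest)[:n]
```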
If you've got a dictation system on a phone, wouldn't a very good metric be the corrections people make after dictating?
I guess a problem would be if people become so used to errors that they send messages without corrections. I have some friends who do this: they send garbled messages that I have to read out loud to understand. But there will always be a subset of people who want to get it right.
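That signal is essentially free if you log both the raw dictation and the message the user actually sends; a word-level edit distance between the two gives a per-message "correction rate". A rough sketch of what that could look like:

```python
def word_edit_distance(a, b):
    """Standard word-level Levenshtein distance between two strings."""
    a, b = a.split(), b.split()
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            cur = min(dp[j] + 1,           # delete a word from the dictation
                      dp[j - 1] + 1,       # insert a word the user typed
                      prev + (wa != wb))   # substitution (or match if equal)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def correction_rate(raw_dictation, sent_message):
    """Fraction of dictated words the user changed before sending."""
    n = max(len(raw_dictation.split()), 1)
    return word_edit_distance(raw_dictation, sent_message) / n

print(correction_rate("meet me at the gate bridge", "meet me at the golden gate bridge"))
```

As you say, it undercounts for people who send garbled messages uncorrected, but averaged over many users it would still track real-world accuracy better than a lab WER number.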
Yes, exactly, the raw number of word errors is a very simplistic way to judge the accuracy of a transcription. Which words were incorrect and to what degree the failures changed the meaning are ultimately far more important. And while the test described in the article is a useful way to compare progress over time, it is clearly not nearly broad enough to cover the full range of scenarios humans will rightly expect automated speech recognition that "works" to be able to handle.