Neural networks are used for nearly all of ASR now. Last I heard, only the spectral components were still calculated without a neural net, and text-to-speech is now entirely neural (i.e. you feed text in and get audio samples out). I'd be surprised if they don't do the same for ASR soon, if they haven't already.
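For reference, that remaining non-neural "spectral" step is usually just a log-mel filterbank computed with plain DSP before anything learned touches the signal. A minimal sketch, assuming librosa and 16 kHz mono audio (the filename and frame parameters are illustrative, not from any particular system):

  import librosa

  # Load 16 kHz mono audio (placeholder filename).
  y, sr = librosa.load("utterance.wav", sr=16000)

  # 25 ms windows, 10 ms hop, 80 mel bands: a common front-end configuration.
  mel = librosa.feature.melspectrogram(
      y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
  log_mel = librosa.power_to_db(mel)  # shape (80, num_frames)

  # log_mel is the hand-crafted "spectral component"; everything after this
  # point (acoustic model, decoding) is where the neural nets come in.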


Although some models are end-to-end neural nets, most of the ones in production (and all of the ones that get state-of-the-art results) use a neural net for only one part of the process. Lots of people are as surprised as you, but that's the way it is.

Edit: I should say that state-of-the-art systems tend to have multiple components, including multiple neural nets and the tricky "decode graph" that gok and I are talking about. These are trained separately and then stuck together, as opposed to being trained in an end-to-end fashion.
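To make that split concrete, here's a toy sketch of how the pieces meet at decode time: a neural acoustic model emits per-frame scores, and a separately built decoding graph (shrunk here to a dictionary of word pronunciations) is searched with Viterbi over those scores. All the phones, words and scores below are made up for illustration; real systems use WFSTs, context-dependent states and a language model.

  import numpy as np

  # Pretend output of a neural acoustic model: log P(phone | frame)
  # for 6 frames over a toy phone set.
  phones = ["h", "e", "l", "o"]
  log_probs = np.log(np.array([
      [0.7, 0.1, 0.1, 0.1],   # frame 0: probably "h"
      [0.1, 0.7, 0.1, 0.1],   # frame 1: probably "e"
      [0.1, 0.1, 0.7, 0.1],
      [0.1, 0.1, 0.7, 0.1],
      [0.1, 0.1, 0.1, 0.7],
      [0.1, 0.1, 0.1, 0.7],
  ]))

  # Stand-in for the decoding graph: allowed phone sequences per word.
  lexicon = {"hello": ["h", "e", "l", "o"], "lo": ["l", "o"]}

  def viterbi_score(seq):
      # Best monotonic alignment of a phone sequence to the frames.
      T, S = log_probs.shape[0], len(seq)
      idx = [phones.index(p) for p in seq]
      dp = np.full((T, S), -np.inf)
      dp[0, 0] = log_probs[0, idx[0]]
      for t in range(1, T):
          for s in range(S):
              stay = dp[t - 1, s]
              advance = dp[t - 1, s - 1] if s > 0 else -np.inf
              dp[t, s] = max(stay, advance) + log_probs[t, idx[s]]
      return dp[T - 1, S - 1]

  # "Decode": pick the word whose pronunciation aligns best with the frames.
  best = max(lexicon, key=lambda w: viterbi_score(lexicon[w]))
  print(best)  # -> "hello"

The acoustic model and the graph are built and trained independently; the search step above is the only place they interact.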


Separating the acoustic model and the decoding-graph search makes sense, since you would otherwise need a huge amount of (correctly!) transcribed speech for training. See, for example, this paper by Google [1], where they used 125,000 hours (after filtering out the badly transcribed ones from the original 500,000 hours of transcribed speech) to train an end-to-end acoustic-to-word model. Good "old-school" DNN acoustic models can already be trained with orders of magnitude less training data (hundreds to thousands of hours).

[1] https://arxiv.org/abs/1610.09975
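For anyone wondering what "acoustic-to-word" means in practice: the model in [1] maps acoustic frames directly to a word vocabulary with a CTC-style loss, with no lexicon or decoding graph in between. A rough sketch of that shape, assuming PyTorch; the vocabulary size, layer count and dimensions here are placeholders, not the paper's configuration:

  import torch
  import torch.nn as nn

  class AcousticToWord(nn.Module):
      # Frames in, word posteriors out -- no phones, no decode graph.
      def __init__(self, n_mels=80, hidden=512, vocab_size=50000):
          super().__init__()
          self.encoder = nn.LSTM(n_mels, hidden, num_layers=4,
                                 bidirectional=True, batch_first=True)
          # +1 output class for the CTC blank symbol.
          self.output = nn.Linear(2 * hidden, vocab_size + 1)

      def forward(self, feats):              # feats: (batch, frames, n_mels)
          enc, _ = self.encoder(feats)
          return self.output(enc).log_softmax(dim=-1)

  model = AcousticToWord()
  ctc = nn.CTCLoss(blank=50000)              # blank index = last class

  feats = torch.randn(2, 300, 80)            # 2 fake utterances, 300 frames each
  targets = torch.randint(0, 50000, (2, 8))  # 8 fake word labels per utterance
  log_probs = model(feats).transpose(0, 1)   # CTCLoss wants (frames, batch, classes)
  loss = ctc(log_probs, targets,
             input_lengths=torch.full((2,), 300),
             target_lengths=torch.full((2,), 8))
  loss.backward()

Because the target units are whole words, the model has to learn pronunciation, context and (implicitly) some language modelling from data alone, which is exactly why it needs so many transcribed hours.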


Yes, exactly. I do wonder whether a similarly good end-to-end system could be trained by constraining the alignments, as I've seen done in some papers.


AFAIK state-of-the-art models are hybrids of HMM/GMM systems and CNNs for phoneme classification. There are exotic CTC/RNN-based architectures for end-to-end recognition, but they aren't state of the art.
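For the curious, the CNN in such a hybrid is just a frame-level classifier: it looks at a window of spectral features and outputs phone (or HMM-state) posteriors, which the HMM/decoding machinery then turns into words. A minimal PyTorch sketch under those assumptions (the phone inventory size and layer sizes are illustrative):

  import torch
  import torch.nn as nn

  N_PHONES = 40  # illustrative size of the phone/HMM-state inventory

  # Treat the log-mel spectrogram as a 1-channel image and classify each frame.
  phone_classifier = nn.Sequential(
      nn.Conv2d(1, 32, kernel_size=3, padding=1),
      nn.ReLU(),
      nn.Conv2d(32, 32, kernel_size=3, padding=1),
      nn.ReLU(),
      nn.AdaptiveAvgPool2d((1, None)),      # pool over the frequency axis
      nn.Flatten(start_dim=1, end_dim=2),   # -> (batch, 32, frames)
  )
  to_posteriors = nn.Conv1d(32, N_PHONES, kernel_size=1)

  log_mel = torch.randn(1, 1, 80, 300)       # (batch, channel, mels, frames)
  feats = phone_classifier(log_mel)          # (1, 32, 300)
  phone_logits = to_posteriors(feats)        # (1, N_PHONES, 300)
  phone_posteriors = phone_logits.log_softmax(dim=1)
  # These per-frame posteriors are what gets handed to the HMM/decode-graph search.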



