Neural networks are used for nearly all of ASR now. Last I heard, only the spectral components were still calculated without a neural net, and text-to-speech is now entirely neural (i.e. you feed text in and get audio samples out). I'd be surprised if they don't do the same for ASR soon, if they haven't already.
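For reference, that remaining non-neural "spectral" step is usually just a log-mel filterbank computed with plain DSP before anything learned touches the signal. A minimal sketch, assuming librosa and 16 kHz mono audio (the filename and frame parameters are illustrative, not from any particular system):

  import librosa

  # Load 16 kHz mono audio (placeholder filename).
  y, sr = librosa.load("utterance.wav", sr=16000)

  # 25 ms windows, 10 ms hop, 80 mel bands: a common front-end configuration.
  mel = librosa.feature.melspectrogram(
      y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
  log_mel = librosa.power_to_db(mel)  # shape (80, num_frames)

  # log_mel is the hand-crafted "spectral component"; everything after this
  # point (acoustic model, decoding) is where the neural nets come in.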


Although some models are end-to-end neural nets, most of the ones in production (and all of the ones that get state-of-the-art results) use a neural net for only one part of the process. Lots of people are as surprised as you, but that's the way it is.

Edit: I should say that state-of-the-art systems tend to have multiple components, including multiple neural nets and the tricky "decode graph" that gok and I are talking about. These are trained separately and then stuck together, as opposed to being trained in an end-to-end fashion.
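To make that split concrete, here's a toy sketch of how the pieces meet at decode time: a neural acoustic model emits per-frame scores, and a separately built decoding graph (shrunk here to a dictionary of word pronunciations) is searched with Viterbi over those scores. All the phones, words and scores below are made up for illustration; real systems use WFSTs, context-dependent states and a language model.

  import numpy as np

  # Pretend output of a neural acoustic model: log P(phone | frame)
  # for 6 frames over a toy phone set.
  phones = ["h", "e", "l", "o"]
  log_probs = np.log(np.array([
      [0.7, 0.1, 0.1, 0.1],   # frame 0: probably "h"
      [0.1, 0.7, 0.1, 0.1],   # frame 1: probably "e"
      [0.1, 0.1, 0.7, 0.1],
      [0.1, 0.1, 0.7, 0.1],
      [0.1, 0.1, 0.1, 0.7],
      [0.1, 0.1, 0.1, 0.7],
  ]))

  # Stand-in for the decoding graph: allowed phone sequences per word.
  lexicon = {"hello": ["h", "e", "l", "o"], "lo": ["l", "o"]}

  def viterbi_score(seq):
      # Best monotonic alignment of a phone sequence to the frames.
      T, S = log_probs.shape[0], len(seq)
      idx = [phones.index(p) for p in seq]
      dp = np.full((T, S), -np.inf)
      dp[0, 0] = log_probs[0, idx[0]]
      for t in range(1, T):
          for s in range(S):
              stay = dp[t - 1, s]
              advance = dp[t - 1, s - 1] if s > 0 else -np.inf
              dp[t, s] = max(stay, advance) + log_probs[t, idx[s]]
      return dp[T - 1, S - 1]

  # "Decode": pick the word whose pronunciation aligns best with the frames.
  best = max(lexicon, key=lambda w: viterbi_score(lexicon[w]))
  print(best)  # -> "hello"

The acoustic model and the graph are built and trained independently; the search step above is the only place they interact.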


Separating the acoustic model and the decoding-graph search makes sense, since you would otherwise need a huge amount of (correctly!) transcribed speech for training. See, for example, this paper by Google [1], where they used 125,000 hours (after filtering out the badly transcribed ones from the original 500,000 hours of transcribed speech) to train an end-to-end acoustic-to-word model. Good "old-school" DNN acoustic models can already be trained with orders of magnitude less training data (hundreds to thousands of hours).

[1] https://arxiv.org/abs/1610.09975
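For anyone wondering what "acoustic-to-word" means in practice: the model in [1] maps acoustic frames directly to a word vocabulary with a CTC-style loss, with no lexicon or decoding graph in between. A rough sketch of that shape, assuming PyTorch; the vocabulary size, layer count and dimensions here are placeholders, not the paper's configuration:

  import torch
  import torch.nn as nn

  class AcousticToWord(nn.Module):
      # Frames in, word posteriors out -- no phones, no decode graph.
      def __init__(self, n_mels=80, hidden=512, vocab_size=50000):
          super().__init__()
          self.encoder = nn.LSTM(n_mels, hidden, num_layers=4,
                                 bidirectional=True, batch_first=True)
          # +1 output class for the CTC blank symbol.
          self.output = nn.Linear(2 * hidden, vocab_size + 1)

      def forward(self, feats):              # feats: (batch, frames, n_mels)
          enc, _ = self.encoder(feats)
          return self.output(enc).log_softmax(dim=-1)

  model = AcousticToWord()
  ctc = nn.CTCLoss(blank=50000)              # blank index = last class

  feats = torch.randn(2, 300, 80)            # 2 fake utterances, 300 frames each
  targets = torch.randint(0, 50000, (2, 8))  # 8 fake word labels per utterance
  log_probs = model(feats).transpose(0, 1)   # CTCLoss wants (frames, batch, classes)
  loss = ctc(log_probs, targets,
             input_lengths=torch.full((2,), 300),
             target_lengths=torch.full((2,), 8))
  loss.backward()

Because the target units are whole words, the model has to learn pronunciation, context and (implicitly) some language modelling from data alone, which is exactly why it needs so many transcribed hours.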


Yes, exactly. I do wonder whether a similarly good end-to-end system could be trained by constraining the alignments, as I've seen done in some papers.


AFAIK state-of-the-art models are hybrids of HMM/GMM systems and CNNs for phoneme classification. There are exotic CTC/RNN-based architectures for end-to-end recognition, but they aren't state of the art.
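For the curious, the CNN in such a hybrid is just a frame-level classifier: it looks at a window of spectral features and outputs phone (or HMM-state) posteriors, which the HMM/decoding machinery then turns into words. A minimal PyTorch sketch under those assumptions (the phone inventory size and layer sizes are illustrative):

  import torch
  import torch.nn as nn

  N_PHONES = 40  # illustrative size of the phone/HMM-state inventory

  # Treat the log-mel spectrogram as a 1-channel image and classify each frame.
  phone_classifier = nn.Sequential(
      nn.Conv2d(1, 32, kernel_size=3, padding=1),
      nn.ReLU(),
      nn.Conv2d(32, 32, kernel_size=3, padding=1),
      nn.ReLU(),
      nn.AdaptiveAvgPool2d((1, None)),      # pool over the frequency axis
      nn.Flatten(start_dim=1, end_dim=2),   # -> (batch, 32, frames)
  )
  to_posteriors = nn.Conv1d(32, N_PHONES, kernel_size=1)

  log_mel = torch.randn(1, 1, 80, 300)       # (batch, channel, mels, frames)
  feats = phone_classifier(log_mel)          # (1, 32, 300)
  phone_logits = to_posteriors(feats)        # (1, N_PHONES, 300)
  phone_posteriors = phone_logits.log_softmax(dim=1)
  # These per-frame posteriors are what gets handed to the HMM/decode-graph search.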



