Separating the acoustic model from the decoding-graph search makes sense, since an end-to-end system needs a huge amount of (correctly!) transcribed speech for training. See, for example, this paper by Google [1], where they trained an end-to-end acoustic-to-word model on 125,000 hours of transcribed speech (what remained of the original 500,000 hours after filtering out the badly transcribed parts). Good "old-school" DNN acoustic models can already be trained with orders of magnitude less data (hundreds to thousands of hours).

[1] https://arxiv.org/abs/1610.09975
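To make the separation concrete, here is a toy Python/numpy sketch: the "graph" half is a hand-built phone transition matrix (a real hybrid system would use an HCLG WFST composed from HMM topology, context, lexicon, and LM, all built without audio), and the acoustic model's only job is to supply per-frame posteriors. Everything here is made up for illustration:

    import numpy as np

    phones = ["sil", "h", "e", "l", "o"]
    P = len(phones)

    # The "graph" half: log transition costs between phones, built from
    # text resources alone. A toy stand-in for a composed decoding graph.
    logA = np.full((P, P), -np.inf)
    logA[0, 0] = np.log(0.5); logA[0, 1] = np.log(0.5)  # sil -> sil | h
    logA[1, 2] = 0.0                                    # h -> e
    logA[2, 3] = 0.0                                    # e -> l
    logA[3, 3] = np.log(0.5); logA[3, 4] = np.log(0.5)  # l -> l | o
    logA[4, 4] = 0.0                                    # o -> o

    def viterbi(log_post, logA):
        # Best path through the graph given frame-level acoustic scores.
        T, P = log_post.shape
        delta = np.full((T, P), -np.inf)
        back = np.zeros((T, P), dtype=int)
        delta[0, 0] = log_post[0, 0]  # assume decoding starts in "sil"
        for t in range(1, T):
            scores = delta[t - 1][:, None] + logA  # (from, to)
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_post[t]
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return [phones[s] for s in reversed(path)]

    # The acoustic half: a DNN would emit these per-frame log posteriors;
    # random values here just to keep the sketch runnable.
    rng = np.random.default_rng(0)
    log_post = np.log(rng.dirichlet(np.ones(P), size=12))
    print(viterbi(log_post, logA))

The point is that only the acoustic half ever sees audio, so only that part's training budget is bounded by how much transcribed speech you have; the graph half can be rebuilt from text alone.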



Yes, exactly. I do wonder whether a similarly good end-to-end system could be trained on far less data by constraining the alignments, as I've seen done in some papers.
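For anyone curious what "constraining the alignments" might look like mechanically, here is a rough numpy sketch of one variant of the idea: restrict the monotonic alignment lattice so each label can only be emitted within a tolerance window around its frame position from an existing forced alignment. The function name and the `tol` parameter are made up for illustration and aren't from any particular paper or library:

    import numpy as np

    def constrained_forward(log_post, labels, ref_frames, tol=3):
        # Log-prob summed over monotonic alignments of `labels` to frames,
        # keeping only paths where label i is active within `tol` frames
        # of its forced-alignment position ref_frames[i].
        T, _ = log_post.shape
        L = len(labels)
        # mask[t, i]: may label i be emitted at frame t?
        mask = np.abs(np.arange(T)[:, None] - np.asarray(ref_frames)) <= tol
        alpha = np.full((T, L), -np.inf)
        if mask[0, 0]:
            alpha[0, 0] = log_post[0, labels[0]]
        for t in range(1, T):
            stay = alpha[t - 1]                    # keep emitting label i
            move = np.full(L, -np.inf)
            move[1:] = alpha[t - 1, :-1]           # advance to label i+1
            alpha[t] = np.logaddexp(stay, move) + log_post[t, labels]
            alpha[t, ~mask[t]] = -np.inf           # enforce the window
        return alpha[-1, -1]

    # Toy inputs: 20 frames, 6 output units, 3 reference label positions.
    rng = np.random.default_rng(0)
    log_post = np.log(rng.dirichlet(np.ones(6), size=20))
    print(constrained_forward(log_post, [2, 4, 1], ref_frames=[3, 9, 16]))

Since masked-out paths contribute nothing, gradients only flow through alignments near the reference, which is the sense in which the external alignment "constrains" end-to-end training; widening tol relaxes it back toward an unconstrained loss.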



