Separating the acoustic model from the decoding-graph search makes sense, because training a single end-to-end model requires a huge amount of (correctly!) transcribed speech. See, for example, this paper by Google [1], where an end-to-end acoustic-to-word model was trained on 125,000 hours of transcribed speech (what remained of the original 500,000 hours after badly transcribed material was filtered out). Good "old-school" DNN acoustic models, by contrast, can be trained with orders of magnitude less data (hundreds to thousands of hours).
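
The reason the split saves transcribed data is the classic noisy-channel factorization (standard notation, not taken from [1]): the decoder searches for

    \hat{W} = \arg\max_{W} \; p(O \mid W)\, P(W)

where only the acoustic model p(O | W) needs transcribed audio; the language model P(W), which the decoding graph encodes, can be trained on text alone, and text is cheap and plentiful.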
[1] Soltau, Liao, Sak, "Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition", https://arxiv.org/abs/1610.09975