Yes, we actually did research and spent a significant amount of GPU time. Thanks to the luck of stumbling upon the right people at the right time, I could afford cloud-scale training because OVH granted me very generous rebates...
The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting things with the loss of the language model. Or in short:
We don't train what you don't need to hear
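Roughly, the shape of the idea in PyTorch. This is only a sketch I'm improvising here, not the actual training code: the real model trains with CTC, and `TinyCharLM`, `per_char_lm_loss`, and `weighted_acoustic_loss` are made-up stand-ins for a proper pretrained language model and the real weighting step.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a character-level language model. In practice you'd use a
# pretrained LM; any model that yields per-character logits illustrates the idea.
class TinyCharLM(torch.nn.Module):
    def __init__(self, vocab_size=32, hidden=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.rnn = torch.nn.GRU(hidden, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, vocab_size)

    def forward(self, ids):                 # ids: (batch, time)
        h, _ = self.rnn(self.embed(ids))
        return self.head(h)                 # logits: (batch, time, vocab)

def per_char_lm_loss(lm, ids):
    """Negative log-likelihood of each character given its left context."""
    logits = lm(ids[:, :-1])                # predict char t from chars < t
    return F.cross_entropy(logits.transpose(1, 2), ids[:, 1:], reduction="none")

# Characters the LM finds surprising carry real acoustic information;
# characters it can already guess from context carry almost none. So the
# acoustic training signal gets scaled by the LM's per-character loss:
def weighted_acoustic_loss(per_char_acoustic_loss, per_char_lm_nll):
    weights = per_char_lm_nll / per_char_lm_nll.sum(dim=-1, keepdim=True)
    return (weights * per_char_acoustic_loss).sum(dim=-1).mean()
```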
Yes and no. With perfect audio quality, it'll write down almost verbatim what you said. But as the audio gets more noisy, it'll shift more and more towards the most likely interpretation.
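Concretely, the decoder picks the hypothesis with the best combined acoustic + language-model score. The numbers and the `lm_weight` below are invented for illustration, but the mechanism is the standard shallow-fusion one:

```python
def fused_score(acoustic_logp, lm_logp, lm_weight=0.6):
    """Hypothesis score during beam search: acoustic evidence plus LM prior."""
    return acoustic_logp + lm_weight * lm_logp

# Clean audio: the acoustic model is confident, so the verbatim hypothesis wins.
clean = {"verbatim": fused_score(-1.0, -8.0), "plausible": fused_score(-9.0, -2.0)}

# Noisy audio: the acoustic scores flatten out, so the LM prior decides.
noisy = {"verbatim": fused_score(-5.0, -8.0), "plausible": fused_score(-5.5, -2.0)}

print(max(clean, key=clean.get))  # verbatim
print(max(noisy, key=noisy.get))  # plausible
```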
> The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting things with the loss of the language model. Or in short:
> We don't train what you don't need to hear
This does sound a lot more interesting than the ~280 lines of code.
For a researcher, yes. But to understand the trick, you need to have read and understood the CTC loss paper (there's a minimal example of that loss right after this comment).
For people like my industry clients, on the other hand, "code that is easy to audit and easy to install" is a core feature. They don't care about the research; they just want to make audio files searchable.
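To make that CTC reference concrete, here's a minimal example using PyTorch's built-in CTC loss. It's my own toy snippet, not code from the repo; the point is that CTC marginalizes over every possible alignment of the target tokens onto the acoustic frames, and that's where the loss allocation happens.

```python
import torch
import torch.nn as nn

# T acoustic frames, batch of N, vocabulary of C tokens with index 0 as blank,
# and a target transcript of S tokens.
T, N, C, S = 50, 1, 32, 12
log_probs = torch.randn(T, N, C).log_softmax(dim=-1).requires_grad_()
targets = torch.randint(1, C, (N, S))            # target token ids (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC sums over every alignment of the 12 targets onto the 50 frames.
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())
```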
If you want to play around with the TEVR tokenizer design, here's the source for that: https://huggingface.co/fxtentacle/tevr-token-entropy-predict...