
Yes, we actually did research and spent a significant amount of GPU time. Thanks to the luck of stumbling into the right people at the right time, I could afford cloud-scale training: OVH granted me very generous rebates...

The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting the acoustic loss with the loss of the language model. Or in short:

We don't train what you don't need to hear
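
To make that idea concrete, here is a minimal PyTorch sketch of the weighting principle. This is an illustration only, not the actual TEVR training code: all names (lm_weighted_loss, acoustic_logits, lm_log_probs) are made up, and the real model trains with CTC loss rather than the plain per-token cross-entropy used here.

    import torch
    import torch.nn.functional as F

    def lm_weighted_loss(acoustic_logits, lm_log_probs, targets):
        # acoustic_logits: (batch, seq, vocab) raw scores from the acoustic model
        # lm_log_probs:    (batch, seq, vocab) log-probs from the language model
        # targets:         (batch, seq) ground-truth token ids
        # Per-token cross-entropy of the acoustic model, not yet reduced.
        per_token = F.cross_entropy(
            acoustic_logits.transpose(1, 2), targets, reduction="none")
        # LM "surprise" per token: -log p_LM(token | context).
        lm_surprise = -lm_log_probs.gather(
            -1, targets.unsqueeze(-1)).squeeze(-1)
        # Tokens the LM already predicts get small weights:
        # "we don't train what you don't need to hear".
        weights = lm_surprise / (lm_surprise.sum(dim=-1, keepdim=True) + 1e-8)
        return (weights * per_token).sum(dim=-1).mean()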

If you want to play around with the TEVR tokenizer design, here's the source for that: https://huggingface.co/fxtentacle/tevr-token-entropy-predict...



So does this mean the recogniser would be worse at recognising unexpected utterances, which is roughly what you'd see with human recognition?

What's the German equivalent of "How to Wreck a Nice Beach"?


Yes and no. With perfect audio quality, it'll write down almost verbatim what you said. But as the audio gets more noisy, it'll shift more and more towards the most likely interpretation.
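
In decoder terms (an illustrative sketch, not this project's code): a typical CTC decoder with language-model fusion scores each hypothesis as a weighted sum of acoustic and LM log-probabilities, so when noise flattens the acoustic scores, the LM term takes over. The function and parameter names here are hypothetical.

    def hypothesis_score(acoustic_logp: float, lm_logp: float,
                         alpha: float = 0.5) -> float:
        # Clean audio: acoustic_logp dominates, transcript is near-verbatim.
        # Noisy audio: acoustic_logp flattens, alpha * lm_logp dominates and
        # decoding drifts toward the most likely interpretation.
        return acoustic_logp + alpha * lm_logp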


Eishockey, Kanufahren, Wirsing. ("Ice hockey, canoeing, savoy cabbage")

(Which, spoken in German, sounds like "Alles okay. Kann noch fahren. Wiedersehen!", i.e. "All okay. Can still drive. Goodbye!")


> The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting the acoustic loss with the loss of the language model. Or in short:

> We don't train what you don't need to hear

This does sound a lot more interesting than the ~280 lines of code.


For a researcher, yes. But to understand the trick, you need to have read and understood the CTC loss paper.
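
For reference, this is the stock CTC loss as exposed by PyTorch (the standard API the paper builds on, not the paper's modification):

    import torch

    ctc = torch.nn.CTCLoss(blank=0)
    # (time, batch, vocab) log-probabilities from the acoustic model
    log_probs = torch.randn(50, 4, 32).log_softmax(2)
    targets = torch.randint(1, 32, (4, 20), dtype=torch.long)
    input_lengths = torch.full((4,), 50, dtype=torch.long)
    target_lengths = torch.full((4,), 20, dtype=torch.long)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)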

For people like my industry clients, on the other hand, "code that is easy to audit and easy to install" is a core feature. They don't care about the research; they just want to make audio files searchable.



