
Author here. Speech-to-text is more or less solved: it's easy to automatically get captions with precise timestamps. For training Moshi, Kyutai's audio LLM, my colleagues used whisper-timestamped to transcribe 7 million hours of audio.

See Section 4.2 in the Moshi paper: https://arxiv.org/pdf/2410.00037



Sweet!



