Author here. Speech-to-text is more or less solved, it's easy to automatically g...

Author here. Speech-to-text is more or less solved, it's easy to automatically get captions including precise timestamps. For training Moshi, Kyutai's audio LLM, my colleagues used whisper-timestamped to transcribe 7 million hours of audio.

See Section 4.2 in the Moshi paper: https://arxiv.org/pdf/2410.00037