Author here, thanks for the kind words! I think such a physics-based codec is unlikely to happen: in general, machine learning keeps moving away from handcrafted, domain-specific assumptions toward leaving as much as possible to the model. The more assumptions you bake in, the smaller the space of sounds you can model, so the quality is capped. Basically, modern ML is just about putting the right data into transformers.

That being said, having a more constrained model can also lead to some really cool stuff. The DDSP paper learns to control a synthesizer to mimic instruments: https://arxiv.org/abs/2001.04643
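
For intuition, here's a minimal sketch of the DDSP idea in PyTorch: an additive harmonic synthesizer that is differentiable in its control signals (f0 and per-harmonic amplitudes), so a network predicting those controls can be trained end-to-end on an audio loss. The function name and shapes are my own simplification, not the paper's actual model:

    import torch

    def harmonic_synth(f0, amps, sample_rate=16000):
        """Additive harmonic synth, differentiable w.r.t. its controls.

        f0:   (n_samples,) fundamental frequency in Hz
        amps: (n_samples, n_harmonics) per-harmonic amplitudes
        """
        # Instantaneous phase is the integral (cumulative sum) of frequency.
        phase = 2 * torch.pi * torch.cumsum(f0 / sample_rate, dim=0)
        # Harmonic k oscillates at k * f0.
        k = torch.arange(1, amps.shape[-1] + 1)
        return (torch.sin(phase[:, None] * k) * amps).sum(dim=-1)

    # Render one second of a 220 Hz tone with 8 decaying harmonics;
    # in DDSP these controls would come from a neural network instead.
    n, n_harmonics = 16000, 8
    f0 = torch.full((n,), 220.0)
    amps = torch.softmax(-torch.arange(n_harmonics, dtype=torch.float32), dim=0).expand(n, -1)
    audio = harmonic_synth(f0, amps)  # shape: (16000,)

Because the output can only ever be a sum of harmonics, the space of sounds is small, and that constraint is exactly what buys the parameter efficiency.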

You could probably do something similar for a speech model. The result would not sound as good, but you could get away with far fewer parameters, because much of the modelling work is done by the assumptions you put in.

Compare also KokoroTTS, a TTS model that stays tiny because it uses a handcrafted system to turn text into phonemes and then just synthesizes from those phonemes: https://huggingface.co/spaces/hexgrad/Kokoro-TTS
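
To make that two-stage structure concrete, here's a rough sketch, not Kokoro's actual API: the G2P stage uses the phonemizer library (a wrapper around espeak-ng's rule-based frontend, similar in spirit to what Kokoro uses), and synthesize_from_phonemes is a hypothetical stand-in for the small learned acoustic model:

    from phonemizer import phonemize  # handcrafted, rule-based G2P via espeak-ng

    # Stage 1: no learned parameters needed to get from text to phonemes.
    phonemes = phonemize("Hello world", language="en-us", backend="espeak")

    # Stage 2: the learned model only maps phonemes to audio, a much
    # narrower problem than raw text-to-speech, so it can stay small.
    audio = synthesize_from_phonemes(phonemes)  # hypothetical model call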


