Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

We support voice cloning so you can mimic the sound of any real voice (or try to create random ones). The prosody/emotions are more difficult to control right now but we are looking into this.

To check how this works in practice you can check the Google Collab link, at the end we are cloning the voice from a Churchill's speech over radio.



Sounds excellent! What are the requirements to run this regarding hardware? How much VRAM? Does it work on AMD or Intel Arc?


Both models are using around 3GB right now (converted into FP16 for speed). But I checked that the (slower) FP32 version uses 2.3GB so we are probably doing something suboptimal here.

We support CUDA right now although it should not be too hard to port it to whisper/llama.cpp or Apple's MLX. It's a pretty straightforward transformer architecture.




Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: