We support voice cloning, so you can mimic the sound of any real voice (or try to generate random ones). Prosody and emotion are harder to control right now, but we are looking into this.
To see how this works in practice, check the Google Colab link; at the end we clone a voice from one of Churchill's radio speeches.
Both models use around 3GB right now (converted to FP16 for speed). However, I checked that the (slower) FP32 version uses only 2.3GB, so we are probably doing something suboptimal here.
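As a back-of-the-envelope sanity check on those numbers (a sketch, not a measurement of our models): raw parameter storage is just parameter count times bytes per parameter, so an FP16 copy should be roughly half the size of the FP32 one, not larger.

```python
def model_size_gb(n_params: int, bytes_per_param: int) -> float:
    """Raw parameter storage in GB (1 GB = 2**30 bytes)."""
    return n_params * bytes_per_param / 2**30

# Working backwards from the 2.3 GB FP32 figure: roughly 617M parameters.
n_params = int(2.3 * 2**30 / 4)
print(round(model_size_gb(n_params, 4), 2))  # FP32: ~2.3 GB
print(round(model_size_gb(n_params, 2), 2))  # FP16: ~1.15 GB expected, not 3 GB
```

The gap suggests the FP16 export is carrying extra state (e.g. duplicated weights or buffers kept in FP32) rather than the cast itself being the problem.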
We only support CUDA right now, although it should not be too hard to port to whisper.cpp/llama.cpp or Apple's MLX. It's a pretty straightforward transformer architecture.
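To give a feel for why a port is tractable (a minimal NumPy sketch of the core operation, not our actual code): the heavy lifting in a transformer is scaled dot-product attention, which is just matrix multiplies and a softmax, all of which these runtimes already provide.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (seq_len, d) arrays. Returns a (seq_len, d) array."""
    scores = q @ k.T / np.sqrt(k.shape[-1])        # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys
    return weights @ v                             # weighted sum of values

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Everything else in the block (layer norm, feed-forward, residual adds) is similarly elementwise or matmul-shaped, which is why llama.cpp-style ports tend to be mechanical.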