How is the latency for real-time TTS? I remember kicking the tires several months back but went with one of the big 3 cloud providers since they had lower latency.
I also like that the cloud provider supports SSML and lets me explicitly configure the emotion, whereas PlayHT dynamically changed the emotion based on the context of the text.
The latency is not real-time yet, but we're working on getting it to near real time. Regarding controlling the voice, we've added a few params like rate, voice guidance, and temperature, but for the most part the emotion depends on the text for now.
Low latency would open up a whole lot of interesting applications. Even ElevenLabs doesn't seem to have low enough latency in my testing to work as a convincing voice assistant or, for example, to work in real time on a phone call. For that we likely need QUIC or some kind of streaming protocol.
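To make the streaming point concrete, here is a rough sketch of how I'd compare time-to-first-chunk against total synthesis time. Note that `fake_streaming_tts` is a made-up stand-in that just sleeps and yields placeholder bytes; it is not any provider's real API, and the delay value is arbitrary.

```python
import time
from typing import Iterator, Tuple


def fake_streaming_tts(text: str, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS endpoint.

    Yields one placeholder "audio" chunk per word after a simulated delay.
    """
    for word in text.split():
        time.sleep(chunk_delay_s)   # simulated synthesis time per chunk
        yield word.encode("utf-8")  # placeholder bytes, not real audio


def measure_latency(chunks: Iterator[bytes]) -> Tuple[float, float, int]:
    """Return (seconds until first chunk, total seconds, chunk count)."""
    start = time.perf_counter()
    first_chunk_s = None
    count = 0
    for _ in chunks:
        if first_chunk_s is None:
            # With streaming, playback could begin at this point.
            first_chunk_s = time.perf_counter() - start
        count += 1
    total_s = time.perf_counter() - start
    if first_chunk_s is None:
        first_chunk_s = total_s  # nothing was produced
    return first_chunk_s, total_s, count


if __name__ == "__main__":
    first, total, n = measure_latency(
        fake_streaming_tts("hello there how can I help you today")
    )
    print(f"first chunk after {first:.3f}s, all {n} chunks after {total:.3f}s")
```

For a voice assistant, the number that matters is the first one: if audio can start playing as soon as the first chunk arrives, the total synthesis time matters much less than it does for a non-streaming request.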