A lot of our customers use us [0] for exactly that, and it works pretty well if executed properly. The voiceovers work best as inserts into an existing podcast. If you look at articles from major news orgs like the NYT, they often have a (usually machine-narrated) voiceover.
Yeah, neural codecs are pretty amazing. The most incredible part is that they compress well across the temporal domain, something that has historically been non-trivial.
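To make the temporal part concrete, here is a toy sketch (plain PyTorch, not any particular codec) of the usual shape: a strided 1-D conv encoder shrinks the time axis by ~64x and a transposed-conv decoder expands it back; real codecs add residual vector quantization and adversarial losses on top.

    import torch
    import torch.nn as nn

    # Toy illustration of temporal compression in a neural audio codec:
    # the encoder strides over time, so the latent sequence is ~64x shorter
    # than the waveform. Everything here is illustrative, not a real codec.
    class TinyCodec(nn.Module):
        def __init__(self, latent_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=8, stride=4, padding=2), nn.ELU(),
                nn.Conv1d(32, 64, kernel_size=8, stride=4, padding=2), nn.ELU(),
                nn.Conv1d(64, latent_dim, kernel_size=8, stride=4, padding=2),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose1d(latent_dim, 64, kernel_size=8, stride=4, padding=2), nn.ELU(),
                nn.ConvTranspose1d(64, 32, kernel_size=8, stride=4, padding=2), nn.ELU(),
                nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
            )

        def forward(self, wav):
            z = self.encoder(wav)              # (batch, latent_dim, time / 64)
            return self.decoder(z), z

    codec = TinyCodec()
    wav = torch.randn(1, 1, 16000)             # 1 s of dummy 16 kHz audio
    recon, z = codec(wav)
    print(wav.shape[-1], "->", z.shape[-1])    # 16000 -> 250 latent frames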
Their sentence segmentation heuristics were not configured correctly. It's not an inherent limitation of the technology itself.
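To illustrate the kind of heuristic failure that causes those weird pauses, here's a toy example (not their actual pipeline): a bare split-on-periods rule chops abbreviations into bogus "sentences", while an abbreviation-aware rule does not.

    text = "Dr. Smith arrived at 5 p.m. on Friday. The meeting started late."

    # Naive heuristic: treat every ". " as a sentence boundary. Abbreviations
    # become bogus "sentences", which a TTS engine then reads with odd pauses.
    naive = [s.strip() for s in text.split(". ") if s.strip()]
    print(naive)
    # ['Dr', 'Smith arrived at 5 p.m', 'on Friday', 'The meeting started late.']

    # Abbreviation-aware heuristic: protect known abbreviations before
    # splitting, then restore them. Crude, but it shows the failure is in the
    # segmentation rules, not in the TTS model itself.
    ABBREV = ["Dr.", "Mrs.", "Mr.", "p.m.", "a.m."]
    protected = text
    for abbr in ABBREV:
        protected = protected.replace(abbr, abbr.replace(".", "<DOT>"))
    sentences = [s.replace("<DOT>", ".").strip() for s in protected.split(". ")]
    print(sentences)
    # ['Dr. Smith arrived at 5 p.m. on Friday', 'The meeting started late.']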
The newer transformer-based generators are a bit better in this regard, since they can maintain a longer context window instead of working on tiny snippets at a time.
Since it does the signal processing in the Fourier domain, does this suffer from audio artefacts, e.g. hissing, in the output? Reconstructing a waveform from a magnitude-only spectrogram (e.g. with Griffin-Lim, as in torchaudio) means estimating the missing phase iteratively, and you can sometimes end up with noise in the output.
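For reference, this is roughly the phase-retrieval path I'm asking about, as a toy torchaudio snippet (random noise stands in for real audio):

    import torch
    import torchaudio

    # Keep only the magnitude spectrogram, then estimate the discarded phase
    # iteratively with Griffin-Lim; imperfect estimates show up as hiss/noise.
    n_fft, hop = 512, 128
    wav = torch.randn(1, 16000)  # placeholder 1 s of 16 kHz audio

    to_mag = torchaudio.transforms.Spectrogram(n_fft=n_fft, hop_length=hop, power=1.0)
    griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop,
                                                   power=1.0, n_iter=32)

    mag = to_mag(wav)          # phase is discarded here
    recon = griffin_lim(mag)   # phase must be guessed iteratively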
Not all spectral methods have such artifacts. The kind of artifacts you mention show up when you need to do phase retrieval, i.e. reconstruct a waveform from a magnitude or mel spectrogram. DeepFilterNet does spectral masking on the complex spectrogram, so there is no need for phase retrieval.
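A minimal sketch of why that sidesteps the problem: with complex-domain masking the phase is never thrown away, so a plain inverse STFT suffices (the all-ones mask is a placeholder; DeepFilterNet's actual two-stage filtering is more involved).

    import torch

    x = torch.randn(16000)                      # 1 s of dummy 16 kHz audio
    n_fft, hop = 512, 128
    window = torch.hann_window(n_fft)

    # Complex STFT keeps both magnitude and phase.
    spec = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)

    mask = torch.ones_like(spec.real)           # stand-in for a predicted mask in [0, 1]
    enhanced = spec * mask                      # masking in the complex domain

    # Phase was preserved, so the inverse STFT reconstructs the waveform
    # directly -- no Griffin-Lim / phase retrieval step needed.
    y = torch.istft(enhanced, n_fft, hop_length=hop, window=window, length=x.numel())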
> Recognizes "human" and recognizes "desk". I sit on desk. Does AI mark it as a desk or as a chair?
Not an issue if the image segmentation is advanced enough. You can train the model to understand "human sitting". It may not generalize to other animals sitting, but human action recognition is perfectly possible right now.
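As a hedged sketch of what "train the model to understand 'human sitting'" can look like in practice (hypothetical labels and dummy data, not any specific product): start from a video classifier pretrained on an action-recognition dataset and fine-tune its head.

    import torch
    import torch.nn as nn
    from torchvision.models.video import r3d_18, R3D_18_Weights

    # Hypothetical label set; "person_sitting" is the action we care about.
    classes = ["person_sitting", "person_standing", "person_walking"]

    # Clip classifier pretrained on Kinetics-400, with the head swapped out.
    model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
    model.fc = nn.Linear(model.fc.in_features, len(classes))

    # One training step on a dummy batch of clips: (batch, channels, frames, H, W).
    clips = torch.randn(2, 3, 16, 112, 112)
    labels = torch.tensor([0, 1])
    loss = nn.CrossEntropyLoss()(model(clips), labels)
    loss.backward()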
Your average mobile processor doesn't have anywhere near enough processing power to run a state-of-the-art text-to-speech network in real time. Most text-to-speech on mobile hardware is streamed from the cloud.
I had a lot of success using FastSpeech2 + MB-MelGAN via TensorFlowTTS: https://github.com/TensorSpeech/TensorFlowTTS. There are demos for iOS and Android that let you run pretty convincing, modern TTS models with only a few hundred milliseconds of processing latency.
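For anyone curious, the desktop-side inference flow looks roughly like this (adapted from memory of the project's README; the exact model ids and inference signatures are assumptions you should check against the repo):

    import tensorflow as tf
    import soundfile as sf
    from tensorflow_tts.inference import AutoProcessor, TFAutoModel

    # Pretrained model ids as published by the TensorFlowTTS project (from memory).
    processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
    fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
    mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

    input_ids = processor.text_to_sequence("Hello, this is on-device quality TTS.")

    # FastSpeech2 predicts a mel spectrogram; MB-MelGAN turns it into a waveform.
    _, mel_after, *_ = fastspeech2.inference(
        input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
        speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    )
    audio = mb_melgan.inference(mel_after)[0, :, 0]
    sf.write("hello.wav", audio.numpy(), 22050)

As far as I remember, the mobile demos run TFLite conversions of the same models, but the overall pipeline (text -> phoneme ids -> mel -> waveform) is the same.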
Not only is state-of-the-art TTS much more demanding (and much, much higher quality) than Dr. Sbaitso [0], but so are the not-quite-so-good TTS engines in both Android and iOS.
That said, having only skimmed the paper I didn’t notice a discussion of the compute requirements for inference (just training), but it did say it was a 28.7 million parameter model, so I reckon this could be used in real time on a phone (rough memory math after the footnote).
[0] judging by the videos of Dr. Sbaitso on YouTube, it was only one step up from the intro to Impossible Mission on the Commodore 64.
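Rough memory math behind that guess (illustrative precisions; activations and code add some overhead on top of the weights):

    # Back-of-envelope weight footprint for a 28.7M-parameter model.
    params = 28.7e6
    for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
        print(f"{name}: {params * bytes_per_param / 1e6:.0f} MB")
    # float32: 115 MB, float16: 57 MB, int8: 29 MB -- comfortably within phone memory.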
Ok, I get it, state-of-the-art TTS uses AI techniques and so eats processing power, buuuuut given that much older efforts which ran on devices like old PCs, the Amiga, the original Macintosh, the Kindle, etc. used far less CPU for speech you could (mostly) understand without problems, might it be worth exploring whether a better "dumb" (i.e. non-AI) speech synthesizer could be written?
Better than the ones those systems already have? I assume they’ve already got some AI, because without it “minute” and “minute” would get pronounced the same way: the word itself gives no clue to which instance is the unit of time and which is a fancy way of describing something as very small; you need the surrounding context.
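Classic TTS front-ends attack exactly this with part-of-speech driven homograph rules; a rough sketch of the idea (the tagger can still mis-tag hard cases, and the pronunciation strings are only illustrative):

    import nltk  # needs: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")

    # "minute" as a noun is the unit of time (MIN-it); as an adjective it means
    # tiny (my-NYOOT). A POS tag supplies the contextual clue the word lacks.
    def pronounce_minute(sentence: str) -> str:
        tags = nltk.pos_tag(nltk.word_tokenize(sentence))
        for word, tag in tags:
            if word.lower() == "minute":
                return "my-NYOOT" if tag.startswith("JJ") else "MIN-it"
        return ""

    print(pronounce_minute("Give me a minute to think."))          # expect MIN-it
    print(pronounce_minute("There was a minute amount of dust."))  # hopefully my-NYOOT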
[0] https://narrationbox.com