
Give us a try; I think we are what you are looking for.

https://narrationbox.com


A lot of our customers use us [0] for that, and it works well if executed properly. The voiceovers work best as inserts into an existing podcast. If you look at articles from major news orgs like the NYT, they often have a (usually machine-narrated) voiceover.

[0] https://narrationbox.com


Looks good! Do you guys have an API?


Yes, coming soon


Yeah, neural codecs are pretty amazing. The most incredible part is that they compress well across the temporal domain, something that has historically been non-trivial.
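
To put a number on the temporal compression, here's a back-of-the-envelope sketch in Python. The figures (24 kHz audio, 320-sample hop, 8 codebooks of 1024 entries) are EnCodec-style examples I'm assuming for illustration, not any particular model's spec:

    # Rough bitrate of an EnCodec-style neural codec (illustrative numbers).
    sample_rate = 24_000
    hop_size = 320                       # raw samples summarized by one latent frame
    frame_rate = sample_rate / hop_size  # 75 frames per second
    codebooks = 8
    bits_per_code = 10                   # log2(1024) entries per codebook

    bitrate = frame_rate * codebooks * bits_per_code
    print(f"{frame_rate:.0f} frames/s -> {bitrate / 1000:.1f} kbps")
    # ~6 kbps, vs 384 kbps for 16-bit mono PCM at the same sample rate

The ~60x reduction over raw PCM comes almost entirely from each latent frame summarizing hundreds of samples at once.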


Their sentence segmentation heuristics were not configured correctly. It's not an inherent limitation of the technology itself.

The newer transformer-based generators are a bit better in this regard (since they can maintain a longer context window rather than working on tiny snippets).
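
To illustrate what a misconfigured segmentation heuristic looks like in practice, here's a minimal sketch (plain Python, with a toy abbreviation list I made up for the example) comparing a naive split-on-period rule against one that respects abbreviations; the naive rule is what gives you those awkward mid-sentence pauses:

    import re

    text = "Dr. Smith arrived at 5 p.m. yesterday. He gave a short talk."

    # Naive heuristic: split on every period followed by whitespace.
    naive = re.split(r"(?<=\.)\s+", text)

    # Slightly better heuristic: don't split after known abbreviations.
    # (The abbreviation list is a toy example, not taken from any library.)
    ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "p.m.", "a.m.", "etc."}

    def segment(text):
        sentences, current = [], []
        for token in text.split():
            current.append(token)
            if token.endswith(".") and token not in ABBREVIATIONS:
                sentences.append(" ".join(current))
                current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    print(naive)          # ['Dr.', 'Smith arrived at 5 p.m.', 'yesterday.', 'He gave a short talk.']
    print(segment(text))  # ['Dr. Smith arrived at 5 p.m. yesterday.', 'He gave a short talk.']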


Mel + multispeaker vocoder is very much a classic (Tacotron-era) TTS approach.


Since it does the signal processing in the Fourier domain, does this suffer from audio artefacts, e.g. hissing in the output? Torch's inverse STFT only reconstructs a clean waveform if you still have the phase; reconstructing from magnitudes alone needs a phase-retrieval step such as Griffin-Lim, which is an iterative approximation and can leave noise in the output.

https://pytorch.org/docs/stable/generated/torch.istft.html#t...

An alternative would be to use a vocoder network (or just target a neural speech codec like SoundStream).
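
For a concrete picture of the phase-retrieval path, here's a minimal torchaudio sketch of magnitude-only reconstruction via Griffin-Lim (the file path and STFT parameters are placeholders I picked for the example); the iterative phase estimate is where the hiss tends to creep in:

    import torchaudio

    # "sample.wav" is a placeholder path.
    wav, sr = torchaudio.load("sample.wav")

    n_fft, hop = 1024, 256

    # Magnitude-only (power) spectrogram: the phase information is discarded here.
    spec = torchaudio.transforms.Spectrogram(n_fft=n_fft, hop_length=hop)(wav)

    # Griffin-Lim iteratively estimates the missing phase; more iterations
    # reduce (but never fully remove) the characteristic artifacts.
    griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop, n_iter=64)
    recon = griffin_lim(spec)

    torchaudio.save("reconstructed.wav", recon, sr)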


Not all spectral methods have such artifacts. The type of artifact you mention happens when you need to do phase retrieval or try to reconstruct waveforms from a mel spectrogram. DeepFilterNet does spectral masking on the complex spectrogram, so there is no need for phase retrieval.
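
A minimal sketch of the idea (the mask below is a crude magnitude gate standing in for what a network like DeepFilterNet would predict per time-frequency bin): because the noisy phase is carried through the complex multiplication, a plain inverse STFT is enough and no Griffin-Lim style phase retrieval is needed.

    import torch

    def denoise_complex_masking(noisy: torch.Tensor, n_fft: int = 512, hop: int = 128) -> torch.Tensor:
        window = torch.hann_window(n_fft)

        # Complex STFT: keeps both magnitude and phase.
        spec = torch.stft(noisy, n_fft, hop_length=hop, window=window, return_complex=True)

        # Crude stand-in for a learned mask: keep bins above the mean magnitude.
        mask = (spec.abs() > spec.abs().mean()).float()

        # Complex multiplication preserves the (noisy) phase of the kept bins.
        enhanced = spec * mask

        # Plain inverse STFT, no phase retrieval step needed.
        return torch.istft(enhanced, n_fft, hop_length=hop, window=window, length=noisy.shape[-1])

    # Usage: one second of fake mono audio at 16 kHz.
    out = denoise_complex_masking(torch.randn(16_000))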


> Recognizes "human" and recognizes "desk". I sit on desk. Does AI mark it as a desk or as a chair?

Not an issue if the image segmentation is advanced enough. You can train the model to recognize "human sitting". It may not generalize to other animals sitting, but human action recognition is perfectly possible right now.


Your average mobile processor doesn't have anywhere near enough processing power to run a state-of-the-art text-to-speech network in real time. Most text-to-speech on mobile hardware is streamed from the cloud.


I had a lot of success using FastSpeech2 + MB MelGAN via TensorFlowTTS: https://github.com/TensorSpeech/TensorFlowTTS. There are demos for iOS and Android which will allow you to run pretty convincing, modern TTS models with only a few hundred milliseconds of processing latency.
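
If it helps anyone get started, this is roughly what the desktop Python inference path looks like; the checkpoint identifiers and argument names are recalled from the repo's examples and may differ between versions, and the mobile demos use TFLite-converted versions of the same models:

    import tensorflow as tf
    from tensorflow_tts.inference import AutoProcessor, TFAutoModel

    # Pretrained LJSpeech checkpoints from the TensorFlowTTS model hub
    # (identifiers from memory of the README; check for current names).
    processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
    fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
    mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

    input_ids = processor.text_to_sequence("Hello from an on-device TTS pipeline.")

    # FastSpeech2: text -> mel spectrogram (non-autoregressive, hence fast).
    _, mel_after, _, _, _ = fastspeech2.inference(
        input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
        speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    )

    # Multi-band MelGAN vocoder: mel spectrogram -> waveform.
    audio = mb_melgan.inference(mel_after)[0, :, 0]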


Dr. Sbaitso ran on a modest 386. Mobile device processors generally eclipse that and could definitely generate better quality TTS.


Not only is state-of-the-art TTS much more demanding (and much, much higher quality) than Dr. Sbaitso [0], but so are the not-quite-so-good TTS engines in both Android and iOS.

That said, having only skimmed the paper, I didn't notice a discussion of the compute requirements for usage (just training), but it did say it was a 28.7 million parameter model, so I reckon this could be used in real time on a phone.

[0] judging by the videos of Dr. Sbaitso on YouTube, it was only one step up from the intro to Impossible Mission on the Commodore 64.
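
As a rough sanity check (the precision options below are my assumptions, not something the paper specifies), 28.7M parameters is a modest weight footprint even before any compression:

    params = 28_700_000

    # Approximate weight storage at different precisions (weights only;
    # activations, buffers and framework overhead come on top of this).
    for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
        print(f"{name}: {params * bytes_per_param / 1e6:.0f} MB")
    # fp32: 115 MB, fp16: 57 MB, int8: 29 MB, all well within phone territory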


Ok, I get it: state-of-the-art TTS uses AI techniques and so eats processing power. Buuuuut, seeing that much older efforts which ran on devices like old PCs, the Amiga, the original Macintosh, the Kindle, etc. used far less CPU for speech you could (mostly) understand without problems, it may be worth exploring whether it's possible to write a better "dumb" (i.e. non-AI) speech synthesizer.


Better than the ones those systems already have? I assume they've already got some AI, because without it, "minute" and "minute" get pronounced the same way: there's no contextual clue to which instance is the unit of time and which is a fancy way of describing something as very small.


I'm still hoping that a human being can tell which of the four possible ways to pronounce the name of the English post-punk band "The The" is the right one.

https://en.wikipedia.org/wiki/The_The

https://www.youtube.com/watch?v=orIy18qIaCU


I have a soft spot for the Yorkshire pronunciation: https://www.youtube.com/watch?v=lzymb0YJp7E&t=160s


The parent didn't mention real-time as a requirement. Offline rendering would suffice just fine.


28.7 million parameters is nothing for inference.


Often you can prune parameters as well. You might be able to cut that down by a factor of 10 without any noticeable loss in accuracy.
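
For the curious, a minimal sketch with PyTorch's built-in pruning utilities; the layer size and the 90% figure are just illustrative, and real pruning schedules are usually gradual and followed by fine-tuning:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    layer = nn.Linear(512, 512)  # stand-in for one layer of a TTS model

    # Magnitude (L1) pruning: zero out the 90% of weights with the smallest magnitude.
    prune.l1_unstructured(layer, name="weight", amount=0.9)

    # Make the pruning permanent (removes the mask and the original weight copy).
    prune.remove(layer, "weight")

    sparsity = (layer.weight == 0).float().mean().item()
    print(f"sparsity: {sparsity:.0%}")  # ~90%

Note that unstructured sparsity only helps latency if the runtime actually exploits it; on mobile, structured pruning or int8 quantization is often the more practical win.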


For the high-end stuff, no, but many of the lower-tier jobs are under threat.


We used to be in this field too (https://kloudtrader.com/narwhal). It is a very crowded market and monetisation is tricky.

