A lot of our customers use us [0] for exactly that, and it works pretty well if executed properly. The voiceovers work best as inserts into an existing podcast. If you look at articles from major news orgs like the NYT, they often have a (usually machine-narrated) voiceover.
Yeah, neural codecs are pretty amazing. The most incredible part is that they compress well across the temporal domain, something that has historically been non-trivial.
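To make the temporal part concrete, here is a toy sketch (plain PyTorch, not any particular codec) of the usual shape: a strided 1-D conv encoder shrinks the time axis by ~64x and a transposed-conv decoder expands it back; real codecs add residual vector quantization and adversarial losses on top.

    import torch
    import torch.nn as nn

    # Toy illustration of temporal compression in a neural audio codec:
    # the encoder strides over time, so the latent sequence is ~64x shorter
    # than the waveform. Everything here is illustrative, not a real codec.
    class TinyCodec(nn.Module):
        def __init__(self, latent_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=8, stride=4, padding=2), nn.ELU(),
                nn.Conv1d(32, 64, kernel_size=8, stride=4, padding=2), nn.ELU(),
                nn.Conv1d(64, latent_dim, kernel_size=8, stride=4, padding=2),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose1d(latent_dim, 64, kernel_size=8, stride=4, padding=2), nn.ELU(),
                nn.ConvTranspose1d(64, 32, kernel_size=8, stride=4, padding=2), nn.ELU(),
                nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
            )

        def forward(self, wav):
            z = self.encoder(wav)              # (batch, latent_dim, time / 64)
            return self.decoder(z), z

    codec = TinyCodec()
    wav = torch.randn(1, 1, 16000)             # 1 s of dummy 16 kHz audio
    recon, z = codec(wav)
    print(wav.shape[-1], "->", z.shape[-1])    # 16000 -> 250 latent frames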
Their sentence segmentation heuristics were not configured correctly. It's not an inherent limitation of the technology itself.
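To illustrate the kind of heuristic failure that causes those weird pauses, here's a toy example (not their actual pipeline): a bare split-on-periods rule chops abbreviations into bogus "sentences", while an abbreviation-aware rule does not.

    text = "Dr. Smith arrived at 5 p.m. on Friday. The meeting started late."

    # Naive heuristic: treat every ". " as a sentence boundary. Abbreviations
    # become bogus "sentences", which a TTS engine then reads with odd pauses.
    naive = [s.strip() for s in text.split(". ") if s.strip()]
    print(naive)
    # ['Dr', 'Smith arrived at 5 p.m', 'on Friday', 'The meeting started late.']

    # Abbreviation-aware heuristic: protect known abbreviations before
    # splitting, then restore them. Crude, but it shows the failure is in the
    # segmentation rules, not in the TTS model itself.
    ABBREV = ["Dr.", "Mrs.", "Mr.", "p.m.", "a.m."]
    protected = text
    for abbr in ABBREV:
        protected = protected.replace(abbr, abbr.replace(".", "<DOT>"))
    sentences = [s.replace("<DOT>", ".").strip() for s in protected.split(". ")]
    print(sentences)
    # ['Dr. Smith arrived at 5 p.m. on Friday', 'The meeting started late.']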
The newer transformer-based generators are a bit better in this regard, since they can maintain a longer context window instead of working on tiny snippets at a time.
Since it does the signal processing in the Fourier domain, does this suffer from audio artefacts, e.g. hissing, in the output? Reconstructing a waveform from a magnitude-only spectrogram (e.g. with Griffin-Lim, as in torchaudio) means estimating the missing phase iteratively, and you can sometimes end up with noise in the output.
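For reference, this is roughly the phase-retrieval path I'm asking about, as a toy torchaudio snippet (random noise stands in for real audio):

    import torch
    import torchaudio

    # Keep only the magnitude spectrogram, then estimate the discarded phase
    # iteratively with Griffin-Lim; imperfect estimates show up as hiss/noise.
    n_fft, hop = 512, 128
    wav = torch.randn(1, 16000)  # placeholder 1 s of 16 kHz audio

    to_mag = torchaudio.transforms.Spectrogram(n_fft=n_fft, hop_length=hop, power=1.0)
    griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop,
                                                   power=1.0, n_iter=32)

    mag = to_mag(wav)          # phase is discarded here
    recon = griffin_lim(mag)   # phase must be guessed iteratively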
Not all spectral methods have such artifacts. The kind of artifacts you mention show up when you need to do phase retrieval, i.e. reconstruct a waveform from a magnitude or mel spectrogram. DeepFilterNet does spectral masking on the complex spectrogram, so there is no need for phase retrieval.
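A minimal sketch of why that sidesteps the problem: with complex-domain masking the phase is never thrown away, so a plain inverse STFT suffices (the all-ones mask is a placeholder; DeepFilterNet's actual two-stage filtering is more involved).

    import torch

    x = torch.randn(16000)                      # 1 s of dummy 16 kHz audio
    n_fft, hop = 512, 128
    window = torch.hann_window(n_fft)

    # Complex STFT keeps both magnitude and phase.
    spec = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)

    mask = torch.ones_like(spec.real)           # stand-in for a predicted mask in [0, 1]
    enhanced = spec * mask                      # masking in the complex domain

    # Phase was preserved, so the inverse STFT reconstructs the waveform
    # directly -- no Griffin-Lim / phase retrieval step needed.
    y = torch.istft(enhanced, n_fft, hop_length=hop, window=window, length=x.numel())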
> Recognizes "human" and recognizes "desk". I sit on desk. Does AI mark it as a desk or as a chair?
Not an issue if the image segmentation is advanced enough. You can train the model to understand "human sitting". It may not generalize to other animals sitting, but human action recognition is perfectly possible right now.
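As a hedged sketch of what "train the model to understand 'human sitting'" can look like in practice (hypothetical labels and dummy data, not any specific product): start from a video classifier pretrained on an action-recognition dataset and fine-tune its head.

    import torch
    import torch.nn as nn
    from torchvision.models.video import r3d_18, R3D_18_Weights

    # Hypothetical label set; "person_sitting" is the action we care about.
    classes = ["person_sitting", "person_standing", "person_walking"]

    # Clip classifier pretrained on Kinetics-400, with the head swapped out.
    model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
    model.fc = nn.Linear(model.fc.in_features, len(classes))

    # One training step on a dummy batch of clips: (batch, channels, frames, H, W).
    clips = torch.randn(2, 3, 16, 112, 112)
    labels = torch.tensor([0, 1])
    loss = nn.CrossEntropyLoss()(model(clips), labels)
    loss.backward()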
Your average mobile processor doesn't have anywhere near enough processing power to run a state-of-the-art text-to-speech network in real time. Most text-to-speech on mobile hardware is streamed from the cloud.
I had a lot of success using FastSpeech2 + MB-MelGAN via TensorFlowTTS: https://github.com/TensorSpeech/TensorFlowTTS. There are demos for iOS and Android that let you run pretty convincing, modern TTS models with only a few hundred milliseconds of processing latency.
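For anyone curious, the desktop-side inference flow looks roughly like this (adapted from memory of the project's README; the exact model ids and inference signatures are assumptions you should check against the repo):

    import tensorflow as tf
    import soundfile as sf
    from tensorflow_tts.inference import AutoProcessor, TFAutoModel

    # Pretrained model ids as published by the TensorFlowTTS project (from memory).
    processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
    fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
    mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

    input_ids = processor.text_to_sequence("Hello, this is on-device quality TTS.")

    # FastSpeech2 predicts a mel spectrogram; MB-MelGAN turns it into a waveform.
    _, mel_after, *_ = fastspeech2.inference(
        input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
        speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
        speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
        energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    )
    audio = mb_melgan.inference(mel_after)[0, :, 0]
    sf.write("hello.wav", audio.numpy(), 22050)

As far as I remember, the mobile demos run TFLite conversions of the same models, but the overall pipeline (text -> phoneme ids -> mel -> waveform) is the same.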
Not only is state-of-the-art TTS much more demanding (and much, much higher quality) than Dr. Sbaitso [0], but so are the not-quite-so-good TTS engines in both Android and iOS.
That said, having only skimmed the paper I didn’t notice a discussion of the compute requirements for inference (just training), but it did say it was a 28.7 million parameter model, so I reckon this could be used in real time on a phone (rough memory math after the footnote).
[0] judging by the videos of Dr. Sbaitso on YouTube, it was only one step up from the intro to Impossible Mission on the Commodore 64.
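Rough memory math behind that guess (illustrative precisions; activations and code add some overhead on top of the weights):

    # Back-of-envelope weight footprint for a 28.7M-parameter model.
    params = 28.7e6
    for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
        print(f"{name}: {params * bytes_per_param / 1e6:.0f} MB")
    # float32: 115 MB, float16: 57 MB, int8: 29 MB -- comfortably within phone memory.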
Ok, I get it, state-of-the-art TTS uses AI techniques and so eats processing power, buuuuut given that much older efforts which ran on devices like old PCs, the Amiga, the original Macintosh, the Kindle, etc. used far less CPU for speech you could (mostly) understand without problems, might it be worth exploring whether a better "dumb" (i.e. non-AI) speech synthesizer could be written?
Better than the ones those systems already have? I assume they’ve already got some AI, because without it “minute” and “minute” would get pronounced the same way: the word itself gives no clue to which instance is the unit of time and which is a fancy way of describing something as very small; you need the surrounding context.
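Classic TTS front-ends attack exactly this with part-of-speech driven homograph rules; a rough sketch of the idea (the tagger can still mis-tag hard cases, and the pronunciation strings are only illustrative):

    import nltk  # needs: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")

    # "minute" as a noun is the unit of time (MIN-it); as an adjective it means
    # tiny (my-NYOOT). A POS tag supplies the contextual clue the word lacks.
    def pronounce_minute(sentence: str) -> str:
        tags = nltk.pos_tag(nltk.word_tokenize(sentence))
        for word, tag in tags:
            if word.lower() == "minute":
                return "my-NYOOT" if tag.startswith("JJ") else "MIN-it"
        return ""

    print(pronounce_minute("Give me a minute to think."))          # expect MIN-it
    print(pronounce_minute("There was a minute amount of dust."))  # hopefully my-NYOOT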
[0] https://narrationbox.com