
This is very impressive work. Could this model be capable of doing streamed input for live speech recognition?


No. The best you can do with this architecture is something like NVIDIA's NeMo approach: group the input audio stream into 2 s blocks with 1 s of overlap and run the speech recognition on each block.
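The buffering scheme described above can be sketched in a few lines. This is only an illustration of the windowing, not NeMo's actual implementation; `blocks` is a hypothetical helper, and the recognizer that would consume each window is left out.

```python
SAMPLE_RATE = 16_000
BLOCK = 2 * SAMPLE_RATE   # 2 s window
STEP = 1 * SAMPLE_RATE    # 1 s hop, so consecutive windows overlap by 1 s

def blocks(stream):
    """Yield overlapping 2 s windows from an iterable of audio samples."""
    buf = []
    for sample in stream:
        buf.append(sample)
        if len(buf) == BLOCK:
            yield list(buf)
            del buf[:STEP]  # slide forward by 1 s, keeping 1 s of overlap

# Dummy stream of 4 s of silence; each window would be fed to the recognizer,
# and the overlapping halves merged to smooth out boundary errors.
windows = list(blocks(iter([0.0] * (4 * SAMPLE_RATE))))
print(len(windows))  # 3 windows (starting at 0 s, 1 s, 2 s)
```

The 1 s hop is why this approach can never get below roughly a second of latency: a word is only fully recognized once the window containing it has closed.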

If you want real-time speech recognition with less than 0.5 s of delay between a word being spoken and it being fully recognized, you need a different architecture. And that one is much more difficult and expensive to train than this one (which was already expensive).

That said, I want a fully offline and privacy-respecting voice assistant myself.

So attempting to build the AI for real-time streamed live English speech recognition will be my next project. I plan to ship it as an OpenGL-accelerated binary with a WebRTC server, so that others can easily combine my recognition with their own logic. But it probably won't be free, since I'm looking at >$100k in compute costs to build it. In any case, here's the waiting list: https://madmimi.com/signups/f0da3b13840d40ce9e061cafea6280d5...


>> But it probably won't be free since I'm looking at >$100k in compute costs to build it.

How about crowd funding it? Your previous work should be enough to convince people it's worth contributing to.


Yeah, I'm looking into government programs such as the EU "Prototype Fund", too. But the issue with crowdfunding is that if I want to raise $100k on Kickstarter, I'd need to spend $10k on an agency and another $20k on ads to promote the campaign. That's quite wasteful (30% just for marketing) unless you already have a large audience willing to pay, which I don't.

So I believe my best bet might be to partner with a larger company that will pay for development, and/or to charge users for a license. Nuance's Dragon Home is $200 and their Pro version is $500, so there's a lot of room for me to be cheaper while still reaching $100k in revenue with a realistic number of users.


You might be eligible to use Google's TPU Research Cloud for free, provided you publicize your results? https://sites.research.google/trc/about/

Otherwise, perhaps you could ask LAION on #compute-allocation?


Thanks for those excellent actionable suggestions :)

Comments like these are why "Show HN" can be so rewarding (despite all the pedantry about the submission title which I can't change anymore anyway).


Thanks for the information! I've been looking to build an accessibility tool for a deaf community so that they can see live captions of conversations. Some of the existing solutions I've tried lag behind in conversational speech accuracy, or are difficult or impossible to fine-tune with community-specific words and phrases.


This type of "Digital Therapeutics" might be paid for by German health insurance companies. For example, https://gaia-group.com/en/ appears to be a successful provider of medical apps.

If you don't mind, please email me at moin@deutscheki.de and explain in a bit more detail what the needs of that deaf community are. Maybe I can forward that to the right people to get my government to pay for the app that you wish to see developed.

I agree with you: it certainly would increase life satisfaction for deaf people if they could "listen in" on conversations and know what others are gossiping about.



