Most engineers in the AI space focus on the algorithm or the model. In doing so, they overlook the most essential and time-consuming part: making your project practical and accessible to your end user.
This post looks at two real-life use cases of how to build AI projects focused on the end-user.
Exactly my thought, I was like "Jarvis has got to be just a 2030 version of an LLM".
Yeah, I actually considered making a spotter AI using computer vision in a game like ARMA 3 or Squad, but it's kind of difficult. I made a spotter for ground vehicles on aerial imagery using YOLOv5 here:
https://github.com/AlexandreSajus/Military-Vehicles-Image-Re...
There's a French defense company, Preligens, that actually does this currently
Actually, there are a few LLM wrappers around that use the OpenAI API spec (LocalAI is a good one)... so you could just expose a configurable OpenAI endpoint URI, and technically users could swap in any model.
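A minimal sketch of what that looks like: the OpenAI API spec fixes the URL paths, so pointing a client at any compatible server is just a matter of swapping the base URL. The helper function and the `OPENAI_BASE_URL` environment variable name below are illustrative assumptions, not from the comment above.

```python
import os

# Hypothetical helper: resolve the chat-completions URL from a configurable
# base, so a local OpenAI-compatible server (e.g. LocalAI) can stand in for
# api.openai.com without any other code changes.
def chat_completions_url(base_url=None):
    base = (base_url
            or os.environ.get("OPENAI_BASE_URL")
            or "https://api.openai.com/v1").rstrip("/")
    return f"{base}/chat/completions"

# Same client code, different backend:
print(chat_completions_url("http://localhost:8080/v1"))
```

Any client built this way works unchanged against whichever backend the user configures.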
We do live in-studio briefings 3x/wk. These are both in-person and live-broadcast. The first thing we did was add an AI Co-Briefer who sits on the panel. The LLM latency makes it a bit hard, but it was a good experiment. Deepgram worked brilliantly for transcription across the entire studio, even for guest participants without microphones.
That live broadcast created a lot of buzz, and numerous other use cases have popped up across the company. I'm working on a tech blog showcase for next week and hope to show it off on HN!
The way these things usually (but not always) work is that they'll send you a cease-and-desist letter if they intend to bother you. Change the name at that point and you're usually good.
well, Deepgram might be the fastest among cloud-dependent APIs like Speechmatics and AssemblyAI mentioned above, but it cannot be faster than local or smaller models, as you mentioned.
Among local solutions,
The Whisper SDK doesn't support streaming; I haven't seen any good workarounds or managed to implement one myself.
VOSK, DeepSpeech, Kaldi, et al were good once upon a time...
Picovoice seems to be doing well.
unless i'm misunderstanding, `whisper.cpp` seems to support streaming, and the repository includes a native example[0] and a WASM example[1] with a demo site[2].
have you tried it?
i mean, for fun it wouldn't hurt for sure, and ggerganov is doing amazing stuff. kudos to him.
but whisper is designed to process audio in 30-second windows, if I'm not mistaken. it's been a while since whisper was released, lol. These workarounds make the window smaller, but that doesn't change the fact that they're workarounds. you can adjust, modify, or manipulate the model, but you can't rewrite or retrain it from scratch. check out the issues about real-time transcription in the repo.
can you use it? yes
would it perform better than Deepgram (which is an API, and probably not the best one)? I'm not sure.
would i use it in my money-generating application? absolutely not.
Thanks! There are ways to shave off the latency: hosting locally, using quantized/smaller models, and streaming data instead of running the tasks sequentially.
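The streaming point can be sketched with plain generators: instead of waiting for a full transcript before calling the LLM, each stage consumes its upstream output incrementally, so the stages overlap in time. The stage functions below are hypothetical stand-ins, not real STT/LLM calls.

```python
# Hypothetical pipeline sketch: each stage yields partial results so the
# next stage can start before the previous one finishes.
def transcribe(audio_chunks):
    # stand-in for a streaming STT engine: yields a word per audio chunk
    for chunk in audio_chunks:
        yield chunk.upper()  # pretend "transcription"

def respond(words):
    # stand-in for an LLM consuming the partial transcript as it arrives
    for w in words:
        yield f"echo:{w}"

audio = ["hello", "world"]
for reply in respond(transcribe(audio)):
    print(reply)
```

With real engines, the same shape means the LLM starts generating while audio is still coming in, which is where most of the perceived latency savings come from.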
Yes, I was searching for realtime STT and got a hit on GitHub, then looked at his other projects and found he builds on his STT and TTS projects; it's just 2-second latency for local voice chat, which is very good.
Do you happen to know anything about any open source voice identification software?
I’ve noticed with ChatGPT voice and every other voice-driven assistant that background voices and noise are a massive problem. One solution could be advanced pre-processing to identify your voice only.
Another idea I’ve had is using something professional with PTT:
Yeah github.com/KoljaB is quite a collection of stuff! I agree.
Your vision of JARVIS, which I share completely (though I haven't accomplished what you have), seems very attainable. Again, excellent work and thank you for sharing. Combining your work with KoljaB's looks very promising.
Hey guys! I work at Taipy, a Python library designed for creating web applications using only Python. Some users had problems displaying charts based on big data, e.g., line charts with 100,000 points. We worked on a feature that reduces the number of displayed points while retaining the shape of the curve as much as possible, and we wanted to share how we did it.
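One common shape-preserving downsampling technique (not necessarily the one Taipy implements, just a sketch of the general idea) is min-max decimation: split the series into buckets and keep each bucket's minimum and maximum point, so peaks and valleys survive the reduction. The function name and bucket scheme below are illustrative.

```python
# Min-max decimation sketch: keep each bucket's extreme points so the
# rendered curve keeps its peaks and valleys with far fewer points.
def minmax_downsample(ys, n_buckets):
    size = max(1, len(ys) // n_buckets)
    kept = []
    for start in range(0, len(ys), size):
        bucket = list(enumerate(ys[start:start + size], start))
        lo = min(bucket, key=lambda p: p[1])  # (index, value) of the minimum
        hi = max(bucket, key=lambda p: p[1])  # (index, value) of the maximum
        # keep the extremes (deduplicated) in original x order
        kept.extend(sorted({lo, hi}))
    return kept

ys = [0, 5, 1, 9, 2, 8, 3, 7, 4, 6]
print(minmax_downsample(ys, 2))
```

For 100,000 points rendered into a chart a few hundred pixels wide, this kind of reduction cuts the payload by orders of magnitude while the plotted line looks essentially identical.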