Hacker News | Alyx1337's comments

Most engineers in the AI space focus on the algorithm or the model. In doing so, they forget the most essential and time-consuming part: making your project practical and accessible to your end user. This post looks at two real-life examples of building AI projects focused on the end user.


Exactly my thought, I was like "Jarvis has got to be just a 2030 version of an LLM".

Yeah, I actually considered making a spotter AI using computer vision in a game like ARMA 3 or Squad, but it's kind of difficult. I did make a spotter for ground vehicles in aerial imagery using YOLOv5 here: https://github.com/AlexandreSajus/Military-Vehicles-Image-Re...

There's a French defense company, Preligens, that actually does this currently


I imagine that within the next couple of years there's going to be a "general purpose vision" model (GPV? :)

More of a framework to perform the general purpose task of "recognize things in 30 (60? 120?) frames per second video and act on events in the video"


That was exactly my thought haha, I want Jarvis at home. You could easily modify my code to run a local LLM instead


Actually there are a few LLM wrappers around that use the openai API spec (localai is a good one)... so you could just allow a configurable openai endpoint URI and technically users can swap in any model.
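
The swap-in-any-model idea boils down to making the endpoint configurable. A minimal sketch of that, assuming the common convention of an `OPENAI_BASE_URL` environment variable (the variable name and fallback behavior here are illustrative, not from the project):

```python
import os

def resolve_base_url(env: dict) -> str:
    """Pick the chat-completions endpoint: a user-configured
    OpenAI-compatible server (e.g. a LocalAI instance) if
    OPENAI_BASE_URL is set, otherwise the OpenAI default."""
    return env.get("OPENAI_BASE_URL", "https://api.openai.com/v1")

# The resolved URL would then be passed to the client, roughly:
#   client = OpenAI(base_url=resolve_base_url(os.environ), api_key=...)
# after which any model served behind that URL can be used unchanged.
```

Since local servers that speak the OpenAI API spec accept the same request shape, nothing else in the calling code needs to change.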


Great! What do you guys have in mind in terms of products using these tools? Yeah, unfortunately it's hard to shave off latency.


We do live in-studio briefings three times a week, both in-person and live-broadcast. The first thing we did was add an AI co-briefer who sits on the panel. The LLM latency makes it a bit hard, but it was a good experiment. Deepgram worked brilliantly for transcription across the entire studio, even for guest participants without microphones.

That live broadcast created a lot of buzz, and numerous other use cases have popped up across the company. I'm working on a tech blog showcase next week and hope to show it off on HN!


Yeah, I had the same issue, so I used (stole) this answer on StackOverflow: https://stackoverflow.com/questions/46734345/python-record-o... Basically, there's a library that records until it detects silence.
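
The underlying idea is simple to sketch without any library: keep accumulating audio chunks until enough consecutive chunks fall below an energy threshold. The threshold and chunk-count values below are illustrative, not tuned values from the linked answer:

```python
def record_until_silence(chunks, threshold=0.01, max_silent_chunks=10):
    """Accumulate audio chunks (iterables of float samples) until
    `max_silent_chunks` consecutive chunks have RMS energy below
    `threshold`, i.e. until a stretch of silence is detected."""
    recorded = []
    silent = 0
    for chunk in chunks:
        recorded.append(chunk)
        # Root-mean-square energy of the chunk.
        rms = (sum(x * x for x in chunk) / len(chunk)) ** 0.5
        if rms < threshold:
            silent += 1
            if silent >= max_silent_chunks:
                break
        else:
            silent = 0  # speech resets the silence counter
    # Flatten the recorded chunks into one sample list.
    return [sample for chunk in recorded for sample in chunk]
```

In practice the chunks would come from a microphone stream (e.g. PyAudio), but the stopping logic is the same.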


Uh oh I hope I'm not in trouble


The way these things usually (but not always) work is they'll send you a cease-and-desist letter if they intend to bother you. Change the name at that point and you're usually fine.


Deepgram advertises itself as the fastest, and since I wanted to focus on minimizing response delay, I chose it. I hope I wasn't misled.


well, deepgram might be the fastest among cloud-dependent APIs, like the Speechmatics and AssemblyAI mentioned above, but it can't be faster than local or smaller models, as you said.

Among local solutions, the Whisper SDK doesn't support streaming; I haven't seen any good workarounds or managed to implement one myself. VOSK, DeepSpeech, Kaldi, et al. were good once upon a time... Picovoice seems to be doing well.

I was planning to work on this: https://picovoice.ai/blog/chatgpt-ai-virtual-assistant-in-py... using Eleven Labs and Cheetah. Hope I can carve out some time.


unless i'm misunderstanding `whisper.cpp` seems to support streaming & the repository includes a native example[0] and a WASM example[1] with a demo site[2].

[0]: https://github.com/ggerganov/whisper.cpp/tree/master/example...

[1]: https://github.com/ggerganov/whisper.cpp/blob/master/example...

[2]: https://whisper.ggerganov.com/stream/


have you tried it? i mean for fun, it wouldn't hurt for sure and ggerganov is doing amazing stuff. kudos to him.

but whisper is designed to process audio in 30-second windows, if I'm not mistaken. it's been a while since whisper released, lol. these workarounds make the window smaller, but it doesn't change the fact that they're workarounds: you can adjust, modify, or manipulate the model, but you can't rewrite or retrain it from scratch. check out the issues about real-time transcription in the repo.

can you use it? yes. would it perform better than Deepgram (although it's an API, and probably not the best one)? I'm not sure. would i use it in a money-generating application? absolutely not.


Wonderful hack. The overall response latency is the only thing that hurts the UX; if you can get the response time down, it would be epic. Nice work.


Thanks! There are ways to shave off the latency: hosting locally, using quantized/smaller models, and streaming data instead of running the tasks sequentially.
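
The streaming idea can be sketched as: consume the LLM's tokens as they arrive and flush each complete sentence to the TTS engine immediately, instead of waiting for the full response. Here `tokens` stands in for the incremental string chunks a streaming API yields (an assumption; any iterable of strings works):

```python
def stream_sentences(tokens):
    """Group a stream of text tokens into sentences, yielding each
    sentence as soon as it is complete so downstream TTS can start
    speaking while the LLM is still generating."""
    buffer = ""
    for token in tokens:
        buffer += token
        while any(p in buffer for p in ".!?"):
            # Flush up to and including the first sentence terminator.
            cut = min(buffer.index(p) for p in ".!?" if p in buffer) + 1
            sentence = buffer[:cut].strip()
            if sentence:
                yield sentence
            buffer = buffer[cut:]
    if buffer.strip():
        yield buffer.strip()  # trailing text without a terminator
```

With this, perceived latency drops to roughly time-to-first-sentence rather than time-to-full-response.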


How did you find these? I was literally looking for tutorials all day long and couldn't find anything. These projects look insane!


Yes, I was searching for realtime STT and got a hit on GitHub, then looked at his other projects and found that he builds on his own STT and TTS projects. It's just 2-second latency on local voice chat, which is very good.


Here is a video demo of the project: https://youtu.be/aIg4-eL9ATc?si=66ynl4Mlci9v76rU


Nice work! Very impressed.

Do you happen to know anything about any open source voice identification software?

I’ve noticed with ChatGPT voice and other voice-driven assistants that background voices and noise are a massive problem. One solution could be advanced pre-processing to ID your voice only.

Another idea I’ve had is using something professional with PTT:

https://sheepdogmics.com/products/quick-disconnect-mic-tubel...
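
On the voice-ID idea: open-source speaker-verification tools (Resemblyzer is one example) turn an utterance into a fixed-size speaker embedding, and "is this my voice?" then reduces to comparing embeddings against an enrolled reference. A minimal sketch of that matching step, assuming embeddings come from such a tool; the 0.75 threshold is purely illustrative:

```python
def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

def is_target_speaker(embedding, reference, threshold=0.75):
    """Accept an audio segment only if its speaker embedding is close
    enough to the enrolled reference embedding."""
    return cosine_similarity(embedding, reference) >= threshold
```

The assistant would embed each incoming segment and drop anything that fails this check, filtering out background speakers before transcription.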



Google Gemini was trained on audio and can generate audio directly. Whatever you build now will be replaced by a much better version soon.


Thanks! I don't know a lot about this but someone shared this local voice assistant in the comments: https://github.com/KoljaB/LocalAIVoiceChat Could be a good lead


Yeah github.com/KoljaB is quite a collection of stuff! I agree.

Your vision of JARVIS, which I share completely (though I haven't accomplished what you have), seems very attainable. Combining your work with KoljaB's looks very promising. Again, excellent work, and thank you for sharing.


Thank you very much!


Hey guys! I work on Taipy, a Python library designed to create web applications using only Python. Some users had problems displaying charts based on big data, e.g., line charts with 100,000 points. We worked on a feature that reduces the number of displayed points while retaining the shape of the curve as much as possible, and wanted to share how we did it.
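
A well-known technique for this kind of shape-preserving decimation is Largest-Triangle-Three-Buckets (LTTB); whether Taipy uses exactly this algorithm is an assumption here, but it illustrates the idea: keep the points that contribute most to the visual shape of the line.

```python
def lttb(points, n_out):
    """Downsample a list of (x, y) points to n_out points using
    Largest-Triangle-Three-Buckets: split the interior into buckets
    and keep, from each bucket, the point forming the largest triangle
    with the previously kept point and the next bucket's average."""
    if n_out >= len(points) or n_out < 3:
        return list(points)
    sampled = [points[0]]                      # always keep the first point
    bucket_size = (len(points) - 2) / (n_out - 2)
    a = 0                                      # index of last selected point
    for i in range(n_out - 2):
        start = int(i * bucket_size) + 1
        end = int((i + 1) * bucket_size) + 1
        # Average of the *next* bucket, the triangle's third vertex.
        nxt = points[end:int((i + 2) * bucket_size) + 1] or [points[-1]]
        avg_x = sum(p[0] for p in nxt) / len(nxt)
        avg_y = sum(p[1] for p in nxt) / len(nxt)
        ax, ay = points[a]
        best, best_area = start, -1.0
        for j in range(start, end):
            x, y = points[j]
            # Twice the triangle area (the constant factor is irrelevant).
            area = abs((ax - avg_x) * (y - ay) - (ax - x) * (avg_y - ay))
            if area > best_area:
                best, best_area = j, area
        sampled.append(points[best])
        a = best
    sampled.append(points[-1])                 # always keep the last point
    return sampled
```

A 100,000-point line chart downsampled to a few hundred points this way keeps its peaks and turns, since flat stretches contribute tiny triangles and get dropped first.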

