Most engineers in the AI space focus on the algorithm or the model. In doing so, they overlook the most essential and time-consuming part: making your project practical and accessible to your end user.
This post looks at two real-life use cases of how to build AI projects focused on the end-user.
Exactly my thought, I was like "Jarvis has got to be just a 2030 version of an LLM".
Yeah, I actually considered making a spotter AI using computer vision in a game like ARMA 3 or Squad, but it's kind of difficult. I made a spotter for ground vehicles on aerial imagery using YOLOv5 here:
https://github.com/AlexandreSajus/Military-Vehicles-Image-Re...
There's a French defense company, Preligens, that actually does this currently
Actually, there are a few LLM wrappers around that use the OpenAI API spec (LocalAI is a good one)... so you could just expose a configurable OpenAI endpoint URI, and technically users could swap in any model.
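A minimal sketch of what that looks like: the OpenAI API spec fixes the URL paths, so pointing a client at any compatible server is just a matter of swapping the base URL. The helper function and the `OPENAI_BASE_URL` environment variable name below are illustrative assumptions, not from the comment above.

```python
import os

# Hypothetical helper: resolve the chat-completions URL from a configurable
# base, so a local OpenAI-compatible server (e.g. LocalAI) can stand in for
# api.openai.com without any other code changes.
def chat_completions_url(base_url=None):
    base = (base_url
            or os.environ.get("OPENAI_BASE_URL")
            or "https://api.openai.com/v1").rstrip("/")
    return f"{base}/chat/completions"

# Same client code, different backend:
print(chat_completions_url("http://localhost:8080/v1"))
```

Any client built this way works unchanged against whichever backend the user configures.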
We do live in-studio briefings 3x/wk. These are both in-person and live-broadcast. The first thing we did was add an AI Co-Briefer who sits on the panel. The LLM latency makes it a bit hard, but it was a good experiment. Deepgram worked brilliantly for transcription across the entire studio, even for guest participants without microphones.
That live broadcast created a lot of buzz, and numerous other use cases have popped up across the company. I'm working on a tech blog showcase for next week and hope to show it off on HN!
The way these things usually (but not always) work is that they'll send you a cease-and-desist letter if they intend to bother you. Change the name at that point and you're usually good.
well, Deepgram might be the fastest among cloud-dependent APIs like Speechmatics and AssemblyAI mentioned above, but it cannot be faster than local or smaller models, as you mentioned.
Among local solutions,
The Whisper SDK doesn't support streaming; I haven't seen any good workarounds or managed to implement one myself.
VOSK, DeepSpeech, Kaldi, et al were good once upon a time...
Picovoice seems to be doing well.
unless i'm misunderstanding, `whisper.cpp` seems to support streaming, and the repository includes a native example[0] and a WASM example[1] with a demo site[2].
have you tried it?
i mean, for fun it wouldn't hurt for sure, and ggerganov is doing amazing stuff. kudos to him.
but whisper is designed to process audio in 30-second windows, if I'm not mistaken. it's been a while since whisper was released, lol. These workarounds make the window smaller, but that doesn't change the fact that they're workarounds. you can adjust, modify, or manipulate the model, but you can't rewrite or retrain it from scratch. check out the issues about real-time transcription in the repo.
can you use it? yes
would it perform better than Deepgram (which is an API, and probably not the best one)? I'm not sure.
would i use it in my money-generating application? absolutely not.
Thanks! There are ways to shave off the latency: hosting locally, using quantized/smaller models, and streaming data instead of running the tasks sequentially.
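The streaming point can be sketched with plain generators: instead of waiting for a full transcript before calling the LLM, each stage consumes its upstream output incrementally, so the stages overlap in time. The stage functions below are hypothetical stand-ins, not real STT/LLM calls.

```python
# Hypothetical pipeline sketch: each stage yields partial results so the
# next stage can start before the previous one finishes.
def transcribe(audio_chunks):
    # stand-in for a streaming STT engine: yields a word per audio chunk
    for chunk in audio_chunks:
        yield chunk.upper()  # pretend "transcription"

def respond(words):
    # stand-in for an LLM consuming the partial transcript as it arrives
    for w in words:
        yield f"echo:{w}"

audio = ["hello", "world"]
for reply in respond(transcribe(audio)):
    print(reply)
```

With real engines, the same shape means the LLM starts generating while audio is still coming in, which is where most of the perceived latency savings come from.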
Yes, I was searching for realtime STT and got a hit on GitHub, then looked at his other projects and found he builds on his STT and TTS projects; it's just 2-second latency for local voice chat, which is very good.
Do you happen to know anything about any open source voice identification software?
I’ve noticed with ChatGPT voice and every other voice-driven assistant that background voices and noise are a massive problem. One solution could be advanced pre-processing to identify your voice only.
Another idea I’ve had is using something professional with PTT:
Yeah github.com/KoljaB is quite a collection of stuff! I agree.
Your vision of JARVIS, which I share completely (though I haven't accomplished what you have), seems very attainable. Again, excellent work and thank you for sharing. Combining your work with KoljaB's looks very promising.
Hey guys! I work at Taipy, a Python library designed for creating web applications using only Python. Some users had problems displaying charts based on big data, e.g., line charts with 100,000 points. We worked on a feature that reduces the number of displayed points while retaining the shape of the curve as much as possible, and we wanted to share how we did it.
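One common shape-preserving downsampling technique (not necessarily the one Taipy implements, just a sketch of the general idea) is min-max decimation: split the series into buckets and keep each bucket's minimum and maximum point, so peaks and valleys survive the reduction. The function name and bucket scheme below are illustrative.

```python
# Min-max decimation sketch: keep each bucket's extreme points so the
# rendered curve keeps its peaks and valleys with far fewer points.
def minmax_downsample(ys, n_buckets):
    size = max(1, len(ys) // n_buckets)
    kept = []
    for start in range(0, len(ys), size):
        bucket = list(enumerate(ys[start:start + size], start))
        lo = min(bucket, key=lambda p: p[1])  # (index, value) of the minimum
        hi = max(bucket, key=lambda p: p[1])  # (index, value) of the maximum
        # keep the extremes (deduplicated) in original x order
        kept.extend(sorted({lo, hi}))
    return kept

ys = [0, 5, 1, 9, 2, 8, 3, 7, 4, 6]
print(minmax_downsample(ys, 2))
```

For 100,000 points rendered into a chart a few hundred pixels wide, this kind of reduction cuts the payload by orders of magnitude while the plotted line looks essentially identical.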