Hi, so I'm looking for an STT (speech-to-text) tool that can run on a server or from cron, using a small local model (the server is CPU-only: 4 Threadripper vCPUs and 20 GB of RAM). Ideally it would transcribe directly from remote audio URLs, but I know local models probably don't have that feature, so I'll likely have to do something like curl the audio down to memory or /tmp, transcribe it, and then remove the file.
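The download, transcribe, clean-up flow described above could be sketched roughly like this. It is a minimal sketch, not a recommendation of any particular model: `transcribe_url` and its `transcribe` argument are hypothetical names, and `transcribe` stands in for whatever local STT call ends up being used (anything that takes a file path and returns text). Only the Python standard library is used for the download and cleanup.

```python
import os
import shutil
import tempfile
import urllib.request
from typing import Callable


def transcribe_url(url: str, transcribe: Callable[[str], str], suffix: str = ".audio") -> str:
    """Download `url` to a temp file, run `transcribe` on it, and always delete the file.

    `transcribe` is a placeholder (assumption) for the local model's
    "path in, text out" function; it is not a real library API.
    """
    # mkstemp creates the file in the system temp dir (usually /tmp on Linux).
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # Stream the response to disk instead of holding it all in memory.
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url, timeout=60) as resp:
            shutil.copyfileobj(resp, out)
        return transcribe(path)
    finally:
        # Runs on success and on failure, so a cron job does not leave files behind.
        os.remove(path)
```

From cron, this would be called once per URL with the real model function passed in as `transcribe`; the `try`/`finally` is what keeps /tmp from filling up if a download or transcription fails.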
vibebin is an Incus/LXC-based platform for self-hosting persistent AI coding agent sandboxes with Caddy reverse proxy and direct SSH routing to containers (suitable for VS Code remote ssh). Create and host your vibe-coded apps on a single VPS/server.
If anyone wants to test it or provide some feedback, that would be great. Core functionality works, but there are likely to be bugs.
My intent for the project was for a tinkerer/hobbyist, or even a not-super-technical person, to put this on a VPS and just start doing their own thing: experimenting, tinkering, learning, etc.
I had so much fun working on this project, completely reinvigorated by it tbh.
I am just a Linux sysadmin and not a programmer at all (~just~ smart enough to figure stuff out though:) ) and I have to say the excitement and energy that working on this project brought me was like nothing I've ever experienced before. It makes me so optimistic about this future that we are either embracing or fending off (depending on your mindset).
the old voice cloning and/or TTS models were CPU only, and they weren't realtime, but no worse than 2:1; 30 seconds of audio would take roughly 60 seconds to generate. in 2021 one-shot TTS/cloning using GPUs was getting there, and that was close enough to realtime; one could, if one was willing to deal with it, wire microphone audio to the model and speak, and the model would modify the voice in real time. Phil Hendrie is jealous.
anyhow, with faster CPUs and optimizations, you won't be waiting too long. Also, 20GB is overkill for an audio model. Only text models (LLMs) are huge and eat seemingly unlimited memory. SD/FLUX models use under 16GB of RAM (uh, mine do, at least!), for instance.
Usually gone or retooled for something else. Even just within the Prince of Persia series, they had a game called Prince of Persia: Assassins that was canceled and turned into Assassin's Creed. There was also a sequel to Prince of Persia 2008 in development that was canceled and never showed up again. You can even find footage from both of these games online, but they will never see the light of day unless someone leaks them (which does happen sometimes).
The current code doesn't appear very optimized. Running CPU-only, for example, it uses just four threads, nowhere close to saturating all my cores.
As a result it's dog slow on CPU only, like 3-4 minutes to produce a 3 second clip, and still significantly slower than real time on my 5090, where it uses only 30% of the GPU.
Have any thoughts?
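To put the timings quoted in this thread on one scale, here is a rough real-time-factor calculation (processing time divided by audio length, so 1.0 means real time and higher is slower). The function name `real_time_factor` is just illustrative; the inputs are the approximate figures from the comments above, not new measurements.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute needed per second of audio produced."""
    return processing_seconds / audio_seconds


# Old CPU-only models: about 60 s to generate 30 s of audio.
print(real_time_factor(60, 30))   # 2.0

# Current code on CPU only: 3-4 minutes for a 3 second clip.
print(real_time_factor(180, 3))   # 60.0
print(real_time_factor(240, 3))   # 80.0
```

So the CPU-only numbers reported here are roughly 30 to 40 times slower than the old 2:1 figure.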