Hacker News | indigodaddy's comments

Hi, I'm looking for an STT setup that can run on a server via cron, using a small local model (the server is a 4 vCPU Threadripper, CPU only, with 20 GB RAM), and that can transcribe from remote audio URLs. Ideally it would read the URL directly, but I know local models probably don't support that, so I'd likely do something roughly like the sketch below: curl the audio down to memory or /tmp, transcribe it, and then remove the file.
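
For concreteness, here's roughly the flow I mean (a minimal sketch assuming faster-whisper and its "small" model on CPU; the URL is just a placeholder):

    import os
    import tempfile
    import urllib.request

    from faster_whisper import WhisperModel

    AUDIO_URL = "https://example.com/episode.mp3"  # placeholder remote audio URL

    # Small model, int8-quantized for CPU; fits comfortably in 20 GB RAM.
    model = WhisperModel("small", device="cpu", compute_type="int8", cpu_threads=4)

    # Download to a temp file, transcribe, then clean up.
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
        urllib.request.urlretrieve(AUDIO_URL, tmp.name)
        path = tmp.name

    try:
        segments, _info = model.transcribe(path)
        print(" ".join(segment.text.strip() for segment in segments))
    finally:
        os.remove(path)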

Have any thoughts?


I’ve no thoughts on that unfortunately.


How would IPv6 solve this specific problem?

They're aiming to reduce complexity for the user wherever they can, so their efforts around this make total sense for the platform they've created.

If anyone wants to see how exe.dev works, a Marimo dev put together a really nice video:

https://www.youtube.com/watch?v=bV60dwhL2x4


I used Opus 4.5 for 99.9% of this project. Take that as you will.

https://github.com/jgbrwn/vibebin

vibebin is an Incus/LXC-based platform for self-hosting persistent AI coding agent sandboxes, with a Caddy reverse proxy and direct SSH routing to containers (suitable for VS Code Remote SSH). Create and host your vibe-coded apps on a single VPS/server.
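
For the curious, under the hood it's roughly this kind of plumbing per app (an illustrative sketch only, not vibebin's actual code; the container name, image, domain, and port are placeholders):

    import json
    import subprocess

    NAME = "myapp"                  # placeholder container name
    DOMAIN = "myapp.example.com"    # placeholder domain
    APP_PORT = 3000                 # placeholder port the app listens on

    # Launch an Incus container, then read its IPv4 address from `incus list`.
    # (In practice you'd wait for the container's network to come up first.)
    subprocess.run(["incus", "launch", "images:debian/12", NAME], check=True)
    info = json.loads(subprocess.check_output(["incus", "list", NAME, "--format", "json"]))
    addresses = info[0]["state"]["network"]["eth0"]["addresses"]
    ipv4 = next(a["address"] for a in addresses if a["family"] == "inet")

    # Append a Caddy site block reverse-proxying the domain to the container,
    # then reload Caddy to pick it up.
    with open("/etc/caddy/Caddyfile", "a") as caddyfile:
        caddyfile.write(f"\n{DOMAIN} {{\n    reverse_proxy {ipv4}:{APP_PORT}\n}}\n")
    subprocess.run(["systemctl", "reload", "caddy"], check=True)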

If anyone wants to test it or provide some feedback, that would be great. Core functionality works, but there are likely to be bugs.

My intent for the project was for a tinkerer/hobbyist, or even a not-super-technical person, to put this on a VPS and just start doing their own thing: experimenting, tinkering, learning, etc.

I had so much fun working on this project; it completely reinvigorated me, tbh.

I am just a Linux sysadmin and not a programmer at all (*just* smart enough to figure stuff out, though :) ), and I have to say the excitement and energy that working on this project brought me was unlike anything I've experienced before. It makes me so optimistic about this future that we are either embracing or fending off (depending on your mindset).


Simon, how do you think this would perform on CPU only? Let's say a Threadripper with 20 GB RAM. (Voice cloning in particular.)

No idea at all, but my guess is it would work but be a bit slow.

You'd need to use a different build of the model, though; I don't think MLX has a CPU implementation.


The old voice cloning and/or TTS models were CPU only, and they weren't real-time, but no worse than about 2:1: 30 seconds of audio would take roughly 60 seconds to generate. By 2021, one-shot TTS/cloning on GPUs was getting there, and that was close enough to real-time; one could, if one was willing to deal with it, wire microphone audio into the model, speak, and have the model modify the voice in real time. Phil Hendrie is jealous.

Anyhow, with faster CPUs and optimizations, you won't be waiting too long. Also, 20 GB is overkill for an audio model; only text models (LLMs) are huge and eat seemingly unbounded memory. SD/FLUX models stay under 16 GB of RAM (uh, mine do, at least!), for instance.


What happens to all that content/work when this sort of thing happens? Is it just gone for good, or...?

Usually gone, or retooled into something else. Even just within the Prince of Persia series, they had a game called Prince of Persia: Assassins that was cancelled and turned into Assassin's Creed. There was also a sequel to Prince of Persia (2008) in development that was cancelled and never showed up again. You can even find footage from both of these games online, but they will never see the light of day unless someone leaks them (which does happen sometimes).

Thanks

How does the cloning compare to Pocket TTS?

It's uncannily good. I prefer it to Pocket, but then again Pocket is much smaller and built for real-time streaming.

Ah right, I guess I meant for instant cloning, which I assume Qwen can't do.

Pocket TTS is much smaller: 100M parameters versus 600–1800M.

Ah right, so I guess qwen3-tts isn't going to work CPU-only like Pocket TTS can(?)

The current code doesn't appear very optimized. Running CPU-only, it uses just four threads, for example, nowhere close to saturating all my cores.

As a result it's dog slow on CPU only, like 3-4 minutes to produce a 3-second clip, and it's still significantly slower than real-time on my 5090, which it only pushes to about 30% utilization.
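
If it's PyTorch under the hood, it may just be running with the default CPU thread settings; a quick thing to try before the model loads (a hedged sketch, assuming a PyTorch CPU run):

    import os
    import torch

    # PyTorch's intra-op thread count can default low on some setups; bump it
    # to the machine's core count. (Setting OMP_NUM_THREADS in the environment
    # before launching the process is the other common lever.)
    torch.set_num_threads(os.cpu_count())
    print("intra-op threads:", torch.get_num_threads())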


Stumbled on this, not affiliated

It's not exactly sandboxing so much as just working within LXC containers, but I've been building this:

https://github.com/jgbrwn/vibebin

