ldenoue's comments | Hacker News

Check out something like LayerCode (Cloudflare-based).

Or PipeCat Cloud / LiveKit Cloud (I think they charge about 1 cent per minute?)


I built a voice AI stack, and background noise can actually be helpful; for a restaurant AI, for example, Italian background music or café ambience is part of the brand. It's not meant to make the caller believe they aren't talking to a bot, only to keep the call on brand.


You can call it whatever you like, but to me this is deceptive.

How is this different from Indian support staff pretending to be nearby by telling you about the local weather? Your version is arguably even worse, because it can fool people more convincingly.


It doesn't have to be. You can configure your bot to greet the user, e.g. "Aleksandra is not available at the moment, but I'm her AI assistant and I can help you book a table. How may I help you?"

So you're telling the caller that it is an AI, and yet you can have a pleasant background audio experience.


The problem with PipeCat and LiveKit (the two major stacks for building voice AI) is deployment at scale.

That's why I created a stack entirely in Cloudflare Workers and Durable Objects, in JavaScript.

Providers like AssemblyAI and Deepgram now integrate VAD in their realtime APIs, so the voice AI itself only needs networking (no heavy CPU work anymore).
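Roughly, turn detection gets pushed to the STT provider; here's a sketch (parameter names follow Deepgram's streaming API as I recall them, and DEEPGRAM_API_KEY / onTurnComplete are placeholders, so double-check against their docs):

    // Sketch: let the STT provider do VAD / endpointing so the Worker only
    // shuttles audio bytes. Parameter names as I recall them from Deepgram's
    // streaming docs; DEEPGRAM_API_KEY and onTurnComplete are placeholders.
    const url = new URL('wss://api.deepgram.com/v1/listen');
    url.searchParams.set('model', 'nova-2');
    url.searchParams.set('encoding', 'mulaw');        // telephony audio
    url.searchParams.set('sample_rate', '8000');
    url.searchParams.set('interim_results', 'true');
    url.searchParams.set('endpointing', '300');       // ms of silence before speech_final
    url.searchParams.set('vad_events', 'true');
    url.searchParams.set('utterance_end_ms', '1000'); // emits UtteranceEnd events

    // Browser-style auth via subprotocol; server-side you'd send an Authorization header instead.
    const dg = new WebSocket(url.toString(), ['token', DEEPGRAM_API_KEY]);

    dg.onmessage = (e) => {
      const msg = JSON.parse(e.data);
      if (msg.type === 'Results' && msg.speech_final) {
        const text = msg.channel.alternatives[0].transcript;
        if (text) onTurnComplete(text);   // finished user turn, hand it to the LLM
      } else if (msg.type === 'UtteranceEnd') {
        onTurnComplete();                 // fallback turn boundary from the provider's VAD
      }
    };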


Let me get this straight: you're storing conversation threads / context in DOs?

e.g. Deepgram (STT) via websocket -> DO -> LLM API -> TTS?


Yes, Durable Objects let you handle long-lived WebSocket connections. I think this is fairly unique to Cloudflare; AWS and Google Cloud don't seem to offer this kind of statefulness.
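A minimal sketch of the shape such a Durable Object takes (names like CallSession are illustrative, not the actual code):

    // Sketch of a Durable Object that owns one call: it accepts the caller's
    // WebSocket and keeps per-call state (conversation history) in memory.
    export class CallSession {
      constructor(state, env) {
        this.state = state;
        this.env = env;
        this.history = [];          // conversation turns, lives as long as the DO
      }

      async fetch(request) {
        if (request.headers.get('Upgrade') !== 'websocket') {
          return new Response('expected websocket', { status: 426 });
        }
        const [client, server] = Object.values(new WebSocketPair());
        server.accept();            // the DO keeps this connection open for the whole call

        server.addEventListener('message', (event) => {
          // e.g. transcript chunks from the STT leg: append to history,
          // call the LLM, then stream its output to the TTS leg.
          this.history.push({ role: 'user', content: event.data });
        });

        return new Response(null, { status: 101, webSocket: client });
      }
    }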

Same with TTS: providers like Deepgram and ElevenLabs let you stream the LLM text (or chunks per sentence) over their websocket APIs, which makes the voice AI bot really low latency.
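Sketch of the sentence-by-sentence flush (llmStream and ttsSocket are placeholders, and the message shape is illustrative; each provider's streaming TTS websocket has its own format):

    // Flush LLM output to the TTS socket one sentence at a time instead of
    // waiting for the full reply. llmStream is any async iterable of text chunks.
    let buffer = '';
    for await (const token of llmStream) {
      buffer += token;
      const match = buffer.match(/^(.*?[.!?])\s+(.*)$/s);     // complete sentence yet?
      if (match) {
        ttsSocket.send(JSON.stringify({ text: match[1] }));   // synthesize it right away
        buffer = match[2];
      }
    }
    if (buffer.trim()) ttsSocket.send(JSON.stringify({ text: buffer }));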


I developed a stack on Cloudflare Workers where latency is super low, and it's cheap to run at scale thanks to Cloudflare's pricing.

It runs at around 50 cents per hour using AssemblyAI or Deepgram for STT, Gemini Flash as the LLM, and InWorld.ai for TTS (for me it's on par with ElevenLabs and super fast).


Are AssemblyAI or Deepgram compatible with the OpenAI Realtime API, especially around voice activity detection and turn taking? How do you implement those?


I'm not using speech-to-speech APIs like OpenAI's, but it would be easy to swap the STT + LLM + TTS pipeline for Realtime (or the Gemini Live API, for that matter).

OpenAI's realtime voices are really bad though, so you can also configure your session to accept AUDIO and output TEXT, and then use any TTS provider (like ElevenLabs or InWorld.ai, my favorite for cost) to generate the audio.
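Sketch of that session config (field names follow the Realtime API as I understand it, worth verifying against the docs; realtimeSocket is a placeholder):

    // Configure a Realtime session to take audio in but respond with text only,
    // so a separate TTS provider can voice the reply.
    realtimeSocket.send(JSON.stringify({
      type: 'session.update',
      session: {
        modalities: ['text'],                    // no audio out from OpenAI
        input_audio_format: 'g711_ulaw',         // telephony-friendly input
        turn_detection: { type: 'server_vad' },  // let the API handle turn taking
        instructions: 'You are a phone assistant for a restaurant.',
      },
    }));

    // Text deltas come back as response.text.delta events; forward each chunk
    // (or each completed sentence) to the TTS provider of your choice.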


Do you have anything written up about how you're doing this? Curious to learn more...


I don't, but I should open source this code. I was trying to sell it to OEMs, which is why I haven't. Are you interested in licensing it?


Would be useful to get a preview of the code


Thanks for the feedback! I'm working on it now. Will push to GitHub soon with a basic Hono + D1 + Stripe setup you can actually run. I'll share it here when it's ready.
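In the meantime, a bare-bones sketch of what that combination looks like (route names and the customers table are made up for illustration, not the actual repo):

    // Hono on Cloudflare Workers with a D1 binding (DB) declared in wrangler.toml.
    import { Hono } from 'hono';

    const app = new Hono();

    app.get('/api/customers', async (c) => {
      const { results } = await c.env.DB.prepare('SELECT id, email FROM customers').all();
      return c.json(results);
    });

    app.post('/api/checkout', async (c) => {
      // Stripe checkout session creation would go here; stripe-node runs on
      // Workers with its fetch-based HTTP client.
      return c.json({ ok: true });
    });

    export default app;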


Which browser and computer?


1. Firefox 145.0 (64-bit) with NoScript on Fedora 41 on Qubes OS.

2. Firefox 140.5esr with NoScript on PureOS on Librem 5.

In both cases, I only see "© 2025 AppBlit" at the end of an otherwise empty page.


In-browser transcript beautification using a mix of small models (BERT, all-MiniLM-L6-v2, and T5) for restoring punctuation, finding chapter splits, and generating the chapter headers.
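For the chapter-split part, one way to do it fully in the browser is to embed each sentence and cut where similarity between neighbors dips; here's a sketch with transformers.js (the threshold and helper names are illustrative, and not necessarily the exact runtime I ship):

    // In-browser embeddings with all-MiniLM-L6-v2 via transformers.js; a dip in
    // similarity between adjacent sentences suggests a topic change, i.e. a chapter break.
    import { pipeline } from '@xenova/transformers';

    const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

    async function findChapterSplits(sentences, threshold = 0.45) {
      const out = await embed(sentences, { pooling: 'mean', normalize: true });
      const vecs = out.tolist();                 // one unit-length vector per sentence
      const splits = [];
      for (let i = 1; i < vecs.length; i++) {
        // dot product of normalized vectors = cosine similarity
        const sim = vecs[i - 1].reduce((s, v, j) => s + v * vecs[i][j], 0);
        if (sim < threshold) splits.push(i);     // low similarity => new chapter starts here
      }
      return splits;
    }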


What is the name of the product / service?

Is it YouReadTube or is it Scribe?


YouReadTube is the new name because it's easier to remember, and you can also just insert "read" into any YouTube link.


Check out https://ldenoue.github.io/readabletranscripts/ and the website https://www.appblit.com/scribe, which use Gemini to post-correct the raw transcripts.
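The post-correction call itself is simple; here's a sketch using the official @google/generative-ai SDK (the prompt wording is illustrative and GEMINI_API_KEY is a placeholder, not the exact prompt these apps use):

    // Ask Gemini to restore punctuation and paragraphs without rewording anything.
    import { GoogleGenerativeAI } from '@google/generative-ai';

    const genAI = new GoogleGenerativeAI(GEMINI_API_KEY);
    const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });

    async function polishTranscript(rawTranscript) {
      const result = await model.generateContent(
        'Restore punctuation and split this transcript into paragraphs. ' +
          'Do not change the wording.\n\n' + rawTranscript,
      );
      return result.response.text();
    }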


Unless you fetch directly from your browser. It works by getting the YouTube player JSON, which includes the caption tracks, and then using the baseUrl to download the caption XML.

I wrote a webapp that uses this method: it calls Gemini in the background to polish the raw transcript and produce a much better version with punctuation and paragraphs.

https://www.appblit.com/scribe

Open source, with code showing how to fetch from YouTube's servers from the browser: https://ldenoue.github.io/readabletranscripts/
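For reference, a sketch of the caption-fetch step (field paths follow the public ytInitialPlayerResponse structure; it only works where youtube.com is reachable without CORS restrictions, e.g. an extension or userscript, and the repo above has the real implementation):

    // Naive version: pull the embedded player JSON out of the watch page,
    // find a caption track, and download its XML from baseUrl.
    async function fetchCaptionXml(videoId) {
      const html = await (await fetch(`https://www.youtube.com/watch?v=${videoId}`)).text();
      const match = html.match(/ytInitialPlayerResponse\s*=\s*(\{.+?\});/s); // naive extraction
      if (!match) throw new Error('player response not found');
      const player = JSON.parse(match[1]);

      const tracks =
        player.captions?.playerCaptionsTracklistRenderer?.captionTracks ?? [];
      if (!tracks.length) throw new Error('no caption tracks');

      // baseUrl points at the timedtext endpoint that returns the caption XML.
      return (await fetch(tracks[0].baseUrl)).text();
    }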


