nowittyusername's comments | Hacker News

Do you have any resources or YouTube videos that might also help someone understand the LCM context management a bit better? I think there's something to this, but I'm having trouble wrapping my head around it. I learn well with analogies, and I'm trying to really grok the concept here; if there are other ways you could explain it, that would be appreciated. Mind you, I have built my own agents from scratch, so I'm not a total novice in these areas. My agents already manage context with sub-agents and multi-layered conversational histories with RAG thrown in there. But I don't want to make wrong assumptions about your implementations and miss the nuanced, important bits. Regardless, I'll try my best to reread the article and hash it out on my own. Thanks for the paper.

Hi NWU,

We don't have any other materials yet, but let's see if this lands for you. I can run you through a couple simpler versions of the system, why they don't work, and how that informs our ultimate design.

The most basic part of the system is "two layers". Layer 1 is the "ground truth" of the conversation - the whole text the user sees. Layer 2 is what the model sees, i.e., the active context window.

In a perfect world, those would be the same thing. But, as you know, context lengths aren't long enough for that, so we can't fit everything from Layer 1 into Layer 2.

So instead we keep a "pointer" to the appropriate part of Layer 1 in Layer 2. That pointer takes the form of a summary. But it's not a summary designed to contain all information. It's more like a "label" that makes sure the model knows where to look.

The naive version of the system would allow the main model to expand Layer 2 summaries by importing all of the underlying data from Layer 1. But this doesn't work well, because then you just end up re-filling the Layer 2 context window.

So instead you let the main model clone itself, the clone expands the summary in its context (and can do this for multiple summaries, transforming each into the original uncompressed text), and then the clone returns whatever information the main thread requires.

Where this system would not fully match the capabilities of RLMs is that, by writing a script that calls itself e.g. thousands of times, an RLM has the ability to make many more recursive tool calls than can fit in a context window. So we fix that using operator-level recursion, i.e., we give the LLM a tool, map, that executes arbitrary recursion, without the LLM having to write a custom script to accomplish that.
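If it helps to see the shape of it, here is a rough Python-style sketch of the two layers plus the clone/map step. This is purely illustrative, not our actual code or API; every name in it (TwoLayerContext, summarize, answer, merge, and so on) is invented for the example.

    # Illustrative sketch only -- not the real implementation or API.
    # Layer 1: full ground-truth transcript. Layer 2: the active context window,
    # where old spans are replaced by short summary "pointers" back into Layer 1.
    class TwoLayerContext:
        def __init__(self, llm):
            self.llm = llm
            self.layer1 = []   # raw text spans (everything the user has seen)
            self.layer2 = []   # (kind, text, layer1_index) entries the model sees

        def append(self, text):
            self.layer2.append(("raw", text, len(self.layer1)))
            self.layer1.append(text)

        def compress(self, i):
            # Replace a raw span in Layer 2 with a label-like summary that tells
            # the model where to look, not one that tries to keep every detail.
            kind, text, idx = self.layer2[i]
            self.layer2[i] = ("summary", self.llm.summarize(text, style="label"), idx)

        def expand_in_clone(self, question, indices):
            # A clone of the main model re-expands the chosen summaries into their
            # original Layer 1 text, answers the question, and returns only the
            # answer -- so the main thread's Layer 2 never re-fills with raw text.
            clone_context = [self.layer1[self.layer2[i][2]] for i in indices]
            return self.llm.answer(question, context=clone_context)

        def map(self, question, chunk_size=4):
            # Operator-level recursion: fan the question out over every chunk of
            # Layer 1 in separate clone contexts and merge the results, instead of
            # asking the model to write its own recursive script.
            results = []
            for start in range(0, len(self.layer1), chunk_size):
                chunk = self.layer1[start:start + chunk_size]
                results.append(self.llm.answer(question, context=chunk))
            return self.llm.merge(results)

The key property is that expansion always happens inside a clone (or inside map's fanned-out calls), never in the main thread's own window.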

Hope this helps!

- Clint


Good article, and I agree with everything in there. For my own voice agent I decided to make him PTT by default, as the problems of the model accurately guessing the end of an utterance are just too great. I think it can be solved in the future, but I haven't seen a really good example of it being done with modern-day tech, including this lab's. Fundamentally it comes down to the fact that different humans have different ways of speaking, and the human listening to them updates their own internal model of that speech pattern, adjusting it after a couple of interactions and arriving at the proper way of speaking with that person. Something very similar will need to be done, at very low latencies, for it to succeed in the audio ML world. But I don't think we have anything like that yet. It seems the best you can currently do is tune the model on a generic speech pattern that you expect to fit a large percentage of the human population; anyone who falls outside of that will feel the pain of getting interrupted every time.
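To give a feel for what I mean, here's a toy sketch of adapting the end-of-utterance cutoff per speaker. It's purely illustrative: the class, the constants, and the update rule are all made up, and a real system would need learned models doing this at much lower latency.

    # Toy illustration only: adapt the silence cutoff to how a given speaker
    # actually pauses, starting from a generic default.
    class AdaptiveEndpointer:
        def __init__(self, initial_threshold_s=0.8, alpha=0.2):
            self.threshold = initial_threshold_s  # seconds of silence that end a turn
            self.alpha = alpha                    # how quickly we adapt to this speaker

        def observe_intra_utterance_pause(self, pause_s):
            # Pauses that did NOT end the turn show how long this speaker hesitates
            # mid-thought; the cutoff should sit comfortably above those pauses.
            target = max(self.threshold, pause_s * 1.5)
            self.threshold += self.alpha * (target - self.threshold)

        def is_end_of_utterance(self, silence_s):
            return silence_s >= self.threshold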

Check out Sparrow-0. The demo shows an impressive ability to predict when the speaker has finished talking:

https://www.tavus.io/post/sparrow-0-advancing-conversational...


Thanks, I'll read it now.

It feels like this is one of those areas where the last 10% of polish will take 90% of the effort

The 80/20 rule always wins

I wholeheartedly agree. In an age of talking heads, you will not hear from the people actually doing the thing, because they're too busy doing the thing versus talking about it. Now excuse me, I'ma go back to doing the thing.


I have been playing around with over 10 stt systems in the last 25 days, and it's really weird to read this article, as my experience is the opposite. Stt models are amazing today. They are stupid fast, sound great, and are very simple to implement, as Hugging Face Spaces code is readily available for any model. What's funny is that the model he was talking about, Supertonic, was exactly the model I would have recommended if people wanted to see how amazing the tech has become. The model is tiny, runs 55x real time on any potato, and sounds amazing. Also I think he is implementing his models wrong. He mentions that some models don't have streaming and you have to wait for the whole chunk to be processed, but that's not a limit in any meaningful way, because you get to define the chunk. You can simply make the first sentence (or the first n characters) the chunk, process it first, and play it immediately while the rest of the text is being processed. TTFS and TTFA on all modern-day models are well below 0.5 s, and for Supertonic it was 0.05 s in my tests.
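Here's roughly what I mean as a minimal sketch; synthesize and play are placeholders for whatever TTS model and audio playback you are using, not a real API:

    import re

    def speak_streaming(text, synthesize, play):
        # Split the reply on sentence boundaries and synthesize the first sentence
        # right away, so audio starts playing while the rest is still being processed.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        for sentence in sentences:
            if sentence:
                play(synthesize(sentence))  # overlap playback with the next synthesis call

In a real pipeline you would run synthesis and playback on separate threads with a queue so they overlap, but the point is that the chunk boundary is yours to choose.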


What screenreaders are you using to test the models with?


What's your experience at high speeds, with garbled speech artifacts and pronunciation accuracy?


With Supertonic, or overall? If overall, most do pretty well, though some are funky; Suprano was so bad no matter what I did that I had to rule it out of my top contenders for anything. Supertonic was close to my number one choice for my agentic pipeline, as it was so insanely fast and the quality was great, but it didn't have the other bells and whistles some other models do, so I'm holding it for CPU-only projects in the future. If you are going to use it on a GPU, I would suggest Chatterbox or Pocket TTS. Chatterbox is my top contender as of now because it sounds amazing, has cloning, and I got it down to 0.26 TTFA/TTSA once I quantized it and integrated Pipecat into it. Pocket TTS is probably my second choice for similar reasons.


>Also I think he is implementing his models wrong.

This is something I've noticed around a lot of AI-related stuff. You really can't take any one article on it as definitive, and anything that doesn't publish how it was fully implemented is suspect. That goes for both affirmative and negative findings.

It reminds me a bit of the earlier days of the internet, where there was a lot of exploration of ideas, but quite often the implementation and testing of those ideas left much to be desired.


Minor nitpick, but you mean "tts" not "stt" both times.

Is Supertonic the best-sounding model, or is there a different one you'd recommend that doesn't perform as well but sounds even better?


Yes, sorry, I mixed these up. Supertonic is not the best sounding in my tests; it was by far the fastest, and its audio quality for something so fast was decent. If you want something that sounds better AND is also extremely fast, Pocket TTS is the choice: amazing quality and also crazy fast on both GPU and CPU. If you care mainly about quality, Chatterbox was the best fit in my tests, but it's slower than the others. Qwen 3 TTS was also great, but it's unusable as a real-time agentic voice as it's too slow; they haven't released the streaming code yet, and once they do it will be my top contender.


Thanks!


Just found this video ... it looks to sound and work -very- well. (RasPI & Onyx)

https://www.youtube.com/watch?v=bZ3I76-oJsc


Are you using them at 1000 wpm?


Supertonic is probably way faster than that; I wouldn't be surprised if, measured, it came out to something like 14k wpm. On my 4090 I was getting about 175x real time, while on CPU only it was 55x real time. I stopped optimizing it, but I'm sure it could be pushed further. Anyway, you should check out their repo and test it yourself; it's crazy what that team accomplished!
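For what it's worth, by "x real time" I mean seconds of audio produced per second of wall-clock synthesis time, measured roughly like this (tts.synthesize and the sample rate are placeholders for whatever model you're testing):

    import time

    def realtime_factor(tts, text, sample_rate=24000):
        start = time.perf_counter()
        audio = tts.synthesize(text)     # placeholder call returning raw samples
        elapsed = time.perf_counter() - start
        audio_seconds = len(audio) / sample_rate
        return audio_seconds / elapsed   # 55x means ~18 ms per second of audio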


Audio synthesis speed is one thing, but is the output _intelligible to a human_ at 1,000wpm? That's the sort of thing Eloquence is being used for, according to the article.


TTS has no intelligence, bud. It's only something that transforms text to audio, and that is all we are talking about here. Neither the article nor anyone else was discussing the whole STT > LLM > TTS pipeline.



Did you even read the article, bud?


It's an issue caused by many factors, mostly related to the way our large-scale societies are structured and run, but I believe it will be solved very soon... by AI. First, a disclaimer: I am not advocating one way or the other on this, just spelling out what I see on the horizon. Very soon AI systems will become a lot more sophisticated than your average chatbot. We will interact with them naturally through voice, and they will become more capable of expressing the various nuances of human speech, conversational cadence, etc. This is where humans will find solace. In fact, I believe AI will be a human's best friend, lover, parent, child, and so on as the technology progresses and these things get embodied. This year alone I expect the start of mass adoption of voice agents. But yeah, that's how I see things playing out. If I am right and things go this way, and you are interacting with these things, the smart move is to make sure you own the full stack 100% and not use the API-related nonsense that will eventually brainwash you for this or that reason. If you are going to dig a hole, at least dig one that doesn't have the obvious traps in it.


Thanks for the heads up, this looks really interesting and the claimed speed is nuts.


This is perfect for me. I just started working on the voice related stuff for my agent framework and this will be of real use. Thanks.


Most people have not fully grasped how LLMs work and how to properly utilize agentic coding solutions. That is the reason vibe coders run into issues with low-quality code. But that is not a limitation of the technology; it is a limitation of the user (at this stage). Basically, think of it this way: everyone is the grandma who has been handed a Palm Pilot to get things done. Grandma needs an iPhone, not a Palm Pilot, but the problem is that we are not in that territory yet. Now consider the people who were able to use the Palm Pilot very successfully; they were few, and they were the exception, but they existed. Same here. I have been using coding agents for over 7 months now and have written zero lines of code; in fact, I don't know how to code at all. But I have been able to architect very complex software projects from scratch: text to speech, an automated LLM benchmarking system for testing all possible llama.cpp sampling parameters, and more, and now I'm building my own agentic framework from scratch. All of these things and more are possible without writing one line of code yourself. But it does require understanding how to use the technology well.


If you don't know how to code, then you are not able to accurately judge what you're producing.


Here you go, I open-sourced one of the projects: https://youtu.be/EyE5BrUut2o


All of the applications you mention could be scoped as beginner projects. I don't think they represent good proofs of capability.


Well, why don't you look at it for yourself and tell me if this looks like a beginner project: https://youtu.be/EyE5BrUut2o


Yes, this does look like a beginner project, and exactly what I expected from someone who doesn't write code.


This is extremely simple software.

Claude is extremely verbose when it generates code, but this is something that should take a practicing software engineer an hour or so to write with a lot less code than Claude.

I like all the LLM coding tools, they're constantly getting better, but I remain convinced that all the people claiming massive productivity improvements are just not good software engineers.

I think the tools are finally at the point where they are generally a help, rather than a net waste of time for good engineers, but it's still marginal atm.


There is so much marketing BS around these things it drives me nuts, and it doesn't help that the large labs and credible individuals like Demis use these terms. "World models" are video generators with contextual memory, and the term is badly misplaced: when one thinks of a "world model" you expect the thing to at least be physics-engine driven at its foundation, not the other way around, where everything is generated and assumed at best.


This is the most blatant buy-the-competition move I've ever seen...

