Hacker News | FuckButtons's comments

A 3 bit quant will run on a 128gb MacBook Pro, it works pretty well.

A 3 bit quant is quite a lot weaker than the OpenRouter version the OP is using.

Yeah, the fact that we have treated context as immutable baffles me; it’s not like humans’ working memory keeps a perfect history of everything they’ve done over the last hour. It shouldn’t be that complicated to train a secondary model that just runs online compaction. E.g.: it runs a tool call, the model determines what’s germane to the conversation and prunes the rest; or some task gets completed, so just leave a stub in the context that says "completed x", with a tool available to see the details of x if it becomes relevant again.

That's pretty much the approach we took with context-mode. Tool outputs get processed in a sandbox, only a stub summary comes back into context, and the full details stay in a searchable FTS5 index the model can query on demand. Not trained into the model itself, but gets you most of the way there as a plugin today.
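The stub-plus-searchable-index idea described above can be sketched with Python's stdlib `sqlite3` and an FTS5 virtual table. This is a minimal illustration, not context-mode's actual API; the function names and stub format are invented for the example (FTS5 must be compiled into your SQLite build, which it is in most standard Python distributions):

```python
import sqlite3

# Full tool output goes into an FTS5 table; only a short stub
# returns to the model's context window.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE tool_outputs USING fts5(call_id, content)")

def record_tool_output(call_id: str, content: str, stub_len: int = 120) -> str:
    """Store the full output, return only a stub for the context window."""
    conn.execute("INSERT INTO tool_outputs VALUES (?, ?)", (call_id, content))
    stub = content[:stub_len].rstrip()
    return f"[tool {call_id}: done, {len(content)} chars archived] {stub}..."

def search_tool_output(query: str) -> list[tuple[str, str]]:
    """Model-invocable tool: full-text search over archived outputs."""
    rows = conn.execute(
        "SELECT call_id, content FROM tool_outputs WHERE tool_outputs MATCH ?",
        (query,),
    )
    return rows.fetchall()

stub = record_tool_output("ls-1", "src/main.py\nsrc/util.py\n" * 50)
hits = search_tool_output("util")
```

The model only ever sees `stub`; the search tool pulls the full text back in on demand.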

This is a partial realization of the idea, but for a long-running agent the proportion of noise increases linearly with session length; unless you take an appropriately large machete to the problem, you’re still going to wind up with suboptimal results.

Yeah, I'd definitely like to be able to edit my context a lot more. Once you consider that, you start seeing things in your head like "select this big chunk of context and ask the model to simplify that part", or fixing the model ingesting too many tokens because it dumped in a whole file it didn't realize would be that large. There are about a half-dozen things like that that are immediately, obviously useful.

Is it because of caching? If the context changes arbitrarily every turn then you would have to throw away the cache.

So use a block based cache and tune the block size to maximize the hit rate? This isn’t rocket science.

This seems misguided, you have to cache a prefix due to attention.
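For what it's worth, block caching and prefix caching are compatible if each block's cache key hashes the entire chain of blocks before it, which is roughly how vLLM's automatic prefix caching works. A minimal sketch (block size and hashing scheme are illustrative, not any engine's real implementation):

```python
import hashlib

BLOCK = 16  # tokens per cache block; tune to trade hit rate vs. granularity

def block_keys(tokens: list[int], block: int = BLOCK) -> list[str]:
    """Cache keys for each full block. Because attention is causal, a
    block's KV state depends on the entire prefix, so each key hashes
    the chain of blocks before it, not just the block's own tokens."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % block, block):
        h.update(str(tokens[i:i + block]).encode("utf8"))
        keys.append(h.copy().hexdigest())
    return keys

a = block_keys(list(range(64)))
b = block_keys(list(range(32)) + [999] * 32)
# The two sequences share a 32-token prefix, so their first two block keys
# match; keys diverge from the edit point onward, which is exactly why an
# arbitrary mid-context edit invalidates everything downstream of it.
```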

Bigger model wins as long as the quantization was done properly.

Not entirely true; it’s random access within the relevant subset of experts, and since concepts are clustered you actually have a much higher probability of repeatedly accessing the same subset of experts.

It’s called mixture of experts, but concepts don’t map cleanly, or even roughly, to different experts; otherwise you wouldn’t get a new expert on every token. You have to remember these were designed to improve throughput in cloud deployments, where different GPUs each load an expert. There you precisely want tokens routed to experts roughly uniformly at random, to improve your GPU utilization rate. I have not heard of anyone training local MoE models to aid sharding.
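The per-token routing under discussion is just a top-k gate over per-expert scores. A toy sketch with made-up sizes (real MoE routers are learned linear layers inside the transformer, not standalone code like this):

```python
import math
import random

N_EXPERTS, TOP_K, DIM = 8, 2, 16  # hypothetical sizes for illustration
random.seed(0)
router = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]

def route(token_embedding: list[float]) -> list[tuple[int, float]]:
    """Return top-k (expert_id, weight) pairs for one token. A fresh
    routing decision happens per token, which is why consecutive tokens
    can hit different experts even within a single topic."""
    logits = [sum(w * x for w, x in zip(row, token_embedding)) for row in router]
    top = sorted(range(N_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
    z = sum(math.exp(logits[i]) for i in top)  # softmax over chosen experts only
    return [(i, math.exp(logits[i]) / z) for i in top]

choice = route([random.gauss(0, 1) for _ in range(DIM)])
```

For local inference the practical upshot is that only `TOP_K` experts' weights are touched per token, but which ones can change every token.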

is there anywhere good to read/follow to get operational clarity on this stuff?

My current system of looking for the 1-in-1000 posts on HN or 1-in-100 on r/locallama is tedious.


Ask any of the models to explain this to you

Did he stutter?

Weimar Germany was very socially liberal, homosexuality was socially accepted, legal rights for women were the same as for men, and all of that definitely went away quite quickly.

That’s because it’s superstition.

Unless someone can come up with some kind of rigorous statistics on what the effect of this kind of priming is it seems no better than claiming that sacrificing your first born will please the sun god into giving us a bountiful harvest next year.

Sure, maybe this supposed deity really is this insecure and needs a jolly good pep talk every time he wakes up. Or maybe you’re just suffering from magical thinking that your incantations had any effect on the random-variable word machine.

The thing is, you could actually prove it: it’s an optimization problem, you have a model, you can generate the statistics. But no one, as far as I can tell, has been terribly forthcoming with that, either because those that have tried have decided to keep their magic spells secret, or because it doesn’t really work.

If it did work, well, the oldest trick in computer science is writing compilers; I suppose we will just have to write an English-to-pedantry compiler.
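Generating those statistics is straightforward in principle: run both prompt variants against the same eval suite and test whether the pass rates genuinely differ. A minimal sketch using a standard two-proportion z-test (the pass counts below are made up purely for illustration):

```python
import math

def two_proportion_z(passes_a: int, n_a: int, passes_b: int, n_b: int) -> float:
    """Z-statistic for whether variant B's pass rate differs from A's."""
    p_a, p_b = passes_a / n_a, passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical: plain prompt passes 52/100 evals, "pep talk" prompt 61/100.
z = two_proportion_z(52, 100, 61, 100)
# |z| < 1.96 means not significant at the 5% level, even though 61 > 52.
```

This is the kind of evidence the argument above is asking for, as opposed to anecdotes: a 9-point gap on 100 trials per arm isn't yet distinguishable from noise.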


I actually have a prompt optimizer skill that does exactly this.

https://github.com/solatis/claude-config

It’s based entirely on academic research, and a LOT of research has been done in this area.

One of the papers you may be interested in is on “emotion prompting”: e.g., “it is super important for me that you do X” actually works.

“Large Language Models Understand and Can be Enhanced by Emotional Stimuli”

https://arxiv.org/abs/2307.11760


Thanks for sharing! I've been gravitating towards this sort of workflow already - just seems like the right approach for these tools.

> If it did work, well, the oldest trick in computer science is writing compilers; I suppose we will just have to write an English-to-pedantry compiler.

"Add tests to this function" for GPT-3.5-era models was much less effective than "you are a senior engineer. add tests for this function. as a good engineer, you should follow the patterns used in these other three function+test examples, using this framework and mocking lib." In today's tools, "add tests to this function" results in a bunch of initial steps to look in common places to see if that additional context already exists, and then pull it in based on what it finds. You can see it in the output the tools spit out while "thinking."

So I'm 90% sure this is already happening on some level.


But can you see the difference if you only include "you are a senior engineer"? It seems like the comparison you're making is between "write the tests" and "write the tests following these patterns using these examples. Also btw you’re an expert. "

Today’s LLMs have had a ton of deep RL using git histories from more software projects than you’ve ever even heard of. Given the latency of a response, I doubt there’s any intermediate preprocessing; it’s just what the model has been trained to do.

> That’s because it’s superstition.

This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.

This is why we see a new Markdown format every week, "skills", "benchmarks", and other useless ideas, practices, and measurements. Consider just how many "how I use AI" articles are created and promoted. Most of the field runs on anecdata.

It's not until someone actually takes the time to evaluate some of these memes that they find little to no practical value in them.[1]

[1]: https://news.ycombinator.com/item?id=47034087


> This field is full of it. Practices are promoted by those who tie their personal or commercial brand to it for increased exposure, and adopted by those who are easily influenced and don't bother verifying if they actually work.

Oh, the blasphemy!

So, like VB, PHP, JavaScript, MySQL, Mongo, etc? :-)


The superstitious bits are more like people thinking that code goes faster if they use different variable names while programming in the same language.

And the horror is, once in a long while it is true. E.g. where perverse incentives cause an optimizing compiler vendor to inject special cases.


I suppose we will just have to write an English-to-pedantry compiler.

A common technique is to prompt your chosen AI to write a longer prompt that gets it to do what you want. It's used a lot in image generation. This is called "prompt enhancing".
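The mechanics of prompt enhancing amount to one extra round trip: wrap the terse prompt in a meta-prompt asking the model to expand it, then use the reply as the real prompt. A minimal sketch (the template wording is entirely illustrative, not any tool's actual meta-prompt):

```python
def enhancement_request(user_prompt: str) -> str:
    """Build a meta-prompt asking the model to rewrite a terse prompt
    into a detailed one. The model's reply replaces the original prompt."""
    return (
        "Rewrite the following image-generation prompt to be more detailed "
        "and specific. Keep the subject; add style, lighting, and "
        "composition details. Return only the rewritten prompt.\n\n"
        f"Prompt: {user_prompt}"
    )

meta = enhancement_request("a castle at dusk")
# `meta` is sent to the model; its reply is what actually gets submitted
# to the image generator.
```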


I think "understand this directory deeply" just gives more focus for the instruction. So it's like "burn more tokens for this phase than you normally would".

> I think there are finely-tuned social algorithms that we innately follow.

That would explain why I can’t do small talk, those are not innate to everyone.


Wasn’t to me either. It’s a learned skill that you can study and practice. I am an only child. About a decade ago I saw one of two ways to make more than my 2nd-tier-city enterprise dev wages (about $150K): either “grind LeetCode and work for a FAANG” (r/cscareerquestions) or go into customer-facing consulting, where I would be required to do the business dinners and small talk.

I chose the latter. At 45+, there is no age discrimination in consulting, and I still do hands-on-keyboard coding + cloud. Even before I got into consulting (working full time for consulting companies), I had roles inside companies where I interviewed with new-to-the-company directors/CTOs who were looking for someone who could get things done, not reverse a B-tree on a whiteboard. I had to learn how to talk. I haven’t had a coding interview since 2012, and I’ve worked for 6 companies since then.


I mean, you can see LLMs shooting themselves in the foot with C++ right now; just ask them to write it.

Crowdstrike level shot in the foot.

They’re using React, they are very opaque, and they don’t want you to use any other mechanism to interact with their model. They haven’t left people a lot of room to trust them.
