
I wrote a bit about this the other day: https://simonwillison.net/2025/Jun/27/context-engineering/

Drew Breunig has been doing some fantastic writing on this subject - coincidentally at the same time as the "context engineering" buzzword appeared but actually unrelated to that meme.

How Long Contexts Fail - https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-ho... - talks about the various ways in which longer contexts can start causing problems (also known as "context rot")

How to Fix Your Context - https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.... - gives names to a bunch of techniques for working around these problems including Tool Loadout, Context Quarantine, Context Pruning, Context Summarization, and Context Offloading.



Drew Breunig's posts are a must-read on this. This is not only important for writing your own agents; it is also critical when using agentic coding tools right now. These limitations/behaviors will be with us for a while.


They might be good reads on the topic but Drew makes some significant etymological mistakes. For example loadout doesn't come from gaming but military terminology. It's essentially the same as kit or gear.


Drew isn't using that term in a military context, he's using it in a gaming context. He defines what he means very clearly:

> The term “loadout” is a gaming term that refers to the specific combination of abilities, weapons, and equipment you select before a level, match, or round.

In the military you don't select your abilities before entering a level.


The military definitely does use the term loadout. It can be based on mission parameters, e.g. if armored vehicles are expected, your loadout might include more MANPATS. It can also refer to the way each soldier might customize their gear, e.g. a cutaway knife in the boot or on the vest, or NODs if extended night operations are expected (I know, I know, gamers would like to think you'd bring everything, but in real life no warfighter wants to carry extra weight unnecessarily!), or even the placement of gear on their MOLLE vest (all that velcro has a reason).


Nobody is disputing that. We are saying that the statement "The term 'loadout' is a gaming term" can be true at the same time.


I think that software engineers using this terminology might be envisioning themselves as generals, not infantry :)


>Drew makes some significant etymological mistakes. For example loadout doesn't come from gaming but military terminology

Does he claim to give the etymology and ultimate origin of the term, or just where he and other AI discussions found it? Because if it's the latter, he is entitled to call it a "gaming" term, because that's what it is to him and to those in the discussion. He didn't find it in some military manual or learn it at boot camp!

But I would mostly challenge the idea that this mistake, if we admit it as such, is "significant" in any way.

The origin of loadout is totally irrelevant to the point he makes and the subject he discusses. It's just a useful term he adopted; its history is not really relevant.


This seems like a rather unimportant type of mistake, especially because the definition is still accurate; it's just that the etymology isn't complete.


It _is_ a gaming term - it is also a military term (from which the gaming term arose).


> They might be good reads on the topic but Drew makes some significant etymological mistakes. For example loadout doesn't come from gaming but military terminology. It's essentially the same as kit or gear.

Doesn't seem that significant?

Not that those blog posts say much that any "prompt engineer" (someone who uses LLMs frequently) doesn't already know anyway, but maybe they are useful to some at such an early stage of these things.


this is textbook pointless pedantry. I'm just commenting to find it again in the future.


Click on the 'time' part of the comment header, then you can 'favorite' the comment. That way you can avoid adding such comments in the future.


For visual art, I feel that the existing approaches to context engineering are very much lacking. An AI understands well enough such simple things as content (bird, dog, owl, etc.), color (blue, green, etc.), and has a fair understanding of foreground/background. However, the really important stuff is not addressed.

For example: in form, things like negative shape and overlap. In color contrast, things like ratio contrast and dynamic range contrast. Or how manipulating neighboring regional contrast produces tone wrap. I could go on.

One reason for this state of affairs is that artists and designers lack the consistent terminology to describe what they are doing (though this does not stop them from operating at a high level). Indeed, many of the terms I have used here we (my colleagues and I) had to invent ourselves. I would love to work with an AI guru to address this developing problem.


> artists and designers lack the consistent terminology to describe what they are doing

I don't think they do. It may not be completely consistent, but open any art book and you find the same thing being explained again and again. Just for drawing humans, you will find emphasis on the skeleton and muscle volume for forms and poses, planes (especially the head) for values and shadows, some abstract things like stability and line weight, and some more concrete things like foreshortening.

Several books and courses have gone over those concepts. They are not difficult to explain, they are just difficult to master. That's because you have to apply judgement to every single line or brush stroke, deciding which factors matter most and whether you even want to make the stroke. Then there's the whole hand-eye coordination aspect.

So unless you can solve judgement (which styles derive from), there's not a lot of hope there.

ADDENDUM

And when you do a study of another's work, it's not about copying the data, extracting colors, or comparing labels... it's about studying judgement. You know the complete formula from which a more basic version is being used for the work, and you only want to know the parameters. Machine training, by contrast, is mostly going for the wrong formula with completely different variables.


I concur that there is, on some matters, general agreement in art books. However, it certainly does not help that there is so much inconsistency of terminology: for example, the way that hue and color are so frequently used interchangeably, and likewise lightness, brightness, tone and value.

What bothers me more is that so much truly important material is not being addressed as explicitly as it should be. For example: the exaggeration of contrast on which so much art relies exists in two dimensions: increase of difference and decrease of difference.

This application of contrast/affinity is a general principle that runs through the entirety of art. Indeed, I demonstrate it to my students by showing its application in Korean TV dramas. The only explicit mention I can find of this in art literature is in the work of Ruskin, nearly 200 years ago!

Even worse is that so much very important material is not being addressed at all. For example, a common device that painters employ is to configure the neighboring regional contrast of a form so that it is light against dark on one edge and dark against light on the opposing edge. In figurative paintings and in classic portrait photography this device is almost ubiquitous, yet as far as I am able to determine no one has named it or even written about it. We were obliged to name it ourselves (tone wrap).

> They are not difficult to explain, they are just difficult to master.

Completely agree that they can be difficult to master. However, a thing cannot be satisfactorily explained unless there is consistent (or even existent) terminology for that thing.

> So unless you can solve judgement (which styles derive from)

Nicely put.


> For example, a common device that painters employ is to configure the neighboring regional contrast of a form so that it is light against dark on one edge and dark against light on the opposing edge.

I'm not fully sure what you mean. If we take the following example, are you talking about the neck and the collar of the girl?

https://i.pinimg.com/originals/ea/70/0b/ea700b6a0b366c13187e...

https://fr.pinterest.com/pin/453596993695189968/

I think the name of the concept is "edge control" (not really original). You can find some explanation here

https://www.youtube.com/watch?v=zpSlGmbUB08

To keep it short, there's no line in reality. So while you can use them when sketching, they are pretty crude, kinda like a piano with only 2 keys. The better tool is edges, meaning the delimitation between two contrasting areas. If you're doing grayscale, your areas are values (light and shadow) and it's pretty easy. Once you add color, there are more dimensions to play with and it becomes very difficult (warm and cold colors, atmospheric colors, brush strokes that give the illusion of detail, ...).

Again, this falls under the things that are easy to explain but take a while to learn to observe and even longer to reproduce.

There's a book called "Color and Light" by James Gurney that goes in depth on all of this. There are a lot of parameters that go into a brush stroke in a specific area of a painting.


> I'm not fully sure what you mean. If we take the following example, are you talking about the neck and the collar of the girl?

Yes... that's exactly it. It is also described in our teaching material here (halfway down the page):

https://rmit.instructure.com/courses/87565/pages/structural-...

Rembrandt was an avid user of this technique. In his portraits, one little trick he almost always used was to ensure that there was no edge contrast whatsoever in at least one region, usually located near the bottom of the figure. This served to blend the figure into the background and avoid the flat effect that would have happened had he not used it. In class I call this 'edge loss'. An equivalent in drawing is the notion of 'open lines' whereby silhouette lines are deliberately left open at select points.

> I think the name of the concept is "edge control" (not really original). You can find some explanation here.

I am aware of the term 'edge control' though I have not heard it used in this context. I feel that the term is too general to describe what is happening in the (so-called) tone wrap.

To extend the principle, wrap is an important concept in spatial rendering (painting, photography, filmmaking etc) and is a cousin of overlap. Simply... both serve to enhance form.

> To keep it short, there's no line in reality.

True that. I learned a lot about lines from reading about non-photorealistic rendering in 3D. There are some great papers on this subject (below) though I feel there is still work to be done.

Cole, Forrester, et al. "How well do line drawings depict shape?." ACM SIGGRAPH 2009 papers. 2009. 1-9.

Cole, Forrester, et al. "Where do people draw lines?." ACM SIGGRAPH 2008 papers. 2008. 1-11.

I made a stab at summarizing their wisdom here:

https://rmit.instructure.com/courses/87565/pages/drawing-lin...

> There's a book called "Color and Light" by James Gurney that goes in depth on all of this. There are a lot of parameters that go into a brush stroke in a specific area of a painting.

Looking at it now. Any writer who references the Hudson River School is a friend of mine.


I'm surprised there isn't already an ecosystem of libraries that just do this. When building agents you either have to roll your own or copy an algorithm out of some article.

I'd expect this to be a lot more plug and play, and as swappable as LLMs themselves by EOY, along with a bunch of tooling to help with observability, A/B testing, cost and latency analysis (since changing context kills the LLM cache), etc.


Or maybe it's that each of these things is pretty simple in itself. Clipping context is one line of code, summarizing could be a couple of lines to have an LLM summarize it for you, etc. So none of it is substantial enough for a formal library, whereas the combination of these techniques is very application-dependent, so not reusable enough to warrant separating out as an independent library.
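For illustration, here's a minimal sketch of what "a couple of lines" looks like in practice (the function names are mine, and the OpenAI client and model are just assumptions for the example, not a recommendation):

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  def clip_context(messages, max_messages=20):
      # "Clipping" is just keeping the most recent turns.
      return messages[-max_messages:]

  def summarize_older_turns(messages, model="gpt-4o-mini"):
      # Compress older turns into one message that can replace them.
      transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
      response = client.chat.completions.create(
          model=model,
          messages=[{"role": "user",
                     "content": "Summarize this conversation in a few bullet points:\n" + transcript}],
      )
      return {"role": "system",
              "content": "Summary of earlier conversation:\n" + response.choices[0].message.content}

The application-dependent part is deciding when to call these and what to keep verbatim, which is exactly the bit that resists being packaged as a library.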

Or maybe it just hasn't matured yet and we'll see more of it in the future. We'll see.


Though in a way, this feels similar to things like garbage collection, disk defragmentation, or even query planning. Yes, you could build libraries that do these sorts of things for you, but in all likelihood the LLM providers will embed custom-built versions of them that have been battle-tested and trained thoroughly to interop well with the corresponding LLM. So while there could still be an ecosystem, it would likely be a fairly niche thing for very specific use cases or home-grown LLMs.

Maybe something like the equivalent of AWS Firecracker for whatever the equivalent of AWS Lambda is in the future LLM world.


“A month-long skill” after which it won’t be a thing anymore, like so many others.


Most of the LLM prompting skills I figured out ~three years ago are still useful to me today. Even the ones that I've dropped are useful because I know that things that used to be helpful aren't helpful any more, which helps me build an intuition for how the models have improved over time.


I agree with you, but I would echo OP's concern, in a way that makes me feel like a party pooper but is open about what I see us all expressing squeamishness about.

It is somewhat bothersome to have another buzz phrase. I don't know why we are doing this, other than that there was a Xeet from the Shopify CEO, QT'd approvingly by Karpathy, then it's written up at length and tied to another set of blog posts.

To wit, it went from "buzzphrase" to "skill that'll probably be useful in 3 years still" over the course of this thread.

Has it even been a week since the original tweet?

There doesn't seem to be a strong foundation here, but due to the reach of the names involved, and their insistence on this being a thing while also indicating they're sheepish that it is a thing, it will now be a thing.

Smacks of a self-aware version of Jared Friedman's tweet about how watching the invention of "Founder Mode" was like a startup version of the Potsdam Conference (which sorted out Earth post-WWII, and he was not kidding). I could not even remember the phrase for the life of me. Lasted maybe 3 months?


Sometimes buzzwords turn out to be mirages that disappear in a few weeks, but often they stick around.

I find they take off when someone crystallizes something many people are thinking about internally and don't realize everyone else is having similar thoughts. In this example, I think the way agent and app builders are wrestling with LLMs is fundamentally different from how chatbot users interact with them (it's closer to programming), and this phrase resonates with that crowd.

Here’s an earlier write up on buzzwords: https://www.dbreunig.com/2020/02/28/how-to-build-a-buzzword....


I agree - what distinguishes this is how rushed and self-aware it is. It is being pushed top down, sheepishly.

EDIT: Ah, you also wrote the blog posts tied to this. It gives zero comfort that you have a blog post re: building buzz phrases in 2020; rather, it enhances the awkward, inorganic rush people are self-aware of.


I studied linguistic anthropology, in addition to CS. Been at it since 2002.

And I wrote the first post before the meme.


I've read these ideas a thousand times; I thought this was the most beautiful core of the "Sparks of AGI" paper (section 6.2).

We should be able to name the source of this sheepishness and have fun with the fact that we are all things at once: you can be a viral-hit 2002 super-PhD with expertise in all areas involved in this topic who has brought pop attention onto something important, and yet the hip topic you feel centered on can also make people's eyes roll temporarily. You're doing God's work. The AI = F(C) thing is really important. It's just that, in the short term, it will feel like a buzzword.

This is much more about me playing with what we can reduce to the "get off my lawn!" take. I felt it was interesting to voice because it is a consistent undercurrent in the discussion and also leads to observable absurdities when trying to describe it. It is not questioning you, your ideas, or your work. It has just come about at a time when things become hyperreal hyperquickly and I am feeling old.


The way I see it we're trying to rebrand because the term "prompt engineering" got redefined to mean "typing prompts full of stupid hacks about things like tipping and dead grandmas into a chatbot".


It helps that the rebrand may lead some people to believe that there are actually new and better inputs into the system rather than just more elaborate sandcastles built in someone else's sandbox.


If that's what it takes to make good results, then it's respectable work even if the details are stupid.


While researching the above posts Simon linked, I was struck by how many of these techniques came from the pre-ChatGPT era. NLP researchers have been dealing with this for a while.


Many people figured this out two or three years ago, when AI-assisted coding basically wasn't a thing, and it's still relevant and will stay relevant. These are fundamental principles; all big models work similarly, not just transformers and not just LLMs.

However, many fundamental phenomena are missing from the "context engineering" scope, so neither context engineering nor prompt engineering is a useful term.


What month-long AI skills from 2023 are obsolete now, exactly?

Surely not prompt engineering itself, for example.


Persona prompting. (Unless the persona is the point as in role-playing.)


If you're not writing your own agents, you can skip this skill.


Are you sure? Looking forward, AI is going to be so pervasively used that understanding what information to provide as input will be a general skill. As for what we've been calling "prompt engineering" - the better practitioners were actually doing context engineering.


If you're doing context engineering, you're writing an agent. It's mostly not the kind of stuff you can do from a web chat textarea.


Those issues are considered artifacts of the current crop of LLMs in academic circles; there is already research allowing LLMs to use millions of different tools at the same time and to maintain stable long contexts, likely reducing the number of agents to one for most use cases outside of interfacing with different providers.

Anyone basing their future agentic systems on current LLMs would likely face the LangChain fate - built for GPT-3, made obsolete by GPT-3.5.


Can you link to the research on millions of different tools and stable long contexts? I haven't come across that yet.


You can look at AnyTool, 2024 (16,000 tools) and start looking at newer research from there.

https://arxiv.org/abs/2402.04253

For long contexts, start with activation beacons and RoPE scaling.
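For anyone wondering what "RoPE scaling" means concretely: the simplest form (linear position interpolation) just divides the position indices before computing the rotary angles, so a longer context is squeezed into the position range the model saw during training. A rough numpy sketch of the angle computation, not tied to any particular implementation:

  import numpy as np

  def rope_angles(positions, dim, base=10000.0, scale=1.0):
      # Standard RoPE frequencies: theta_i = base^(-2i/dim).
      inv_freq = base ** (-np.arange(0, dim, 2) / dim)
      # Linear RoPE scaling ("position interpolation"): divide positions
      # by `scale`, e.g. scale=4.0 to stretch a model trained at 8k out to 32k.
      return np.outer(np.asarray(positions) / scale, inv_freq)

  angles = rope_angles(np.arange(32768), dim=128, scale=4.0)  # shape (32768, 64)

Activation beacons take a different route: instead of stretching the position encoding, the model learns to compress past activations into a small number of "beacon" tokens.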


I would classify AnyTool as a context engineering trick. It's using GPT-4 function calls (what we would call tool calls today) to find the best tools for the current job based on a 3-level hierarchy search.

Drew calls that one "Tool Loadout" https://www.dbreunig.com/2025/06/26/how-to-fix-your-context....


So great. We have not one, but two different ways of saying "use text search to find tools".

This field, I swear...it's the PPAP [1] of engineering.

[1] https://www.youtube.com/watch?v=NfuiB52K7X8

"I have a toool... I have a seeeeearch... unh! Now I have a Tool Loadout!" *dances around in leopard print pyjamas*


RoPE scaling is not an ideal solution, since all LLMs in general start degrading at around 8k tokens. You also have the problem of cost when yolo'ing a long context per task turn, even if the LLM were capable of crunching 1M tokens. If you self-host, then you have the problem of prompt processing time. So it doesn't matter in the end if the problem is solved and we can invoke any number of tools per task turn; it will be a quick way to become poor as long as providers are charging per token. The only viable solution is to use a smart router so only the relevant tools and their descriptions are appended to the context per task turn.
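A "smart router" can be as simple as embedding the tool descriptions and, per task turn, only appending the top-k most similar to the user's request. Rough sketch, with embed() left as a placeholder for whatever embedding model you use:

  import numpy as np

  def embed(text: str) -> np.ndarray:
      # Placeholder: return a unit-normalised embedding from your model of choice.
      raise NotImplementedError

  def select_tools(user_message, tools, k=5):
      # tools: list of dicts with "name" and "description" keys.
      # In real code, embed the descriptions once and cache them.
      query = embed(user_message)
      scored = [(float(np.dot(query, embed(t["description"]))), t) for t in tools]
      scored.sort(key=lambda pair: pair[0], reverse=True)
      return [tool for _, tool in scored[:k]]  # only these go into this turn's context

That's essentially what gets called "Tool Loadout" elsewhere in the thread.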


Thanks for the link. It finally explained why I was getting hit up by recruiters for a job at a data broker looking to do what seemed like silly things with this.

Cloud API recommender systems must seem like a gift to that industry.

Not my area anyway, but I couldn't see a profit model for a human search for an API when what they wanted is well covered by most core libraries in Python, etc.


How would "a million different tool calls at the same time" work? For instance, MCP is HTTP based, even at low latency in incredibly parallel environments that would take forever.


There's a difference between discovery (asking an MCP server what capabilities it has) and use (actually using a tool on the MCP server).

I think the comment you're replying to is talking about discovery rather than use; that is, offering a million tools to the model, not calling a million tools simultaneously.
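Concretely, discovery and use are separate MCP requests (sketched here as Python dicts of the JSON-RPC payloads; the tool name is made up):

  # Discovery: ask the server what tools it exposes. Done once, cacheable.
  list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

  # Use: actually invoke a single tool with arguments. One request per call.
  call_request = {
      "jsonrpc": "2.0",
      "id": 2,
      "method": "tools/call",
      "params": {
          "name": "search_issues",              # hypothetical tool name
          "arguments": {"query": "context rot"},
      },
  }

So "a million tools" really means a million entries in the (possibly paginated and cached) discovery results, not a million HTTP round trips per prompt.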


HTTP is an implementation detail, and doesn't represent any kind of unavoidable bottleneck vs. any other transport protocol one might use to do these kinds of request/response interactions.


It wouldn't. There is a difference between theory and practicality. Just because we could doesn't mean we should, especially when costs per token are considered. Capability and scale are often at odds.


MCPs aren't the only way to embed tool calls into an LLM


Doesn't change the argument.


It obviously does.


It does not. Context is context no matter how you process it. You can configure tools without MCP or with it. No matter. You still have to provide that as context to an LLM.


If you're using native tool calls and not MCP, the latency of calls is a nonfactor; that was the concern raised by the root comment.


yes, but those aren’t released and even then you’ll always need glue code.

you just need to knowingly resource what glue code is needed, and build it in a way that can scale with whatever new limits upgraded models give you.

i can’t imagine a world where people aren’t building products that try to overcome the limitations of SOTA models


My point is that newer models will have those baked in, so instead of supporting ~30 tools before falling apart they will reliably support 10,000 tools defined in their context. That alone would dramatically change the need for more than one agent in most cases, as the architectural split into multiple agents is often driven by the inability to reliably run many tools within a single agent. You can hack around it today by turning tools on/off depending on the agent's state, but at some point in the future you might be able to afford not to bother and just dump all your tools into a long, stable context, maybe cache it for performance, and that will be it.


There will likely be custom, large, and expensive models at an enterprise level in the near future (some large entities and governments already have them (niprgpt)).

With that in mind, what would be the business sense in siloing a single "Agent" instead of using something like a service discovery service that all benefit from?


My guess is that the main issues are latency and accuracy; a single agent, without all the routing/evaluation sub-agents around it that introduce cumulative errors, lead to infinite loops, and slow it down, would likely be much faster and more accurate, and it could be cached at the token level on a GPU, reducing token preprocessing time further. Different companies would run different "monorepo" agents, and those would need something like MCP to talk to each other at the business boundary, but internally all this won't be necessary.

Also, current LLMs still have too many issues because they are autoregressive and heavily biased towards the first few generated tokens. They also still don't have full bidirectional awareness of certain relationships due to how they are masked during training. Discrete diffusion looks interesting, but I am not sure how that class of models deals with tools, as I've never seen one of them use any.


> already research allowing LLMs to use millions of different tools

Hmm first time hearing about this, could you share any examples please?



Providing context makes sense to me, but do you have any examples of providing context and then getting the AI to produce something complex? I am quite a proponent of AI, but even I find myself failing to produce significant results on complex problems, even when I have clone + memory bank, etc. It ends up being a time sink of trying to get the AI to do something, only to have me eventually take over and do it myself.


Quite a few times, I've been able to give it enough context to write me an entire working piece of software in a single shot. I use that for plugins pretty often, e.g. this:

  llm -m openai/o3 \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers issue:org/repo/123 which fetches that issue
      number from the specified github repo and uses the same markdown logic as the HTML page to turn that into a fragment'
Which produced this: https://gist.github.com/simonw/249e16edffe6350f7265012bee9e3...


I had a series of prompts like "Using Manim, create an animation of formula X rearranging into formula Y, with a graph of values of the function".

Beautiful one-shot results, and I now have really nice animations of some complex maths to help others understand. (I'll put them up on YouTube soon.)

I don't know the Manim library at all, so this saved me about a week of work learning and implementing it.


First, you pay a human artist to draw a pelican on a bicycle.

Then, you provide that as "context".

Next, you prompt the model.

Voila!


Oh, and don't forget to retain the artist to correct the ever-increasingly weird and expensive mistakes made by the context when you need to draw newer, fancier pelicans. Maybe we can just train product to draw?


This hits too close to home.


How to draw an owl.

1. Draw some circles.

2. Prompt an AI to draw the rest of the fucking owl.


And then the AI doesn’t handle the front end caching properly for the 100th time in a row so you edit the owl and nothing changes after you press save.




Hire a context engineer to define the task of drawing an owl as drawing two owls.


From the first link: "Read large enough context to ensure you get what you need."

How does this actually work, and how can one better define this to further improve the prompt?

This statement feels like the 'draw the rest of the fucking owl' referred to elsewhere in the thread


I'm not sure how you ended up on that page... my comment above links to https://simonwillison.net/2025/Jun/27/context-engineering/

The "Read large enough context to ensure you get what you need" quote is from a different post entirely, this one: https://simonwillison.net/2025/Jun/30/vscode-copilot-chat/

That's part of the system prompts used by the GitHub Copilot Chat extension for VS Code - from this line: https://github.com/microsoft/vscode-copilot-chat/blob/40d039...

The full line is:

  When using the {ToolName.ReadFile} tool, prefer reading a
  large section over calling the {ToolName.ReadFile} tool many
  times in sequence. You can also think of all the pieces you
  may be interested in and read them in parallel. Read large
  enough context to ensure you get what you need.
That's a hint to the tool-calling LLM that it should attempt to guess which area of the file is most likely to include the code that it needs to review.

It makes more sense if you look at the definition of the ReadFile tool:

https://github.com/microsoft/vscode-copilot-chat/blob/40d039...

  description: 'Read the contents of a file. Line numbers are
  1-indexed. This tool will truncate its output at 2000 lines
  and may be called repeatedly with offset and limit parameters
  to read larger files in chunks.'
The tool takes three arguments: filePath, offset and limit.
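So reading a big file in chunks looks roughly like this (the call format is schematic and the file path is invented; only the three parameter names come from the tool definition):

  # First call: read up to the 2000-line truncation limit from the top.
  {"tool": "ReadFile", "arguments": {"filePath": "src/app.ts", "offset": 1, "limit": 2000}}

  # Follow-up call: continue where the previous read left off.
  {"tool": "ReadFile", "arguments": {"filePath": "src/app.ts", "offset": 2001, "limit": 2000}}

The system prompt hint is nudging the model to pick a generous limit (or read several regions in parallel) rather than making many tiny sequential reads.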


So who will develop the first Logic Core that automates the context engineer?


The first rule of automation: that which can be automated will be automated.

Observation: this isn't anything that can't be automated.


Rediscovering encapsulation



