I'm interested in what stopped you from finishing diffs and diff based editing. I built an AI software engineering assistant at my last company and we got decent results with Aider's method (and prompts, and hidden conversation starter etc). I did have to have a fallback to raw output, and a way to ask it to try again. But for the most part it worked well and unlocked editing large files (and quickly).
Excellent question! We just didn't have the resources at the time on our small team to invest in getting it to be good enough to be default on. We had to move on to other more core platform features.
Though I'm really eager to get back to it. When using Windsurf last week, I was impressed by their diffs on Sonnet. Seems like they work well. I would love to view their system prompt!
I hope that when we have time to resume work on this (maybe in Feb) that we'll be able to get it done. But then again, maybe just patience (and more fast-following) is the right strategy, given how fast things are moving...
An interesting alternative to diffs appears to be straightforward find and replace.
Claude Artifacts uses that: they have a tool where the LLM can say "replace this exact text with this" to update an Artifact without having to output the whole thing again.
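Mechanically, that kind of tool can be tiny. Here's a minimal sketch in Python; the tool name and schema are my own guesses, not Anthropic's actual Artifacts implementation:

```python
def apply_exact_replace(document: str, old_text: str, new_text: str) -> str:
    """Swap one exact snippet for another instead of regenerating the document."""
    if old_text not in document:
        raise ValueError("old_text not found; ask the model to try again")
    return document.replace(old_text, new_text, 1)

# Hypothetical tool definition the LLM would be handed:
UPDATE_TOOL = {
    "name": "update_artifact",
    "description": "Replace an exact snippet of the current artifact with new text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "old_text": {"type": "string"},
            "new_text": {"type": "string"},
        },
        "required": ["old_text", "new_text"],
    },
}
```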
I think this is going to be the answer eventually.
Once one of the AI companies figures out a decent (probably treesitter-based) language to express code selections and code changes in, and then trains a good model on it, they're going to blow everyone else out of the water.
This would help with "context management" tremendously, as it would let the LLM ask for things like "all functions that are callers of this function", without having to load in entire files. Some simpler refactorings could also be performed by just writing smart queries.
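As a toy illustration of what such a query could return, here is the "all callers of this function" idea using Python's ast module as a stand-in for tree-sitter; the function name and overall shape are made up:

```python
import ast

def callers_of(source: str, target: str) -> list[str]:
    """Names of functions whose bodies contain a call to `target`."""
    callers = []
    for fn in ast.walk(ast.parse(source)):
        if isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            calls = (n for n in ast.walk(fn) if isinstance(n, ast.Call))
            if any(isinstance(c.func, ast.Name) and c.func.id == target for c in calls):
                callers.append(fn.name)
    return callers

code = """
def fetch(url): ...
def sync(): fetch("a")
def report(): print("no calls here")
def refresh(): fetch("b")
"""
print(callers_of(code, "fetch"))  # ['sync', 'refresh']
```

The point is that the model only ever sees the handful of functions the query returns, not the whole file.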
Oh that is super interesting! I wonder if they track how often it succeeds in matching and replacing, I'd love to see those numbers in aggregate.
Total anecdote, but I worked on this for a bit for a code editor aimed at research-level code (system paper to come soon, fingers crossed!) and found that basic find-and-replace was pretty brittle. I also had to be confident the source text appeared only once (not always the case for my use case), and there was a tradeoff between how fuzzy the match was and how likely it was to land on the genuinely correct source.
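To make that tradeoff concrete, a difflib-based locator along these lines captures the shape of it; the 0.9 cutoff is an arbitrary illustration, not a number from any real system:

```python
import difflib

def find_fuzzy_block(lines: list[str], block: list[str], cutoff: float = 0.9):
    """Return (start_index, score) of the best-matching window, or None."""
    best = None
    for start in range(len(lines) - len(block) + 1):
        window = "\n".join(lines[start:start + len(block)])
        score = difflib.SequenceMatcher(None, window, "\n".join(block)).ratio()
        if best is None or score > best[1]:
            best = (start, score)
    # Below the cutoff, refuse to guess and ask the model to retry instead.
    return best if best and best[1] >= cutoff else None
```

Loosen the cutoff and you anchor more of the model's slightly-mangled "original" blocks, but you also raise the odds of editing the wrong spot.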
But yeah, diffs are super hard because the format requires long-range context and accurate arithmetic (getting line numbers and hunk counts exactly right).
Ultimately, the version of this that worked the best for me was a total hack:
Prefix every line of the code with L#### (the line number). Ask for diffs as the original text plus the complete replacement text, keeping the line-number prefix on both. Then, to apply, fuzzy match on both line number and context.
I suspect this worked as well as it did because it transmutes the math and computation problems into pattern-matching and copying problems, which LLMs are (still) much better at these days.
I suspect any other "hook" would work just as well, e.g. a comment containing a nonce, which could also serve as block boundaries to make changes more likely to be complete.
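A compressed sketch of that hack (the exact prefix format, search radius, and matching details here are illustrative; in particular this version matches the nearby context exactly rather than fuzzily, for brevity):

```python
def number_lines(text: str) -> str:
    """Prefix each line with L#### so the model can copy the anchors back verbatim."""
    return "\n".join(f"L{i:04d} {line}" for i, line in enumerate(text.splitlines(), 1))

def apply_numbered_edit(lines: list[str], original: list[str], replacement: list[str]) -> list[str]:
    """`original` and `replacement` are numbered lines like 'L0042 some code'."""
    hint = int(original[0][1:5]) - 1               # copied line number is only a hint
    wanted = [l[6:] for l in original]             # strip the 'L0042 ' prefixes
    for offset in sorted(range(-5, 6), key=abs):   # search nearby in case numbers drifted
        start = hint + offset
        if 0 <= start <= len(lines) - len(wanted) and lines[start:start + len(wanted)] == wanted:
            return lines[:start] + [l[6:] for l in replacement] + lines[start + len(wanted):]
    raise ValueError("could not anchor the edit; ask the model to retry")
```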
This is actually a very powerful pattern that everybody building with LLMs should pay attention to, especially when combined with structured outputs (AKA JSON mode).
If you want an LLM to refer to a specific piece of text, give each one an ID and then work with those IDs.
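A bare-bones version of that pattern; the field names and prompt wording are illustrative, not from any particular vendor's API:

```python
import json

# Give every chunk a stable ID, show the IDs to the model, and constrain the
# reply (via structured outputs / JSON mode) to refer to chunks only by ID.
chunks = {
    "C0": "def add(a, b): return a - b",   # bug the model should fix
    "C1": "def mul(a, b): return a * b",
}

prompt = (
    "Here are code chunks, each with an ID:\n"
    + "\n".join(f"[{cid}] {text}" for cid, text in chunks.items())
    + '\nReply with JSON: a list of {"id": ..., "replacement": ...} objects.'
)

# Pretend this came back from a JSON-constrained model call:
model_output = '[{"id": "C0", "replacement": "def add(a, b): return a + b"}]'

for edit in json.loads(model_output):
    chunks[edit["id"]] = edit["replacement"]   # applying edits is now a dict lookup
```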
Aider actually prompts the LLM to use search/replace blocks rather than actual diffs, and then has a bunch of regex, fuzzy-search, and indent-fixing code to handle inconsistent responses.
Aider's author has a bunch of benchmarks and found this to work best with modern models.
What we found was that error handling on the client side was also very important. There's a bunch of that in Aider too for inspiration. Fuzzy search, indent fixing, that kind of stuff.
And also just to clarify: Aider landed on search/replace blocks for GPT-4o and Claude rather than actual diffs. We followed suit, and then we showed those edits in a diff UI client-side.
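For anyone curious, the blocks look roughly like `<<<<<<< SEARCH` / `=======` / `>>>>>>> REPLACE` markers (check Aider's docs and prompts for the real thing). Here's a toy version of that format plus the kind of tolerant client-side application being described; none of this is Aider's actual code:

```python
import re

BLOCK_RE = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)

def apply_blocks(text: str, llm_response: str) -> str:
    for search, replace in BLOCK_RE.findall(llm_response):
        if search in text:
            text = text.replace(search, replace, 1)
            continue
        # Toy fallback: indentation-insensitive match for single-line searches,
        # re-applying the file's own leading whitespace to the replacement.
        for line in text.splitlines():
            if line.strip() and line.lstrip() == search.strip():
                indent = line[: len(line) - len(line.lstrip())]
                text = text.replace(line, indent + replace.strip(), 1)
                break
        else:
            raise ValueError("block did not match; ask the model to retry")
    return text
```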