I don't want a diff algorithm that finds the mathematically smallest edit distance; I want one that preserves meaningful chunks of code. I never delete the closing } from one function and then the entire following function except its last }.
This 1983 paper is still relevant today; it's an algorithm much better suited for code diffs, and I'd like to use it instead. It's the best algorithm for big code refactors. I implemented it in Go and hope to get back to it.
Makes for amazing demos. Might someday be what we're all using. My hopes have been dampened severely by how long this has taken to make any headway. (And, honestly, it couples your diff tool to the rest of your toolchain rather heavily. I can see why it has a hard time taking off.)
I think you could get 90% of the utility of a semantic syntax-tree diff with a standard text diff plus some simple rules.
For example, preferring to split on empty (whitespace-only) lines. Almost all languages and coding styles use empty lines as a "semantic separator" marker.
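As a rough sketch of that rule (not anything an existing diff tool actually does), here's what splitting input into blank-line-delimited blocks before diffing could look like in Go; the regex and the example input are purely illustrative:

```go
package main

import (
	"fmt"
	"regexp"
)

// blankLine matches a line that is empty or contains only whitespace.
var blankLine = regexp.MustCompile(`(?m)^[ \t]*\n`)

// splitOnBlankLines chops source text into blocks separated by blank lines,
// so a block-level diff tends to align with the author's own grouping.
func splitOnBlankLines(src string) []string {
	return blankLine.Split(src, -1)
}

func main() {
	src := "func a() {\n\treturn\n}\n\nfunc b() {\n\treturn\n}\n"
	for i, block := range splitOnBlankLines(src) {
		fmt.Printf("block %d:\n%s\n", i, block)
	}
}
```

A real tool would diff the block lists first and only then diff lines within matched blocks, so a hunk rarely straddles one of those separators.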
Git uses Myers diff by default, but recently added heuristics to "shift" hunk boundaries in semantically meaningful ways. See https://github.com/mhagger/diff-slider-tools for the experiments that led to this feature.
It also supports a few other diff algorithms (`--patience` and `--histogram`), but in my experience they produce the same output as Myers in most cases.
1) convert code sample 1 to AST
2) convert code sample 2 to AST
3) do text-based diffs while paying attention to the ASTs
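A rough sketch of steps 1–3 in Go, using the standard go/parser package: parse both samples, chop each into per-declaration text blocks, and leave step 3 (the block-aware text diff) as a placeholder. The srcA/srcB samples are made up for illustration:

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
)

const srcA = `package demo

func Foo() int { return 1 }

func Bar() int { return 2 }
`

const srcB = `package demo

func Foo() int { return 1 }

func Baz() int { return 3 }
`

// declBlocks parses src and returns the source text of each top-level
// declaration, so a later text diff can work on whole functions rather
// than on raw lines that happen to look alike (such as lone "}" lines).
func declBlocks(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "sample.go", src, 0)
	if err != nil {
		return nil, err
	}
	var blocks []string
	for _, decl := range file.Decls {
		start := fset.Position(decl.Pos()).Offset
		end := fset.Position(decl.End()).Offset
		blocks = append(blocks, src[start:end])
	}
	return blocks, nil
}

func main() {
	a, _ := declBlocks(srcA)
	b, _ := declBlocks(srcB)
	// Step 3 would diff a against b block by block (and line by line inside
	// matched blocks); here we only print the blocks to show the chopping.
	for i, blk := range a {
		fmt.Printf("A[%d]: %s\n", i, blk)
	}
	for i, blk := range b {
		fmt.Printf("B[%d]: %s\n", i, blk)
	}
}
```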
A closing brace is really part of the function started by the function definition and its opening brace. As such, the closing brace of function X should never "match" the closing brace of function Y in a diff.
I imagine that simply chopping the code into multiple text blocks, one per function, before passing it to a "diff" routine would accomplish most of that. After all, we already chop code into text blocks by filename...
A plausible approach would be to instrument the editor to log edit events. Then one could fit/learn an appropriate probabilistic model/grammar. After that, given two pieces of text A and B, it becomes a question of finding the most probable sequence of edits that takes you from A to B.
This is far from a new idea. Age-old Levenshtein distance can be seen as a special case in which the edit model is very simple: all letter additions, deletions, and substitutions are equally likely. The more realistic the edit model, the better the results will look, provided one has enough data to train on.
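A minimal sketch of that special case, in Go: a weighted edit distance whose per-operation costs could come from a learned edit model (think negative log-probabilities). With uniform costs it reduces to plain Levenshtein distance; the cost model here is a placeholder, not anything trained:

```go
package main

import "fmt"

// editCosts stands in for a learned edit model: each operation gets a cost.
type editCosts struct {
	ins func(r rune) float64
	del func(r rune) float64
	sub func(a, b rune) float64
}

// uniform costs make editDistance plain Levenshtein distance.
var uniform = editCosts{
	ins: func(rune) float64 { return 1 },
	del: func(rune) float64 { return 1 },
	sub: func(a, b rune) float64 {
		if a == b {
			return 0
		}
		return 1
	},
}

// editDistance runs the standard dynamic program over the given cost model.
func editDistance(a, b string, c editCosts) float64 {
	ra, rb := []rune(a), []rune(b)
	prev := make([]float64, len(rb)+1)
	curr := make([]float64, len(rb)+1)
	for j := 1; j <= len(rb); j++ {
		prev[j] = prev[j-1] + c.ins(rb[j-1])
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = prev[0] + c.del(ra[i-1])
		for j := 1; j <= len(rb); j++ {
			curr[j] = min3(
				prev[j]+c.del(ra[i-1]),            // delete from a
				curr[j-1]+c.ins(rb[j-1]),          // insert into a
				prev[j-1]+c.sub(ra[i-1], rb[j-1]), // substitute (free if equal)
			)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func min3(a, b, c float64) float64 {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	fmt.Println(editDistance("kitten", "sitting", uniform)) // prints 3
}
```

Swapping in non-uniform cost functions (learned from logged edit events) is all it takes to make the "most probable edit sequence" interpretation concrete.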
One could train models for specific languages (natural as well as programming). If we have language-specific syntax highlighting, why not language-specific edit models? One could do personalized models, and so on.
Pretty sure this is the same Eugene Myers I worked with a bit in the past -- he's the kind of guy who took "Wow, this looks complicated" or "I'm having a hard time following this" as the highest sort of compliment.
This algorithm is actually really useful in genomic sequencing -- I use it daily right now. Cool to see it pop up here, although I think it's less generally useful than other edit-distance methods.
Haven't read it yet. Curious about the implication of the title -- expecting the article to say that it could be asymptotically faster but practically slower. :) (That probably says more about how I choose to read headlines than anything else.)
Edit: Read the article now. Incredibly happy to see that the implication was on purpose. :)