Hacker News new | past | comments | ask | show | jobs | submit login
Myers Diff Algorithm – Code and Interactive Visualization (robertelder.org)
197 points by robertelder on July 3, 2017 | hide | past | favorite | 23 comments



I don't want a diff algorithm that finds the mathematically smallest edit distance, I want one that preserves meaningful chunks of code. I never delete the closing } from one function and then the entire function following it except its last }.


Here you go, it's called Tichy diff: http://ftp.cs.purdue.edu/research/technical_reports/1983/TR%...

This 1983 paper is still relevant today, it's an algorithm much better suited for code diffs and I'd like to use this instead. It's the best algorithm for big code refactors. I implemented it in Go, and hope to get back to it.

Learned about it on: http://bryanpendleton.blogspot.com/2010/04/more-study-of-dif...



HN hug? :)

The title of the paper is "The String-to-String Correction Problem with Block Moves". You can search for it with filetype:pdf on Google.

Here is another link (even better quality): https://www.researchgate.net/profile/Walter_Tichy/publicatio...

For some reason, I cannot edit my previous comment :(


This is a bit of a holy grail, that in practice isn't actually that important. https://stackoverflow.com/questions/523307/semantic-diff-uti... has a good overview of some of the possibilities.

Makes for amazing demos. Might some day be what we are all using. My hopes have been dampened severely by how long this has taken to make any head way. (And, honestly, it couples your diff tool to the rest of your tool chain rather heavily. I can see why it has a hard time taking off.)


I think you could get 90% of the utility of semantic syntax tree diff by standard text diff + some simple rules.

For example preferring to split on empty (whitespace-only) lines. Almost all languages and coding styles use empty lines as a "semantic separator" marker.


As a newbie to how diff-ing is done in production, does git use vanilla edit-distance diffing too?


Git uses Myers diff, but recently added some heuristics to "shift" the hunks in semantically meaningful ways. See https://github.com/mhagger/diff-slider-tools for the experiments that led to this feature.

It also supports a few other diff algorithms: `--patience` and `--histogram`, but in my experience they produce the same output as Myers in most cases.


i.e.:

1) convert code sample 1 to AST 2) convert code sample 2 to AST 3) do text-based diffs while paying attention to the ASTs

A closing brace is really part of the function started by the function definition and opening brace. As such, a closing brace from function X should never "match" an opening brace from function Y.

I imagine that simple chopping code into multiple text blocks, one for each function, before passing it to a "diff" routine would accomplish most of that. After all, we already chop code into text blocks by filename...


Chopping up the code by function is not enough. The same problem exists for blocks within functions, like with a sequence of if statements.


Yes... that should go without saying.


A plausible approach would be to instrument the editor to log 'edit -events'. Then one could fit/learn an appropriate probabilistic model/grammar. After that given two pieces of text A and B it becomes a question of finding the most probable edits that takes you from A to B.

This is far from a new idea. Age old Levenstein distance can be seen as a special case. Here the edit model is very simple. All letter addition, deletion, and substitutions are equally likely. More realistic a model one uses the better the results will look like provided one has enough data to train.

One could train models for specific languages (natural as well as programming). If we have language specific syntax highlighting, why not language specific edit models. One could do personalized models, etc etc.


A couple months ago I wrote a diff algorithm that was pretty popular on the front page here, http://prettydiff.com/guide/unrelated_diff.xhtml


Another explanation of Myers diff algorithm that I found interesting: https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm...


My God that code is terrifying. It looks like output from a code obfuscator.


Pretty sure this is the same Eugene Myers I had worked with a bit in the past -- he's the kind of guy who took, "Wow, this looks complicated," or "I'm having a hard time following this," as the highest sort of compliment.


This algorithm is actually really useful in genomic sequencing - daily usage for me right now. Cool to see it pop up here, although I think it's less generally useful than other edit distance methods.


Wow! this is a very beautiful code and website

- single letter variables, as in the rest of mathematics

- succinct without being obfuscated

- the whole thing fits on your screen without need to scroll (no dependences, no hidden pieces)

- the complex algorithm is clarified in the text below, breaking it into smaller parts and explaining each one in detail

- comprehensive examples, figures and references

This is truly a model of exposition.


Seems like this would have been super useful when I was creating a Levenshtein transition effect: https://jsfiddle.net/kdqost0z/2/

Not that efficiency really matters much for my use-case!


I always thought finding the optimal diff (from the user perspective) was an NP Complete problem.


I would expect that to be an unsolvable problem, because there will be plenty of cases where not all users agree on what is the optimal diff.


Wweereee


Haven't read it yet. Curious about the implication of the title. Expecting article to be that it could be asymptotically faster, but practically slower. :) (That probably says more about how I choose to read headlines than anything else.)

Edit: Read the article now. Incredibly happy to see that the implication was on purpose. :)




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: