A lot of things in git are just pointers to commits, and then the git implementation handles them under the covers in some way that usually makes sense but not always.
One example that also bites people: moving files isn't stored in git - if you move files (even with `git mv`) and create a new commit, the moves aren't stored, but this is reconstructed later by the client based on similarity, which comes from the diff algorithm.
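To make the "reconstructed later based on similarity" part concrete, here is a minimal Python sketch of the idea. It is not git's actual algorithm: `detect_renames` is a hypothetical helper, the snapshots are plain dicts, and `difflib` stands in for git's similarity scoring. The 0.5 threshold mirrors git's default 50% similarity index for rename detection.

```python
from difflib import SequenceMatcher


def detect_renames(old_tree, new_tree, threshold=0.5):
    """Pair deleted paths with added paths whose contents look alike.

    old_tree / new_tree map path -> file content (two full snapshots).
    Returns {old_path: new_path} for pairs scoring at or above threshold.
    """
    deleted = {p: c for p, c in old_tree.items() if p not in new_tree}
    added = {p: c for p, c in new_tree.items() if p not in old_tree}

    # Score every (deleted, added) pair by line similarity, best first.
    scored = sorted(
        (
            (SequenceMatcher(None, old.splitlines(), new.splitlines()).ratio(), src, dst)
            for src, old in deleted.items()
            for dst, new in added.items()
        ),
        reverse=True,
    )

    renames, used_src, used_dst = {}, set(), set()
    for score, src, dst in scored:
        if score < threshold:
            break  # remaining pairs are even less similar
        if src in used_src or dst in used_dst:
            continue  # each file takes part in at most one rename
        renames[src] = dst
        used_src.add(src)
        used_dst.add(dst)
    return renames


before = {"util.py": "a\nb\nc\nd\n", "main.py": "print('hi')\n"}
lightly_edited = {"helpers.py": "a\nb\nc\nX\n", "main.py": "print('hi')\n"}
heavily_edited = {"helpers.py": "a\nX\nY\nZ\n", "main.py": "print('hi')\n"}

print(detect_renames(before, lightly_edited))  # rename is recognized
print(detect_renames(before, heavily_edited))  # looks like delete + add
```

The second call shows the failure mode people complain about: rename a file and change enough of it in the same commit, and the pairing falls below the threshold.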
Yup. “Storing moves” is the kind of thing that sounds intuitively obvious but gets gnarly once you think about it for five minutes. How do you catch all file moves (intent) beyond the simple identical content cases, and how do you represent them internally? Those questions turn out to be so non-obvious that you realize just using snapshots is really the best thing to do.
It’s completely trivial. The obvious and correct place is in the commit object just like author and date and such, since renaming is semantically part of the commit, not the tree:
And you don’t detect moves (because that’s madness), but require that people record them deliberately, just like every other VCS has done. There’s even git-mv already, it just skips a step that every other VCS’s equivalent command would do. (And technically this all works out because the index is a commit, so you can record the rename normally.)
Of course, all of this assumes that moving a file is a meaningful operation. Perhaps ideally (for most languages and systems) you’d track this in far smaller chunks, so that you can track changes to a function even when it alone was moved to a different file. But things like Git aren’t interested in those kinds of semantics, and work technically at the file level, more or less. So I think it should track renames: in practice straightforward renames are super common, but they often also involve other changes that thwart rename detection. Years ago Linus explained why he didn’t like storing moves (someone else has linked it), but I’m largely not sold on his reasoning. The pursuit of the perfect has hindered the useful, and file renames are commonly meaningful in more ways than he said.
Like I implicitly said: how to do it beyond the “simple identical content cases”?
But if the solution is for the user to explicitly order renames (i.e., this renamed Java class is a file move) then the solution is indeed simple.
I see the point that Linus was making that you may want to be able to see “function moves” and so on. But in practice I am very often interested in file moves since you can inspect the file history easily in Git—except when you hit some wall because someone renamed the file. Then you need to re-run the command with `--follow`. Contrast all of that with a function move... I almost never can summon the will to fish out the incantation (like a regex or a robust line range) which will give me the history of a function across intra- or inter-file moves and so on.
The problem with that approach is that it usually doesn't cover the real-world scenario where you rename a file in a tool (like some IDE) and the tool doesn't perform the corresponding git operation.
(yes, some IDE might have git integration, but personally I don't like my IDE messing with git, except read-only (annotate, diff))
That’s… nothing special. If you don’t have Git integration in your IDE, you already have to do something like `git mv` or a `git add` and `git rm`. Nothing has changed in this new hypothetical world.
I think this is the one thing I feel BitKeeper does better than Git. Git can get confused about where a file came from, for moves but especially for copies, and so the version history ends, even if you ask it to try and follow along. BitKeeper, on the other hand, keeps the moves and copies as part of the history, so you can always trace it through to the origin of the file, no matter how circuitous.
It's kind of funny to see Linus browbeat other people into submission regardless of whether he is right or not, while claiming "I am always right".
A few counter points:
- `hg` has `cp`, and I believe both Meta and Google's internal systems have that;
- git has `mv`, which was added later, but it is really janky and git will forget that files were moved, which I think is because git doesn't try to track that, likely because of the philosophy here;
- as for storing file moves - nobody said you *have* to use this information, but you can certainly use this information to help with things.
The whole thread is an interesting read though and I will try going through it someday - maybe doing that would change my mind.
I'd be happy to argue why Linus is wrong here. Many things would be much easier if git recorded some more metadata in every commit: file moves, and branch moves, to start with.
Having some sort of notion of "parent branch" would be very useful for a number of common operations, and a "renamed file" without having to rely on client dependent heuristics too. Empty files trip people up all the time so a "create file" would fit in perfectly.
These concepts would also be a good basis for more user-friendly clients. Other version control systems do this, so the surprise factor should be low.
People would get lazy and rename a file without telling Subversion they had done it, so it would write a “old file deleted, new file created from nothing” revision. Most of the merge conflict resolution machinery just couldn’t run without the missing guidance. Git infers someone probably renamed a file you edited or vice versa, which seems risky but works better in practice.
In short, Linus's stance is that file renaming doesn’t matter; only the contents of files matter, and the moving of contents between files. Moved/renamed files then fall out as a special case of moving content.
Personally, I think this is a case of the better being the enemy of the good, and his “clearly superior algorithm” doesn’t work as well as claimed in practice. Or maybe tooling merely still isn’t up to snuff after 18 years.
I don't think it's about having a stance, it's about git's architecture. From the commit graph point of view, there's no such thing as moving anything at all, neither files nor content. Commits represent a whole new state of the repository, not a diff from the previous state. The only way a commit is linked to the previous state is via the parent pointer; it can otherwise be completely unrelated (and you can simply change the parent pointer without changing anything else in the commit). Any diffs are calculated at runtime. The issue with renames is just a consequence of adopting such a data model. You could try to plaster over it with some metadata, but ultimately you would still be fighting against the model rather than working with it.
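A toy model of that, for illustration only: the `Commit` class and `diff` function below are hypothetical stand-ins (real git stores trees and blobs as hashed objects, not dicts), but they show a snapshot plus a parent pointer, with the diff computed at read time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    tree: dict                          # full snapshot: path -> content
    parent: Optional["Commit"] = None   # the only link to earlier state
    message: str = ""


def diff(commit):
    """Derive a change set by comparing a snapshot with its parent's."""
    old = commit.parent.tree if commit.parent else {}
    new = commit.tree
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "modified": sorted(p for p in new.keys() & old.keys() if new[p] != old[p]),
    }


root = Commit({"a.txt": "1", "keep.txt": "x"}, message="initial")
moved = Commit({"b.txt": "1", "keep.txt": "x"}, parent=root, message="move a -> b")

# The "move" is nowhere in the data; it only shows up as delete + add.
print(diff(moved))

# Same snapshot, different parent pointer: the computed diff changes,
# although the tree itself is untouched.
reparented = Commit(moved.tree, parent=None, message=moved.message)
print(diff(reparented))
```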
Many people develop a bad mental model with commits as diffs, because that's what the UI makes them think commits are. It can work for a while, but inevitably leads to confusion later on.
As you say, commits link to their parent(s), and those links effectively represent the edges of the commit graph. It makes perfect sense to record moves on those edges. That’s how other VCSs do it. There is no conflict with the commit model.
Viewing the commit graph in terms of nodes (commits) or edges (diffs) is equivalent, these are dual views you can easily convert between. The internal representation is independent from that. Some VCSs use a mix of diffs and full revisions internally. Even Git uses delta compression when packing objects.
What I meant is that git doesn't have any structure to represent an edge other than a simple pointer. Conceptually it wouldn't be a big change to add one, but the consequence is that everything in git revolves around nodes rather than edges, and whenever the concept of an edge is needed (such as in "cherry-pick") it's calculated on the fly.
I don’t see where this would be causing any issues. There is a canonical place where to put edge metadata, namely in the child commit. And whenever you’re interested in move information, you have to process the respective child commit anyway.
If you think of it not as a "rename" (which would belong in the edge object if it existed) but rather as a "note: the file A in this tree was known as B in the parent tree" it would make perfect sense to store it in the child commit.
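A rough sketch of what that note could look like, purely hypothetical: git has no such field, and `renamed_from` and `follow` are names invented here for illustration. The point is only that a per-commit "was known as" map is enough to trace a file's history without any similarity heuristics.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Commit:
    tree: dict                         # path -> content snapshot
    parent: Optional["Commit"] = None
    # Hypothetical metadata: path in this tree -> path it had in the parent tree.
    renamed_from: dict = field(default_factory=dict)


def follow(commit, path):
    """Yield the path a file had in each commit, newest first."""
    while commit is not None and path in commit.tree:
        yield path
        # Translate the path into the parent's namespace before stepping back.
        path = commit.renamed_from.get(path, path)
        commit = commit.parent


c1 = Commit({"Foo.java": "class Foo {}"})
# Renamed and rewritten in the same commit, so content similarity would fail.
c2 = Commit(
    {"Bar.java": "class Bar { /* rewritten */ }"},
    parent=c1,
    renamed_from={"Bar.java": "Foo.java"},
)
c3 = Commit({"Bar.java": "class Bar { int x; }"}, parent=c2)

print(list(follow(c3, "Bar.java")))
```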
Git doesn't store any individual changes: files moved, lines added, lines deleted, etc.
It stores a commit graph, and a tree at each of those commits. (A lossless compression algorithm deduplicates information.)
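For a concrete feel, here is a small Python sketch of how a file's content becomes an object ID, assuming a repository that uses the default SHA-1 object format. `blob_id` is a name made up for this example; the header layout (`blob <size>\0`) is git's.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The path is not part of the blob: a tree maps names to blob IDs.
content = b"hello world\n"
old_tree = {"a.txt": blob_id(content)}
new_tree = {"b.txt": blob_id(content)}

print(blob_id(b""))  # the well-known ID of the empty blob
print(old_tree["a.txt"] == new_tree["b.txt"])  # an unmodified move keeps the ID
```

Because the path lives in the tree and not in the blob, a file moved without edits keeps its ID, which is why exact renames are cheap to spot and only edited ones need similarity scoring.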
There's no need for the author to be concerned with what diffing information gets incorporated into the commit. Diffs are up to the viewer of the commit history.
This has resulted in a feature not in VCSs that do track renames: using matching lines, git blame can track changes across files that were combined in a commit, where others would record half the lines as being a rename from one file and the other half as new lines (if you even thought to do it like that when making the commit; more likely the whole file would be tracked as new).
My TL;DR for git commits is that they are connected like a linked list, but in reverse, and with more pointers than just head/tail. I recommend having a look at Merkle trees. I don't understand the git CLI, but I can manipulate git commits, branches, tags, etc. well with a good git UI, based on that basic understanding.
One example that also bites people: moving files isn't stored in git - if you move files (even with `git mv`) and create a new commit, the moves aren't stored, but this is reconstructed later by the client based on similarity, which comes from the diff algorithm.
And git has multiple diff algorithms to pick from: https://git-scm.com/docs/git-config#Documentation/git-config...
And you can optionally turn off rename detection in diff output with `diff.renames`: https://git-scm.com/docs/git-config#Documentation/git-config...