Excuse my ignorance, but couldn’t they just add a SHA256 hash to commit objects (or some new commit-verify object) of the entire tree’s current concatenated content, leave everything else SHA1 and get the same benefit without rewriting the entire thing from the ground up? Git could even do that as part of the git gc step slowly over time - tag commits with a secondary hash.
Rewriting the whole thing including every git repo’s history seems like throwing the baby out with the bathwater, when you could just add a secondary transparent verification instead. Just seems like there has to be a better way.
You can't change past commits to add that hash (without changing all commit hashes), so this method could only protect new commits. For any existing repo this would lead to a very weird security model: We admit that sha1 hashes are broken, and only guarantee that commits made by git versions newer than git x.x.x are safe from after-the-fact modification (or alternatively only commits made after date X).
My inclination is that protecting only new commits might be enough, but it gets me thinking: What would a practical attack on this look like, assuming sha1 was broken? Let's say I'm trying to insert a line of code that does something nefarious, and that it's now trivial to generate "magic text" you can stick anywhere in a file (eg, inside a comment at the end of a line) to get any desired sha1 hash.
Are all the other future commits still valid, or am I going to suddenly get conflicts or garbled text? Depending on where the modification is done, that code might have gone through much more churn -- especially if there are a bunch of sha-256 commits after it (which I can't attack). I don't know enough about how git stores content blobs to answer this.
Second problem: Can I push my replacement commit to another repository (eg, github)? Would even force push work? Do I have to delete branches and re-push my own? If I already have enough permission on the repository to do this, it means I can already push whatever I want -- so does this attack even matter at all?
Assuming that's successful (or I can trick people into using my own repository), what will happen to someone that already has a clone and does a pull? Will they get my change (and will it work or be a pile of conflicts or garbled text)?
Even if only fresh clones will get the changes it could still be quite devastating -- especially if using CI -- but I'm just not clear if this attack is even theoretically possible.
> My inclination is that protecting only new commits might be enough
Why? It's not the same as saying 'versions after vX are safe', it's the same as saying 'any unsafety after vX was there before, not introduced since' (both with 'as a result of SHA-1 collision' qualifiers of course).
> Can I push my replacement commit to another repository (eg, github)? Would even force push work?
Implementation dependent I suppose, but I wouldn't have thought so - I don't see why they'd actually check the content when the hash is supposed to indicate whether it differs or not.
> Do I have to delete branches and re-push my own? If I already have enough permission on the repository to do this, it means I can already push whatever I want -- so does this attack even matter at all?
I think an attack would look more like:
1. Create hostile commit that collides with extant commit SHA
2. Infiltrate a package repository, or GitHub, or corporate network, or ...
3. Insert hostile commit in place of real one
Of course it's a problem if 2 & 3 happen alone anyway, but the problem with the collision commit is that it makes it so much less detectable.
Git commits are snapshots, not diffs. Each commit contains a tree, which contains a list of files and their respective hashes. As long as its whole tree is SHA-256 then a commit should be safe, regardless of its history.
The downside to the migration would be that all unchanged files would be stored twice (once identified by SHA1, once identified by SHA-256). But you could work around that by hardlinking identical files.
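A rough sketch of that storage point, assuming nothing about git's actual object format beyond content-addressing (the object header is ignored for brevity, and the `put` helper and in-memory `store` are made up for illustration): the same bytes get two different names, so a naive store keeps two copies unless both names point at one object, which is the in-memory analogue of the hardlink workaround.

    import hashlib

    store = {}  # object name (either algorithm) -> stored content

    def put(content: bytes):
        # One copy of the bytes, reachable under both its SHA-1 name and its
        # SHA-256 name (the in-memory analogue of hardlinking the two objects).
        for algo in ("sha1", "sha256"):
            store[hashlib.new(algo, content).hexdigest()] = content

    put(b"unchanged file contents\n")
    print(len(store), "names,", len(set(map(id, store.values()))), "copy")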
This doesn't protect subdirectories unless you rewrite the entire tree structure with SHA256. I don't know if Git does that now, or not. Git generally points to unmodified subdirectories with the existing content hash; if the SHA1 is pointed to by SHA256, which is implied by the transition plan proposed in the great-grandparent comment, then those subdirectories are essentially unprotected.
> Are all the other future commits still valid, or am I going to suddenly get conflicts or garbled text? Depending on where the modification is done, that code might have gone through much more churn -- especially if there are a bunch of sha-256 commits after it (which I can't attack). I don't know enough about how git stores content blobs to answer this.
A blob is a "snapshot" of a file. The next version of a file is a completely different blob with no direct relation to the previous.
"Pack files" use delta compression in order to lower the actual size of "similar" blobs.
You could get conflicts if you tried merging or rebasing over the nefarious blob, and the "patch history" (git log -p, which builds the patch view on the fly) would show possibly unexpected complete file replacements.
Couldn't they make a table that contains a list of all the old objects by SHA1 hash, and for each one the new SHA256 hash of that object, and then commit this table in the repository?
I’m not familiar with the internal data structures of git, but couldn’t you add the new hash as a commit in a new format “on the side”, leaving the original commit as is?
This is kind of what rewriting the repo is. Yes, you could leave the SHA1 commit tree around afterwards (i.e., for convenience of existing URLs), but you wouldn't want to keep SHA1 around as the authoritative hashname.
Git does have a commit-related object called a note that you can attach as a separate object. [1]
Presumably the proposed "hash translation store" could use an approach similar to notes, and include the hash translations as objects in the git database (hopefully in a way that could be signed by a tag).
Hashing everything in one go doesn't scale well. When making a new commit you want to only hash a proportional part of the repository, and the tree structure of git allows that, only the files and "tree" objects (directory listings) that change are hashed again.
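A toy model of that, with made-up helper names and none of git's real tree serialization: directories hash the (name, hash) pairs of their entries, so editing one file only invalidates the hashes on its path to the root, and untouched subtrees keep their existing object names.

    import hashlib

    def tree_hash(tree):
        # A "tree" here is just a dict: name -> file bytes or nested dict.
        # Hash bottom-up over (name, child hash) pairs, roughly like git does.
        entries = []
        for name in sorted(tree):
            child = tree[name]
            h = tree_hash(child) if isinstance(child, dict) else hashlib.sha256(child).hexdigest()
            entries.append(f"{name} {h}")
        return hashlib.sha256("\n".join(entries).encode()).hexdigest()

    repo = {
        "README": b"hello\n",
        "src": {"main.c": b"int main(void){return 0;}\n", "util.c": b"...\n"},
        "docs": {"guide.md": b"# Guide\n"},
    }

    docs_before, root_before = tree_hash(repo["docs"]), tree_hash(repo)
    repo["src"]["main.c"] = b"int main(void){return 1;}\n"  # edit one file
    print(tree_hash(repo) != root_before)           # True: root tree hash changed
    print(tree_hash(repo["docs"]) == docs_before)   # True: untouched subtree keeps its hash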
I’m already seeing a lot of discussion both here and over at LWN about which hash algorithm to use.
The Git team made the right choice: SHA2-256 is the best choice here; it has been around for 19 years and is still secure, in the sense that there are no known attacks against it.
Both BLAKE[2/3] and SHA-3 (Keccak) have been around for 12 years and are both secure; just as BLAKE2 and BLAKE3 are faster reduced round variants of BLAKE, Keccak/SHA-3 has the official faster reduced round KangarooTwelve and MarsupilamiFourteen variants.
BLAKE is faster when using software to perform the hash; Keccak is faster when using hardware to perform the hash. I prefer the Keccak approach because it gives us more room for improved performance once CPU makers create specialized instructions to run it, while being fast enough in software. And, yes, SHA-3 has the advantage of being the official successor to SHA-2.
SHA-256 is probably the right choice, but I don't think it's as obvious as you suggest, given SHA-512/256.
SHA-512/256 is a standard peer-reviewed and well-studied way to run SHA-512 with a different initial state and then truncate output to 256 bits.
This is heavy bikeshedding, but SHA-512/256 would be a more conservative choice than SHA-256. Under standard assumptions, SHA-256 is no weaker than SHA-512. The structure is extremely similar to SHA-256, but a collision on intermediate state requires a collision on all 512 bits of state instead of 256.
On most 64-bit CPUs without dedicated hash instructions, SHA-512/256 is faster for messages longer than a couple of blocks, due to processing blocks twice as large in fewer than twice as many operations.
Currently, the latest server and laptop CPUs have SHA-256 hardware acceleration but not SHA-512 acceleration. I'm not sure how many phone CPUs support sha256 but not ARMv8.2-SHA extensions (SHA-512). If it weren't for this difference in hardware acceleration, there would be few reasons to use SHA-256.
That being said, the current difference in hardware acceleration support probably makes SHA-256 the right choice here.
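If you want to sanity-check the throughput claim on your own machine, a rough comparison like the following works (plain sha512 stands in for SHA-512/256, which is just SHA-512 with different initial values and a truncated output, so the speed is the same; whether sha256 wins depends on whether your OpenSSL build uses SHA-NI):

    import hashlib, timeit

    data = b"\x00" * (1 << 20)  # 1 MiB of input

    def sha256_run(): hashlib.sha256(data).digest()
    def sha512_run(): hashlib.sha512(data).digest()

    for fn in (sha256_run, sha512_run):
        secs = timeit.timeit(fn, number=200)
        print(fn.__name__, f"{200 * len(data) / secs / 1e6:.0f} MB/s")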
SHA-512/256 is a lot newer than SHA2-256 (usually called SHA-256, but I prefer the SHA2 prefix to make it clear that it’s a very different beast than SHA3-256), and its speed on 32-bit CPUs is less than optimal, so I don’t see it as being a more conservative choice. In terms of security, it uses the same 19-year-old unbroken algorithm as SHA2-256.
I am aware of the length extension issues, but they are not relevant for Git’s use case.
In terms of support, SHA-512/256 has, as you mentioned, less hardware acceleration support, and it’s also not supported in a lot of mainstream programs like GNU Coreutils. I also know that some companies mandate using SHA2-256 whenever a cryptographic hash is needed.
Git made the right choice with SHA2-256: It’s the most widely supported secure cryptographic hash out there.
> BLAKE is faster when using software to perform the hash
Is BLAKE 3 still faster than sha-256 when using the CPU's specialized instructions? I think most modern desktop CPUs have built-in instructions for SHA256.
I’m guessing when people compare BLAKE 3 to SHA 256 they’re comparing software to software, but this wouldn’t be the case in reality?
I haven’t seen any benchmarks for BLAKE3 vs. the Intel/AMD SHA extensions. My guess is that Intel hardware accelerated SHA-256 will be faster than BLAKE3 running in software for most real world uses.
I can tell you this much: It is only with Ice Lake, which was released in the last year, that mainstream Intel chips finally got native high-speed SHA-NI support. Coffee Lake and Comet Lake, which are still the CPUs in a lot of new laptops being sold right now, do not support SHA-NI.
It's possible that Blake3 might be faster than accelerated SHA-256 on large inputs, where Blake3 can maximally leverage its SIMD friendliness. OTOH, Blake3 really pushes the envelope in terms of minimal security margin. Performance isn't everything. SHA-3 is so slow because NIST wanted a failsafe.
NOTE: /proc/cpuinfo shows sha_ni detection, and the apt-get source of this version of OpenSSL confirms SHA extension support in the source code, but I didn't confirm that it was actually being used at runtime.
This is based on the parent’s numbers with a fudge factor to account for Blake3 being a faster version of Blake2s256 (i.e. the 32-bit variant of Blake2, which is the only variant Blake3 has).
Of course, this does not take into account that Blake3 has tree hashing and other modes which scale better to multiple cores.
(Edit: update figures; I need to scale up Blake2s256 not Blake2b512)
The BLAKE3 tree mode also takes advantage of SIMD parallelism on a single core, which ends up being a larger effect than the reduced number of rounds. At 2-4 KiB of input (depending on the implementation) it's 2x faster than BLAKE2s on my laptop. Where AVX2 and AVX-512 are supported, those kick in at 8 KiB and 16 KiB of input respectively, widening the difference further. The red bar chart at https://github.com/BLAKE3-team/BLAKE3 is a single-threaded measurement on a machine that supports AVX-512.
My experience in developing and maintaining Fossil is that the hashing speed is not a factor, unless you are checking in huge JPEGs or MP3s or something. And even then, the relative performance of the various hash algorithms is not enough to worry about.
Thanks for the insight. My intuition was kind of the same: on modern hardware, computing a digest-style hash (as opposed to a deliberately slow, password-hashing-style one) is essentially imperceptible for payloads in the low MBs -- and much above that is a use case for LFS.
> That's appalling. Fossil's implementation doesn't require a conversion.
“This is a key point, that I want to highlight. I'm sorry that it wasn't made more clear in the LWN posting nor in the HN discussion.
“With Fossil, to begin using the new SHA3 hash algorithm, you just upgrade your fossil binary. No further actions, workflow changes, disruptions, or thought are required on the part of the user.
* “Old check-ins with SHA1 hashes continue to use their SHA1 hash names.”
* “New check-ins automatically get more secure SHA3 hash names.”
* “No repository conversions need to occur”
* “Given a hash prefix, Fossil automatically figures out whether it is dealing with a SHA1 or a SHA3 hash”
* “No human brain-cycles are wasted trying to navigate through a hash-algorithm cut-over.”
“Contrast this to Git, where a repository must be either all-SHA1 or all-SHA2. Hence, to cut-over a repository requires rebuilding the repository and in the process renaming all historical artifacts -- essentially rebasing the entire repository. The historical artifact renaming means that external links to historical check-ins (such as in tickets) are broken. And during the transition period, users have to be constantly aware of whether they are using SHA1 or SHA2 hash names. It is a big mess. It is no wonder, then, that few people have been eager to transition their repositories over to the newer SHA2 format.”
The way I read the Fossil author's comments, old commits continue to use sha1 hashes. A repository will be vulnerable to sha1 collision attacks as long as there is an object in the repository that has not been hashed with the new algorithm.
For example, floppy.c could be replaced in a repo with a file with the same sha1 hash, as long as the last commit that modifies floppy.c used a sha1 hash.
Just to be clear:
Every time you modify a file, the new changes get put in using SHA3. In an older repository, any given commit might have some files identified using SHA1 (assuming they have not changed in 3 years) and others identified using SHA3. For example, the manifest of the latest SQLite check-in can be seen at (https://www.sqlite.org/src/artifact/29a969d6b1709b80). You can see that most of the files have longer SHA3 hashes, but some of the files that have not been touched in three years still carry SHA1 hashes.
An attack like what you describe is possible if you could generate an evil.c file that has the exact same SHA1 hash as the older floppy.c file. Then you could substitute the evil.c artifact in place of the floppy.c artifact, get some unsuspecting victim to clone your modified repository, and cause mischief that way. Note, however, that this is a pre-image attack, which is rather more difficult to pull off than the collision attacks against SHA1, and (to my knowledge) has never been publicly demonstrated. Furthermore, the evil.c file with the same SHA1 hash would need to be valid C code that does something evil while still yielding the same hash (good luck with that!), and Fossil (like Git) has also switched over to Hardened SHA1, making the attack even harder still.
As still more defense, Fossil also maintains an MD5 hash of the entire content of the commit. So, in addition to finding an evil.c that compiles, does your evil bidding, and has the same hardened-SHA1 hash as floppy.c, you also have to make sure that the entire commit has the same MD5 hash after substituting the text of evil.c in place of floppy.c.
So, no, it is not really practical to hack a Fossil repository as you describe.
Isn't this the same attack given as an example why git is migrating hash functions in the subject article?
The attack may be difficult and unlikely I'm not questioning that, but if I understand correctly then Fossil's migration is straightforward because they did not address the same issues Git chose to.
> if I understand correctly then Fossil's migration is straightforward because they did not address the same issues Git chose to.
I think more is at play here.
(1) You can set Fossil to ignore all SHA1 artifacts using the "shun-sha1" hash policy.
(2) The excess complication in the Git migration strategy is likely due to the inability of the underlying Git file formats to handle two different hash algorithms in the same repository at the same time.
But, I could be wrong. Post a rebuttal if you have evidence to the contrary.
> (2) The excess complication in the Git migration strategy is likely due to the inability of the underlying Git file formats to handle two different hash algorithms in the same repository at the same time.
> But, I could be wrong. Post a rebuttal if you have evidence to the contrary.
It seems unfair to demand a rebuttal when you are the one who made the claim.
According to the article at least, the difficulty stems mainly from their migration strategy, for converting all existing SHA1 hashes.
> the difficulty stems mainly from their migration strategy, for converting all existing SHA1 hashes.
That's essentially the same difficulty, since the only strategy for doing this that has been historically proven to work seamlessly and painlessly involves being able to handle both hash algorithms in the same repository at the same time.
> Furthermore, the evil.c file with the same SHA1 hash would need to be valid C code that does something evil while still yielding the same hash
...and also produce an innocent-looking diff!
I mean, you could stuff a bunch of random bytes into a C comment to force the desired hash in the output using these documented attack techniques, but anyone inspecting the diffs between versions is likely to see such an explosion of noise and call foul.
If you want an analogy, it's like someone saying they've learned to impersonate federal agent identification cards, only it requires the person carrying the fake ID to have a thousand rainbow-dyed ducks on a leash in tow behind him.
Such attacks are fine when it's dumb software systems doing the checks, but for a source code repository where people do in fact visually check the diffs occasionally?
Well, let's just say that when someone manages to use SHAttered and/or SHAmbles type attacks on Git (or even Fossil) I expect that it won't take a genius detective to see that the repo's been attacked.
Sure, many thousands of people doing blind "git clone && configure && sudo make install" could be burned by a problem like this, but someone would eventually do a diff and see the problem on any project big enough to have those thousands of trusting users in the first place.
I'm not excusing these SHA-1 weaknesses, only pointing out that it won't be trivial to apply them to program source code repos no matter how cheap the attacks get.
For instance, the demonstration case for SHAttered was a pair of PDFs: humans can't reasonably inspect those to find whatever noise had to be stuffed into them to achieve the result.
I also understand that these SHA-1 weaknesses have been used to attack X.509 certificates, but there again you have a case very unlike a software code repo, where the one doing the checking isn't another programmer but a program.
The problem is that we are considering an issue where different people can get different objects for the same hash. If the people checking all see the valid files, they cannot raise any alarms to save the poor victims who got poisoned with the wrong objects. They'll clone from the wrong fork, and no amount of checking hashes or signed tags will prevent them from running compromised code.
...which will likely contain thousands of bytes of pseudorandom data in order to force the hash collision...
> they cannot raise any alarms
You think a human won't be able to notice that the diff from the last version they tested looks awfully funny? Code that can fool the compiler into producing an evil binary is one thing, but code that can pass a human code review is quite another.
You might be surprised how often that occurs.
I don't do a diff before each third-party DVCS repo pull, but I do diff the code when integrating such third-party code into my projects, if only so I understand what they've done since the last time I updated. Commit messages, ChangeLogs, and release announcements only get you so far.
Back when I was producing binary packages for a popular software distribution, I'd often be forced to diff the code when producing new binaries, since several of the popular binary package distribution systems are based on patches atop pristine upstream source packages. (RPM, DEB, Cygwin packages...)
Each time a binary package creator updates, there's a good chance they've had to diff the versions to work out how to apply their old distro-specific patches atop the new codebase.
Someone's going to notice the first time this happens, and my guess is that it'll happen rather quickly.
If this is your threat model, you don't need hashes or signed tags at all. Good for you. Thankfully both Fossil and Git disagree with you and take the threat seriously :)
That's an argument for why you shouldn't worry about sha1 attacks in source control, but we should take the attack for granted when discussing how to mitigate the attack.
If we weren't worried about sha1 collisions in git then we wouldn't switch to a new hash function.
When is the right time to worry? Maybe wait until someone publishes a practical attack, then wait years for the new code to get sufficiently far out into the world that you can switch to it?
I mean, I see you're expressing concern, but the first major red flag on this went up three years ago, and another big one went up last month. (https://sha-mbles.github.io/)
When we dealt with this same problem over in Fossil land, we ended up needing to wait most of three years for Debian to finally ship a new enough binary that we could switch the default to SHA-3. Fortunately (?) RHEL doesn't ship Fossil, else we'd likely have had to wait even longer.
Atop that same problem, Git's also got tremendously more inertia. Git has to wait out not only the Debian and RHEL stable package policies but also all of that infrastructure tooling they brag on. Every random programmer's editor, merge tool, Git front end... all of that which a project depends on will have to convert over before that one project can move to a post-SHA-1 future.
Doesn't all of this apply to git just as well, except for the last bit about the MD5 hash?
It just seems to me that the Fossil maintainers have decided that keeping all old SHA1 hashes is acceptable, while the git maintainers have decided that it is not.
Unless I've misunderstood, this is why it was "so easy" for Fossil to transition to a new hashing algorithm. Not some superiority in the design of Fossil, as implied on the Fossil forums.
And if you are that concerned about this type of attack, it may be worth your time to simply start a new Fossil repository using the sha3-only hash policy (writing a script to replay commits into the new repo, so you don't lose history).
It seems like a problem very few people need to worry about and Fossil has made the right trade-offs.
In addition to D. Richard Hipp's thoughts as HN user SQLite — author also of Fossil, so he oughtta know — I offer these:
1. Keep in mind that Fossil and Git are both applications of blockchain technology, which in this particular practical case means you must not only forge a single artifact's hash, you must also do it in a way that allows it to fit into the overall blockchain.
2. Fossil's sync protocol purposefully won't apply Dr. Hipp's hypothetical evil.c to an existing Fossil blockchain if presented it. Fossil will say, "I've already got that one, thanks," and move on. Only new or outdated clones could be so-fooled.
If you're looking for prior art, ZFS's application of Merkle trees predates both. I think there was some other public use before that, but I can't recall right now.
They are also using "Hardened SHA1", which detects collision attacks, and assigns a longer id to commits which seem malicious, while being backwards compatible.
Ah, so it uses "Hardened SHA1", which detects if you are trying to exploit SHA1, and then produces a longer, unambiguous hash. But otherwise Hardened SHA1 has the same output as SHA1, so it's a drop in replacement.
Then it also has a similar looking-ish migration to SHA3-256.
Fossil defaults to SHA3-256 since 2.10 (released in October 2019). But it has had SHA3-256 since March 2017, and generally any repos/clones managed with a Fossil version since then have been seamlessly updated to SHA3-256 in the background.
I wonder if it would make sense to use `concat(sha1, sha256)` hash algorithm. This wouldn't change the prefixes while improving strength of an algorithm (by including SHA256 in a hash).
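For concreteness, a minimal sketch of the proposed object id (the helper name is made up); the point is that the first 40 hex characters are exactly the old SHA-1 name, so existing short prefixes keep resolving:

    import hashlib

    def concat_object_id(data: bytes) -> str:
        # 40 hex chars of SHA-1 followed by 64 of SHA-256, 104 characters total.
        return hashlib.sha1(data).hexdigest() + hashlib.sha256(data).hexdigest()

    full = concat_object_id(b"some object payload")
    print(full[:7])   # the familiar short SHA-1 prefix still works as before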
I suppose you are advocating two distinct Merkle trees? Because otherwise the prefixes will change anyway.
But the only reason this would be attractive is because then people could keep using the existing prefixes to refer to the whole commit. But of course doing this would be insecure. So for this to make any sense at all, people would need to make good choices on when to use an insecure prefix and when to use the whole hash, because it's security relevant. This seems a bit doubtful to me.
To be fair, the prefix problem would exist no matter what hash function you pick. GitHub displays 7 characters of a hash, giving 28 bits. You could generate collisions with a birthday attack in pretty much no time. Prefixes are always going to be insecure because they are so short.
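To make that concrete, here's a quick birthday-attack sketch against 7-hex-char (28-bit) prefixes; by the birthday bound a match shows up after roughly 20k hashes, which takes well under a second:

    import hashlib, itertools

    seen = {}
    for i in itertools.count():
        data = b"candidate commit %d" % i
        prefix = hashlib.sha256(data).hexdigest()[:7]   # 28-bit prefix, like GitHub shows
        if prefix in seen:
            print(f"prefix collision after {i} hashes: {seen[prefix]!r} vs {data!r} -> {prefix}")
            break
        seen[prefix] = data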
Correct, but backwards compatibility does make a difference here, as in: there are surely quite a few cases where it would not be attractive to use a shortened hash if git hashes are changed incompatibly anyway, but where it will be attractive to use the shortened hash, because that keeps an existing setup working as before.
Also: the prefixing increases the length of the hash (and hence the desire to shorten it) without adding any security.
Yeah, kinda agreeing here. The hash length will need to be increased anyway, but a concatenation of SHA1 and SHA256 will be 104 hex characters in total when displayed (40 + 64), which is a lot.
It may be better to display SHA-256 commit hashes, but accept SHA-1 hash prefixes for old commits. It may be confusing for git to accept hashes that aren't visible in `git log`, but it's probably for the better.
I'm well aware concatenation wouldn't necessarily improve the strength. However, the idea is that, even if SHA-1 were hopelessly broken, CONCAT(SHA1(x), SHA256(x)) would be at least as strong as SHA-256 (where "at least" means it may have the same strength).
If you know that it's a concatenation, couldn't you only look at the SHA1 part and completely bypass any other strong hash? On second thought, probably not: a collision you find on SHA1 almost certainly isn't also a collision on all the other hash algorithms. If you bruteforce through a password list it would still apply, though.
This doesn't work for collision resistance attacks. git commits aren't password hashes. Specifically, the attacker's goal in this case is to find different values a and b for which hash(a) = hash(b), rather than finding a value of m in h = hash(m) for known h.
Things like signed commits would still use the full hash, so that would make tampering with that impossible.
This solution would basically just make the UI backward-compatible while still requiring the complete modification of the internals to change the hash function.
You'd still risk a collision if you refer to commits using a shortened hash outside of git but something tells me that you don't even need a vulnerability to take advantage of that if you have an attack vector. For instance github seems to use 7 hex digits in short hashes, this could probably be bruteforced relatively easily (be it for SHA-1 or SHA-256). To give you an idea I looked at the current bitcoin difficulty (which AFAIK uses two rounds of SHA-256 internally and works by bruteforcing hashes with a certain number of leading zeroes) and the hashes look like this: 000000000000000000028048b31e42bd53d3b36da90d1a840ae695ec1a5ee738
This would help if you _only_ shared the prefix, however git would still use the full hash.
The proposed method would have the advantage of keeping existing known abbreviations, which are _already_ less secure than SHA-1, while keeping the security of the second hash.
It also has the disadvantage that the full hash would become excessively large and unwieldy, so pros and cons.
This is pretty interesting and shows you shouldn't try to pull any sort of stunts if you're not a crypto expert. I've actually wondered before whether md5 + sha1 would result in something stronger than those two used individually. Now I know.
By the way, this may be rather obvious, but concatenating hash algorithms is a terrible idea for passwords. A password cracker could easily pick the less secure algorithm to crack, and ignore the other hash.
Note that git doesn't concern itself with reversing a hash function. The commit contents are part of a repository, there is no value in guessing the commit contents basing on its hash. Here, the hash function choice is purely about collision resistance.
But yeah, don't do weird things with hashes. Cryptography is hard. Don't invent memecrypto: https://twitter.com/sciresm/status/912082817412063233, it's not going to increase the security. Use a single algorithm if you can. Don't transform the output of a hash function in any way.
The linked article doesn't contradict the original post. The linked article says the strength of 2 hash algos (of this type) is only as strong as the strongest, and not the sum of their strengths. But the original poster only needed the combined hash to be as strong as SHA256 for his/her purpose.
There is a downside that this would mean commit-prefixes remain sensitive to collisions. Hence anyone checking out a commit by a hash-prefix would still be vulnerable.
Not a dealbreaker by far, but still a slight mark against this solution.
Does git have code to detect whether a hash prefix is ambiguous? I know that if you use a short prefix (which is more likely to be shared by multiple objects), git will output an error message stating that the object reference is ambiguous IIRC.
I'm probably missing something, but isn't it simpler to just make both available separately and allow users to still reference by sha1, if they want to, while sha256 can be used for collision detection by git operations internally?
Wow! I wouldn't have guessed that Git had that vulnerability. Fossil solves it easily: creating a new repo involves generating a random project code (a nonce) which goes into the hash of the first commit, so that even two identical commit sequences won't produce identical blockchains.
Fossil lets you force the project ID on creating the repo, but the capability only exists for special purposes.
Is there an archive of crypto related future predictions?
How long until a specified length preimage attack can break bittorrent blocks?
I remember a paper published a ~decade ago estimating very short (well funded) ASIC sha1 collisions. Anyone have that ref?
EDIT: Should I have not said preimage? My understanding is bittorrent is broken (by DDoS, not infohash(?)) if you can make a bad block that matches the length and sha1 of a target block.
> EDIT: Should I have not said preimage? My understanding is bittorrent is broken (by DDoS, not infohash(?)) if you can make a bad block that matches the length and sha1 of a target block.
There are three different attacks
1. Collision, which is practical (expensive but practical) for SHA-1 today, lets somebody make two documents A and B which have the same hash. This is only useful if you can fool people somehow into accepting document B when they think it's document A because of the hash, for example with digital signatures.
2. Pre-image, which is not practical for any hashes you care about including MD5. This lets you find the document A given the hash(A) value. This is very niche, since obviously for large documents by the pigeon hole principle there will be many such pre-images and it's impossible to get the "right" one, for small inputs it can be relevant, sometimes.
3. Second Pre-image, likewise not practical. Given either document A or hash(A) which you could easily determine from document A, this lets you produce a new document A' that is different from A but hash(A') == hash(A). This would be extremely bad, and is what you'd need to attack real world Bittorrent from somebody else.
Often people say "pre-image" meaning strictly second pre-image, it's usually clear from context, and a true pre-image attack as I explained above is only rarely relevant.
Collision would only let bad guys corrupt their own purposefully constructed collision bittorrent, which like, why? So yes, Bittorrent would only really be in serious trouble if there was a second pre-image attack. But on the other hand, don't use broken cryptographic primitives. Attacks only get better, always.
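For anyone skimming, the three goals above written out as plain predicates (H stands for the hash function; these are just the textbook definitions, not anything git-specific):

    def is_collision(H, a, b):          # attacker chooses both a and b
        return a != b and H(a) == H(b)

    def is_preimage(H, h, a):           # attacker only knows the target hash h
        return H(a) == h

    def is_second_preimage(H, a, a2):   # attacker is handed a specific document a
        return a2 != a and H(a2) == H(a)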
1.5 Chosen-prefix collision: Given a prefix A, generate two values AB and AC, where B and C differ but are both prefixed with A. (AX is A concatenated with X). This exists for SHA1. It's more powerful than a basic collision where you can't pick the prefix, but weaker than either type of pre-image.
It's worth noting that this attack is a property of the Merkle–Damgård hash construction, not of SHA-1 specifically, which means SHA-2 (Git's path forward) is also vulnerable:
Fossil uses SHA-3, which has an entirely different construction, which is not at this time known to have a similar weakness. SHA-3 is also much newer, with a much shorter list of known attacks.
Ha that ELI5 is adorable, I love both how the person trying to answer in the affirmative resorts to more and more frantic hand-waving as it becomes obvious none of what they've said is true and most of it doesn't even make sense, while the person being "flagged" for their supposedly "highly inaccurate" simple statement that er, no, chosen prefix isn't about MD at all remains calm and doesn't care as people insist they must be wrong because after all they were flagged, and why would some anonymous user flag something as wrong unless they were an expert...
Anyway, as hinted above, chosen prefix has nothing to do with the type of hash construction, except in the sense that so far there were lots of Merkle–Damgård hashes and some of them are no longer safe, whereas until recently there weren't many of the Keccak family hashes.
The Wikipedia article is talking about Length Extension, which is a different phenomenon from chosen prefix collision attacks, and if it was a problem in Git (or indeed Fossil) would have doomed them both immediately anyway.
For a generic crypto hash you should use SHA-512/256 (NB this is not offering a choice that slash is part of the name) to avert Length Extension but since the DVCSs already seemingly put the effort in to be safe against it SHA-256 is a perfectly reasonable choice.
No, it's not the same as length extension. SHA1 is vulnerable to Chosen Prefix collisions. SHA2 doesn't have any known collision attacks faster than the birthday bounded brute force attack, let alone any chosen prefix collisions, but both do have length extension attacks. Also length extension isn't specific to Merkle–Damgård, though all Merkle–Damgård hashes are vulnerable to it without mitigations (like truncation of the output).
The reason why people mostly mean second pre-image when saying unqualified "pre-image" is that probably any imaginable method of reversing a hash function (given sufficiently long input to the hash) will with overwhelming probability produce hash input that is different from the original.
Preimage is really a whole different beast than collision.
It's also not particularly surprising. Just by its length, SHA-1 has at best 80 bits of collision security and 160 bits of preimage security.
Now it's important to understand that attacks usually don't cause full devastation; they just make things a bit cheaper than the generic bound.
Attacks in the 60-bit range are what's possible, attacks in the 70-bit range are what's dangerous. It's easy to imagine that a relatively small deviation from optimal security gets SHA-1 from 80 bits into the dangerous territory (the attacks are in the low 60s range). However, getting from 160 bits down to the 60/70-bit range would require massive improvements in attacks.
It's safe to say that SHA-1 is still very far from preimage attacks. Still, to be clear, I'd recommend getting rid of it wherever you can. The far bigger risk is that you think you only need preimage security, while you actually need collision security for scenarios you haven't thought about.
To be fair for MD5 there is a known attack, it's just impractical. It's a real attack though because the whole point of a crypto hash is that you'd have to brute force it to win, and the paper shows a slightly quicker way because MD5 is broken. It's just not quick enough that you could actually do it.
Oh wait, perhaps you actually meant preimage as you said rather than I assumed second preimage. OK yes, that isn't ever going to be possible for non-trivial inputs.
I just want to add something the article couldn't cover. I know bmc and he's both a software geek's software geek and one of the friendliest, most helpful, and most genuine people I've ever met.
I didn't get the argument against just converting? Sure some code bases are large and spread out, but any git repo needs to have one blessed central point, and everyone needs to be able to just re-clone from the central repository whenever history is rewritten for whatever reason (could be that a huge file is trimmed from the past etc). Why can't all commits in the Kernel history be rewritten to SHA256? (Other than that it would be an annoying interruption in the development)?
The kernel doesn't really have the one central blessed point of which you speak. Sure you can grab mainline releases from Linus's repository, but that's not where the development actually happens. It really is a distributed project, and having to delete all those old repositories would really hurt.
If 2 separate copies of the same repository do the same rewrite to sha256, their histories are still compatible and equal up to the point where they diverge. So other than the fact that the rewrite needs to happen in more places, it should still be doable. It needs to happen at more or less the same time, however.
Disappointed they went with an ARX based hash, instead of KangarooTwelve, which uses the Keccak permutation. A lot of people on this thread think that SHA2 is more secure because it is older, but it is my understanding that that is completely wrong. Keccak is not only standardized, to get to that it had to win the SHA3 competition, during and after which it received, as far as I understand, unprecedented levels of scrutiny. And not only that, but, according to what I read, the Keccak-like cryptographic constructions (including the hash) are much more amenable to mathematical/cryptographic analysis because of not using addition (word-wise, instead of bit-wise, to be more correct). The idea is that a resourceful/moneyed attacker (like the NSA or China, etc.) could create successful attacks on an ARX hash without the public being able to reach the same results, because no researchers have access to similar levels of resources.
The sad thing is that the ARX BLAKEx functions seem to be gaining undeserved amounts of hype. I do not think they are getting comparable scrutiny from researchers, seeing as BLAKEx hashes are ARX, and also changed considerably since the SHA3 contest (so it is far from clear that the scrutiny that BLAKE did receive translates to BLAKE2 or BLAKE3).
One thing that should also be noted is that ARX hashes are relatively less well suited to silicon implementations.
The Keccak team published a short and poignant relevant blog post back in 2017 as an answer to that notorious "Maybe skip SHA3" blog post: https://keccak.team/2017/not_arx.html
> The nuance that's being made here is that the public cryptanalytic results we have are from researchers that need to publish. However blackhats (be it government or private) have no such need. Thus, they do not care if the analysis is elegant or neat.
> This means that ARX functions will have less published analysis, but may still be successfully attacked.
> This isn't even a new argument they're making here. It's been well understood that simple cipher designs are better, because they are easier to understand. If you can understand it well, yet not break it, that gives confidence. If you don't understand it, it might break as soon as you do.
Since collision resistance is roughly half the number of bits, it seems unconscionable to me that hashes below 256 bits even exist, because 64 bits is crackable but 128 bits effectively never will be. This was well-understood even in the 90s when MD5 and SHA were first published.
Just thinking about this for the first time, I don't buy any argument about storage or performance, since those become less important as time goes on. It feels like Linus made a mistake here, and offloaded the inevitable work of upgrading repositories onto the general public (socialized the cost) which is something that all programmers should work harder to avoid.
Said as an armchair warrior who has never accomplished anything of any importance, I realize.
I don't understand the practical attack vector for breaking SHA1s in Git. Not only are objects checksummed by SHA1, they also encode the length. Finding a SHA1 collision is plausible, but finding a SHA1 collision that both lets you do something Nefarious, and is the length you need, seems really really unlikely
The shattered collision attack featured two pdfs with the same sha1 and, wait for it, the same length. Also note that even with normal sha1, the length is hashed into the final sha1 hash already; that's what the Merkle–Damgård scheme is about. You can read about it on Wikipedia.
Reusing the precise collision from the shattered attack is made impossible by initializing the state with anything other than the prefix from the shattered attack. But the cost for mounting such an attack yourself is only 11k USD. However, as git uses the sha1collisiondetection library, such an attack would be detected by current git. Thus, this library is a much better protection than the length encoding.
You're assuming that 100% of the source code matters, but most source code has comments. Some has a lot of comments (boilerplate headers). Delete all the comments and superfluous whitespace, add nefarious code, put in a comment in the remaining bytes for the sole purpose of causing a hash collision (likely plenty of bytes to play with).
> this new version would have to contain the desired hostile code, still function as a working floppy driver, and not look like an obfuscated C code contest entry
It's still plausible that one can pull a trick like that to introduce malicious code into the repo, but improbable.
This makes no sense. The collision manufacture algorithm of course produces the same length output in both the A and B documents. Doing otherwise would be considerably harder in fact.
The author does seem to concede that hitting all the checkmarks in an attack on git would be pretty tricky:
> An attacker would not just have to do that, though; this new version would have to contain the desired hostile code, still function as a working floppy driver, and not look like an obfuscated C code contest entry
The whole idea is that they want to switch away before these things become likely. They are unlikely now, but SHA-1 is only getting weaker as time goes by and more research is done.
To make the collision work you need to produce two different files, both with some randomish looking junk in them. So if you can do that in a way where you can substitute one of the files for the other without getting caught then you are almost for sure smart enough to also figure out a way to make the lengths the same.
As mentioned, the shattered PDFs[1] have the same length, however it's worth noting that adding the Git header breaks the matching, ie. you get different SHA sums for the files in Git because of the header.
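A small sketch of why the header matters (if you grab the two SHAttered PDFs you can check this yourself): the raw files share a SHA-1, but git hashes a length-prefixed header plus the content, which shifts SHA-1's internal state before the crafted colliding blocks, so the blob ids come out different.

    import hashlib

    def git_blob_sha1(data: bytes) -> str:
        # Git names a blob by hashing "blob <size>\0" + content, not the raw bytes.
        return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

    # For shattered-1.pdf and shattered-2.pdf: hashlib.sha1(pdf).hexdigest() is
    # the same for both files, but git_blob_sha1(pdf) differs between them.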
> There is, of course, a way to unambiguously give a hash value in the new Git code, and they can even be mixed on the command line; this example comes from the transition document:
When people started using the phrase "Stockholm Syndrome" with respect to git I took it as a sort of hyperbole. A rhetorical device.
But the more 'improvements' they make to it the more literal that accusation becomes in my head. And what's worse is that I've grown enough calluses now that my response is an eyeroll instead of pain. I use git all the time, but it's terrible and I need something that is better, not just sucks less. And apparently soon, because I don't know when that koolaid is going to start looking good but it's not long now.
"Tire of it quickly" and "have an immediate gag reflex" are two completely different categories of negative reaction.
It's hard to see the sunset when you're down in the muck, and eventually 'less bad' starts to look like progress to you. It's a trap and you should be aware of it.
One alternative would be to just do lookups in both hash databases (until SHA1 is fully migrated away from), and reject invocations that conflict. Git's CLI already rejects ambiguous short hash prefixes for SHA1, it could easily reject ambiguous prefixes between SHA1 and SHA256 and otherwise allow unique prefixes for either hash. This would be pretty ergonomic for users.
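A sketch of that lookup rule (the names are made up, and real git would consult its object databases rather than in-memory sets): resolve a prefix against both stores and refuse anything that matches more than one object, exactly like the existing short-SHA1 ambiguity check.

    class AmbiguousPrefix(Exception):
        pass

    def resolve(prefix: str, sha1_ids: set, sha256_ids: set) -> str:
        # Succeed only if exactly one object, across both databases, matches.
        matches = [oid for oid in (*sha1_ids, *sha256_ids) if oid.startswith(prefix)]
        if not matches:
            raise KeyError(f"unknown revision {prefix!r}")
        if len(matches) > 1:
            raise AmbiguousPrefix(f"{prefix!r} matches {len(matches)} objects")
        return matches[0]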
For most cases that would suffice and would be ergonomic, but what if a full SHA1 also qualifies as a prefix of one or more SHA256, and you want the SHA1? There's still a need for a mechanism to disambiguate for these cases, even if it ends up very rarely needed.
You're talking about a 160 bit truncated hash collision on SHA256, which is extraordinarily unlikely if SHA256 is not itself completely broken (moreso than SHA1 already is!). I don't think any syntax is needed for that in the porcelain CLI; it could be handled with non-user-facing commands if it ever came up (it won't).
> extraordinarily unlikely if SHA256 is not itself completely broken (moreso than SHA1 already is
I was hoping I captured that by saying "very rarely". However, if SHA1 collisions can be made willingly, doesn't that mean that one can also willingly make a SHA1 hash that matches with the prefix of an existing SHA256 hash?
As far as I know, that kind of collision isn't practical at this time. So predicating UI decisions on that basis seems like a mistake to me (given how long git has already ignored the looming threat of SHA1 being broken).
When and if someone injects a SHA1 attack into your repository, and the main git CLI throws up its hands and says "hash collision" trying to access it, I'm not seeing major problems here. The git CLI doesn't need to provide convenient commands to interact with attacks that are not practical today. To the extent that these will become practical, I think git should drop the SHA1 lookup after a migration period regardless, and it would not hurt to provide a gitconfig knob to disable SHA1 lookup.
> doesn't that mean that one can also willingly make a SHA1 hash that matches with the prefix of an existing SHA256 hash?
No, the “prefix of an existing SHA256 hash” stops being relevant at that point – that’s just a full preimage attack on SHA1, which isn’t known to be feasible yet.
> I was hoping I captured that by saying "very rarely"
Don't cave in to sky-is-falling bullshit regarding the existing SHA1.
Git is not a crypto system; it's just version control.
We've used version control systems just fine that had no integrity features at all. For instance you can go into an RCS ,v file and diddle anything you want. Some BSD people are still on CVS, and their world hasn't fallen apart.
That cute rhetoric will not fool anyone. Common git workflows use fairly succinct git commands:
git diff
git commit -p
git rebase -i HEAD~3
The command quoted in my original comment is just this once we strip away the SHA256 garbage:
git log abac87a..f787cac
(Or maybe it is:
git log abac87a^..f787cac^
I cannot guess whether the ^ operator still has the same meaning or whether it is part of this ^{sha...} notation.)
The hashes will typically be copy and pasted, so you type just the git log, .. and spaces.
The fixed parts of convoluted git syntax can be hidden behind shell functions and aliases. But notations for referencing objects are not fixed; they will end up as arguments.
> I cannot guess whether the ^ operator still has the same meaning or whether it is part of this ^{sha...} notation.)
This isn't the first ^{...} notation. The manpage gitrevisions(7) also mentions <rev>^{/<text>} for referencing a commit based on a regular expression of its commit message, e.g. HEAD^{/fix nasty bug}.
Though, this new notation is probably more in-line with the notation <rev>^{<type>}, which lets you disambiguate what you put in <rev> as in deadbeef^{tag}, so that it's not confused with deadbeef^{commit}.
EDIT: The article doesn't mention it, but I imagine one interpretation would take precedence and cause git to issue a warning when it's ambiguous. Right now, if I tag a commit with the hash of another commit, its interpretation as a tag takes precedence and I get a warning at the top, "warning: refname '368bc6e' is ambiguous." That would mean you'd only ever write ^{sha256} when the provided part of a sha256 hash is ambiguous with an existing sha1 hash or something else like a tag. That's also vice versa with ^{sha1}.
Since git is something that I rely on for everyday use, and long-term data storage, and its development is being threatened by the inclusion of moronic changes I completely disagree with, I'm completely unreceptive to jokes. This is no laughing matter.
Articles like this are eye opening to me, in a bad way. Every once in a while, I get really curious about giving Fossil a try, because it does have some legitimately cool ideas, and then I see the documentation saying things like:
> Rebasing is the same as lying
And I think, "Holy crud do I not want to be part of this community."
The nice thing about Git is that (within reason) once I understood it, I was able to use it in very flexible ways.
It's really common for different projects I manage to range all over the place from the extreme "commits as literal history" perspective all the way to the "commits as literature/guide" perspective. Sometimes I don't rebase at all, sometimes I rebase a lot. Sometimes I commit everything, all the time, sometimes I refuse to commit any code that isn't a deployable feature. Sometimes I leave branches as historical artifacts, sometimes I don't care about history and I'm just trying to coordinate developers across timelines.
That's not to say that Git isn't opinionated about some things -- nearly all good tools have at least a few strong opinions. But Git passes the (IMO extremely low) bar of not conflating a workflow decision with a moral failing. Over the years as a software engineer, I've learned to be somewhat skeptical of programming/workflow heuristics advertised as rules, and to be very skeptical of heuristics advertised as ideologies.
I really don't understand the perspective of someone who can't think of even one good reason why they would ever want to edit history. You've never accidentally committed a password to repo, or had to respond to a takedown request?
> sometimes I don't care about history and I'm just trying to coordinate developers across timelines
The fact that Fossil preserves history does not prevent you from coordinating with people across timelines. It is rather the whole point of a DVCS.
> conflating a workflow decision with a moral failing
I think it's fairer to say that we don't think a data repository is any place for lies of any sort, even white lies.
> I've learned to be somewhat skeptical of programming/workflow heuristics advertised as rules, and to be very skeptical of heuristics advertised as ideologies.
Sure, flexible tools are often better than inflexible ones, but you also have to consider the cost of the flexibility. Here, it means someone can say "this happened at some point in the past," and it's just plain wrong.
That isn't always an important thing. Most filesystems and databases operate on the same principle, presenting only the current truth, not any past truth.
Yet, we also have snapshotting in DBMSes and filesystems, because it's often very useful to be able to say, "This was the state of the system as of 2020.02.04."
You don't need a snapshotting filesystem for everything, and you don't need Fossil for everything, but it sure is nice to have ready access to both when needed.
> You've never accidentally committed a password to repo, or had to respond to a takedown request?
> I think it's fairer to say that we don't think a data repository is any place for lies of any sort, even white lies.
I, too, wish this extreme hyperbole would be just left out of the discussion completely. It is offputting, and I think it's intentionally a bad-faith argument: it fails to acknowledge the utility, the design intent, and the context behind rebase, which have been talked about at length by Linus and others.
When rebase is used as designed, according to the golden rule, it's not modifying published history, so it's not "lying". Whether rebase has safety problems is a separate issue from whether it's use as designed amounts to being "dishonest".
I'm all in favor of improved design choices, and if Fossil is making those better design choices, let them stand on their own without intentionally denigrating git and every user of git through utter exaggeration.
My understanding is that shunning is blacklisting specific artifacts. That's nice, but I don't understand how that solves the problem.
When I revise history in Git, even if it's just doing something as simple as removing sensitive information, I often need to replace that information, either through new commits, or by introducing minor edits to surrounding commits. I could add those changes on top of my current HEAD, but then checkouts of old versions would be broken. On the other hand, if I can just replay my commits while inserting extra code, I'll end up with something that's pretty close to my original history, with just the offending information excluded/replaced.
That carries the cost that people will need to force pull my repo, but at least the repo history will still roughly correspond to what development looked like, rather than being out-of-order and mostly impossible to build except for at my current HEAD.
As a followup question, what do you do if the sensitive information you need to exclude is in a commit message? `amend` won't help you, since it's not destroying information. Do you shun that commit and then... what?
It just seems like destroying information isn't enough unless you can also replace it?
> Sure, flexible tools are often better than inflexible ones, but you also have to consider the cost of the flexibility.
I appreciate this -- I like having multiple tools for different purposes. I don't see a problem with having a VC that focuses on auditability, or having one that goes in a radically different direction from Git. Fossil has very interesting ideas, which is why I try to pay it some attention whenever I see it mentioned or linked to.
However, whenever I follow those links and start digging deeper into the philosophy behind its design decisions, inevitably the conversation changes from, "here's our alternative approach to Git" to "what Git does is fundamentally wrong". It's not, "Fossil doesn't have this problem because we eschew rebasing", it's "why would anyone rebase?"
(Nearly) all architectural decisions have good and bad consequences. Sometimes those consequences are imbalanced, so we have heuristics that can say things like, "often X is a bad idea." That's fine.
More harmfully, sometimes people extend heuristics into rules that say, "it's never a good idea to do X". Programming rules are usually wrong.
But programming ideologies are the worst, because they say, "there is something mentally or morally wrong with a person who would do X". This is toxic for the reasons that Fossil devs already mention in their documentation:
> programmers should avoid linking their code with their sense of self
Programming ideologies explicitly encourage developers to have egos, because ideology conflates architectural decisions and workflow processes with individual worth. Programming ideologies make it harder for people to grow as programmers, because they tie intellectual growth to fears about being wrong. They're completely toxic.
And is Fossil's documentation promoting an ideology? I'm guessing that you'd disagree with me on this, but my take is that when Fossil's official documentation says things like:
> Honorable writers adjust their narrative to fit history. Rebase adjusts history to fit the narrative.
or
> It is dishonest. It deliberately omits historical information. It causes problems for collaboration. And it has no offsetting benefits.
That's not designing a focused tool to support specific heuristics, or making a case that, "sometimes strict auditability is important". That's just trolling for fights.
Which of the two philosophies matches better with the way your project works? That alone is a pretty good guide to whether you want Fossil or Git. (Or something else!)
I generally try very hard not to argue about definitions, and it seems like you're using the word "ideology" differently than me. To get past that, substitute out my word "ideology" for "foobar".
A foobar is an assertion that there is a single right or wrong way to look at the world. Not even just a single correct or incorrect way, but a right way -- a proper way. If the problem with a rule is that it overgeneralizes what the world is, the problem with a foobar is that it generalizes what the world ought to be. To the extent that a foobar allows space for deviation or alternate approaches to architecture, it's only with the implicit understanding that those deviations are, on some level, a kind of small sin.
Not all foobars are necessarily wrong, but in the world of software, they are particularly dangerous, and should be approached with caution. Under a foobar, a rebase isn't an organization strategy, it's a "white lie". A writer isn't optimizing for a specific audience or purpose, they're "honorable". An agreed-upon set of rules for everyone accessing a repo can be "dishonest".
Different people have different standards for this kind of thing -- but is it really all that weird or abnormal to worry that this kind of language can encourage toxicity in a community, or that it could encourage developers to think of architectural outcomes as personal validations or attacks? To me, that language sounds very foobar, and it makes me nervous about what experience I'm going to have if I adopt Fossil and then start asking questions to the community about how to use it in unconventional ways.
To be clear, there are other pages in Fossil's documentation that are much, much better about this kind of thing (particularly the Fossil vs Git page).
But even on those pages, the thing is: I use Git constantly. I am intimately familiar with its strengths and flaws. I really don't need the documentation to tell me that Git's storage is an "ad-hoc pile-of-files", because I've worked with those files before and built 3rd-party tools to manipulate them, and while there are flaws, sometimes being able to do a completely dependency-free read on any OS/platform to find the current HEAD is quite useful.
When I read the docs, I just want to know what makes your software different. You're not going to convince me that actually all of my experiences were wrong, and everything I like about Git is secretly terrible. You might be able to convince me that there are specific problems Git isn't optimized for, and that Fossil can solve them.
When Fossil is talking about Bazaar and Cathedral development, I'm really interested in learning more. When Fossil is taking cheap shots at purposeful design decisions in Git that are actually really good for certain classes of problems, I lose confidence that the docs know what they're talking about.
For what it’s worth, I could not agree more, and I wish I could upvote this twice.
If I were to attempt to help with the terminology, instead of sorting out the definition of ideology, I might say you’re talking about dogma and wyoung2 was referring to philosophy most directly above, but indirectly using philosophy to justify dogma.
There’s a fairly stark irony here in using words like ‘lie’ and ‘dishonest’ to judge this git workflow while at the same time taking cheap shots... but in the end I suppose the Fossil devs can describe things any way they want, and I don’t have to like it or use Fossil.
I have no interest in Fossil because it stores stuff in sqlite databases instead of the filesystem which I think is a stupid approach. I'm also not interested in version control systems that are dragging along a wiki and bug tracker. I just want a C program in /usr/bin that does version control.
If you think your filesystem-based Git repo is easy to manipulate, go poking around in there, and what you'll find is a bespoke one-off pile-of-files database! Given a choice between Git's DB and SQLite, I put more trust into SQLite.
> I just want a C program in /usr/bin that does version control.
...which Git doesn't provide. Git is hundreds of files scattered all over your filesystem, a large number of which aren't C binaries anyway, and of those that are, only one of them is the front-end program sitting in /usr/bin, whereas Fossil can be built to a single static executable in /usr/bin.
And if you can't build Fossil statically on your system, it's likely due to an OS limitation rather than something about Fossil itself, as on RHEL where they've made fully static linking rather difficult in the past few releases.
Getting back to Git, large chunks of Git are written in POSIX shell, Perl, Python, and Tcl/Tk. Almost all of Fossil is written in C, and the rest of the code is embedded within that binary running under built-in interpreters rather than depending on platform interpreters.
This has nice knock-on effects, one of which is that Fossil is truly native on Windows, whereas you have to drag along a Linux portability environment to run Git on Windows. Another is that Fossil plays nicely with chroot/jail/container technology.
> I'm also not interested in version control systems that are dragging along a wiki and bug tracker.
The diatribe against rebasing is stupid. In fact, not having more than one parent is a good thing, because with multiple parents you don't know which one is relevant. The history has turned into a hairball. When you try to navigate back in time, you face forking roads at every step, and it turns into a maze walk.
The point is valid that when we rebase, we are losing history: the context of where that change was originally parented.
However, (1) the history does not matter if the change was parented in some temporary context, like your unpublished changes and (2) the information can be tracked in other ways, such as a Gerrit Change-Id (or something like it) in the commit message.
Regarding (1) the extra parent pointers in a merge commit cause retention of garbage. If we do everything with merge instead of rebase, we will never lose any of the temporary commits. If we prepare an unpublished change through numerous rebase operations, all that temporary crap will stay referenced from the head, waste space and confuse other people with irrelevant information when they try to navigate the history.
> history does not matter if the change was parented in some temporary context
It does if it means a big ball o' hackage lands on the public working branch, since it complicates merges, backouts, cherrypicks, and bisects.
Git users can also hide individual commit messages behind one big combined message, losing part of the project's development history and logical progression.
When I pull your repo and build it, and I find that it doesn't build on my system, I don't want to dig through a 500-line merge commit to figure out why you changed this one line from the one that used to build last week, I want the 14-line diff it was part of so I can begin to understand what you were thinking when you committed it. If I later find out that that 14-line change was wrong but the rest of your 500-line merge was fine, I want to be able to back it out with a single command. (In Fossil, it's `fossil merge --backout abcd1234`.)
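For Git users following along, a rough equivalent, assuming the 14-line change survived as its own commit instead of being squashed into the merge (the hash is a placeholder):

    # undo just the small commit, leaving the rest of the merged work in place
    git revert abcd1234

If it only exists inside a squashed 500-line commit, there's no single command that targets it; you're back to reading the big diff and reversing the hunk by hand.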
> confuse other people with irrelevant information when they try to navigate the history.
How much time do you spend navigating the project's history vs looking at the tip of the current branch?
I'd wager that the times you dig back into the history, it's because you are in fact trying to figure out why you got here, which means a trail of detailed breadcrumbs will be more likely helpful than "...and between one week and the next, something changed in commit abcd1234, but we've lost all of its internal context, so we'll be spending next week reconstructing it because Angie's on vacation now."
It's reasonably common for me to start exploring a problem space, stub out a concept, and have a long drawn-out conversation with the compiler that touches many files, before finally reaching a point that is working enough to be interesting.
At that point, I can take a step back and note that actually, not all of those changes have to be made all at once, and I can break that patch up into a bunch of simpler pieces.
In Git, I have two essentially-equivalent choices:
1. stage the commit in small chunks, adding a separate descriptive comment for each.
2. commit everything as a WIP commit so this working state is in the reflog, then break it up into smaller commits with interactive rebase.
In either process, I'm able to get a better comprehension of my own thoughts along the way.
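A minimal sketch of both options, with made-up file contents and commit messages:

    # Option 1: stage the work in reviewable pieces
    git add -p                         # interactively pick the hunks for one logical change
    git commit -m "extract parsing helper"
    git add -p
    git commit -m "handle nested includes"

    # Option 2: snapshot everything first, then carve it up
    git commit -am "WIP: end-to-end prototype"
    git reset HEAD~1                   # drop the WIP commit but keep the changes;
                                       # the snapshot stays reachable via the reflog (HEAD@{1})
    git add -p                         # now build the smaller commits as in option 1

(The reset here is a simplified stand-in for doing the same split via interactive rebase.)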
Fossil, refusing both staging and interactive rebase, would force me to commit the proverbial 500-line blob all at once, which is less helpful to the reviewer trying to discern my thought process.
If rebasing isn't important to your workflow, Fossil probably is a better choice for you than Git. It has a lot of comforts that I really appreciated even eight years ago when I did frequently use Fossil (distributed wiki and tickets are really nice, the web interface serving raw artifacts from any point in time is great for HTML5 game jams, SQLite is a very portable repository format, etc. etc. etc.), and I'm sure it's only improved in that regard. The only reason I use Git for personal projects is because rebase is that helpful to my process.
> Fossil...would force me to commit the proverbial 500-line blob all at once
Nope.
If it were me doing such a thing as you describe, I'd start the work on a feature branch. If I'm working on that repo with other active developers, this lets them see what I'm up to and possibly help; and if not help, then at least be aware about where my head's at, so they can better predict what's likely to land on the shared working branch later.
If I got to a point where only part of the branch needed to be applied, I could cherrypick those individual changes, either down to the parent branch or up to a higher-level feature branch.
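A minimal sketch of that flow in Fossil, with made-up branch names and check-in hashes:

    # start the experiment on its own branch so others can watch it evolve
    fossil commit --branch feature-parser -m "WIP: stub out new parser"
    # ...more check-ins land on feature-parser as the idea firms up...

    # later, apply just the one change that's ready onto the parent branch
    fossil update trunk
    fossil merge --cherrypick abcd1234
    fossil commit -m "Cherrypick parser fix from feature-parser"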
All of this happens in public, with the work fully recorded, so someone doesn't have to reconstruct the development history after the fact later.
This mode of development helps keep your project's bus factor above 1.
You can commit code that doesn't compile onto a feature branch so people can see what you're up to, I guess, but I don't see that helping with bisecting later, and I wouldn't expect the commit messages to be useful.
To be clear, my typical approach is certainly to commit every time I return to a working state. But in more experimental modes, I often reach that point the long way around and end up with multiple semantic changes I wish to break apart for study.
Unless I'm missing something and Fossil has gained the ability to cherry-pick selective lines from a commit.
> It does if it means a big ball o' hackage lands on the public working branch, since it complicates merges, backouts, cherrypicks, and bisects.
Rebase on a local working copy is normally used for cleaning up a string of commits that is messy and/or separating commits that mixed multiple logical changes together.
Local commit history before push is arbitrary. There’s nothing sacred that needs to be preserved about the exact order I typed things into each file, that’s not what I want from a version control system.
Personally, I haven’t really seen use of rebase complicating merges, reverts, cherry picks, or bisects. I can imagine ways it can happen, but I haven’t seen it be a problem in practice. However, I have seen cases where failing to rebase caused problems. Allowing build breakage between two commits is an example where bisect is affected, and squashing the fix into the first commit before push is much preferred.
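For what it's worth, that pre-push squash is a two-liner in Git (the hash is a placeholder for the broken commit):

    git commit --fixup=abcd1234            # commit the fix, marked as belonging to abcd1234
    git rebase -i --autosquash abcd1234~1  # fold it in, keeping every commit buildable for bisect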
So, anyway, your example feels totally contrived.
> How much time do you spend navigating the project’s history vs looking at the tip of the current branch?
This is a false dichotomy. I need both. I happen to navigate project history quite a lot, like multiple times per day. In addition to how I got here I usually need to know who changed it, so I can talk to them.
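For what it's worth, in Git that usually boils down to the usual suspects (the path is a placeholder):

    git blame src/parser.c                # who last touched each line, and in which commit
    git log -p --follow -- src/parser.c   # the file's full change history, across renames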
If you try to use Fossil 1.37 — the last 1.x release — to clone a repo that has SHA-3 hashed artifacts in it, it says, "server returned an error - clone aborted". Since 1.37 pre-dates this feature, it can't give a more detailed diagnosis than that.
If you have an old clone made from before the transition and try to update it, I'm not sure what it says, since I don't have any of those around any more. It has, after all, been three years since Fossil began to move on this problem, so that it's largely a past issue for us now.
This transition time was indeed annoying for us over in Fossil land, but Git's going to have to go through a transition like this, too. The question isn't whether but how long we'll have to wait for it to begin and how long it'll take to complete.
In large measure, you actually can't, since there's a good chance those repos are behind HTTPS-only these days, and those versions of Git will be linked to ancient versions of OpenSSL that won't even talk to modern TLS implementations, the two being unable to agree on a common ciphersuite.
Beyond about 10 years, you usually end up freezing old binaries in place along with old data in order to continue manipulating it anyway.
As others have pointed out, there already is precedent for ^{...}, so if you're comfortable with the other uses, I'm not sure why you should NOT be comfortable with this new addition.
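For anyone unfamiliar with the precedent, the existing `^{...}` "peeling" syntax looks like this (assuming a tag named v2.0 exists); the proposed hash qualifier would follow the same shape:

    git rev-parse "v2.0^{commit}"   # the commit the tag ultimately points at
    git rev-parse "HEAD^{tree}"     # the tree object of the current commit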
Is a significant part of git's typical profile spent computing hashes? I'm genuinely asking because I don't know the answer. I'd expect all the diffing and (potentially fuzzy) merging to be significantly more expensive operations, at least as far as big-O is concerned.
> Is a significant part of git's typical profile spent computing hashes?
No.
Hashes are really cheap.
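Easy enough to confirm on your own machine; exact numbers vary with the CPU and whether SHA extensions are available, but both hashes run at hundreds of MB/s or better:

    dd if=/dev/urandom of=/tmp/blob bs=1M count=256   # 256 MiB of test data
    time sha1sum /tmp/blob
    time sha256sum /tmp/blob
    time git hash-object /tmp/blob                    # header + SHA-1, what git does per blob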
This annoys me a bit, because every discussion about hashing descends into endless bikeshedding over which hash function to use. The simple truth is: SHA2, SHA3, and Blake2/3 are all good enough, from both a security and a performance perspective, for almost any use case, and the advantages and disadvantages are so minor that it really doesn't matter.
Length extension is an unnecessary problem in MD constructions. It makes sense to get rid of the problem. So if you are building a new thing today, there's some sense in not picking SHA-256, so that you won't later hit your head on a length-extension attack. SHA-512/256 (that's a single hash in the SHA2 family, not a choice between two) is a reasonable choice, though of course if Git were vulnerable to length extension somehow, they'd have been in trouble years ago, so for them, why not SHA-256.
The length extension attack is a non-issue for Git’s use case, and SHA-256 (unlike SHA-512) benefits from having hardware acceleration in the new Ice Lake Intel chips (as well as on the AMD side of things), and has been around 11 years longer than SHA-512/256. And, yes, there are places which say “If you will use a hash, you will use SHA-256”.
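Both are a one-liner to try, assuming an OpenSSL new enough (1.1.1+) to know the sha512-256 name:

    echo -n "hello" | openssl dgst -sha256
    echo -n "hello" | openssl dgst -sha512-256   # SHA-512 internals, truncated output, no length extension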
Personally, the last time I was in a place where I had to choose which cryptography to use, I used SHA3’s direct predecessor, RadioGatún, because I needed a combined hash + stream cipher and, at the time (late 2007), RadioGatún was the only option.
RadioGatún also benefits from being about as fast as BLAKE2 (it would be faster in hardware, FWIW, having SHA3's hardware advantages), and is approaching 14 years old without being broken by cryptanalysis. Also, unlike BLAKE2/3, and like SHA3 and all sponge functions, it's computationally expensive to "fast forward" in RadioGatún's XOF (stream cipher, if you will) mode, which is beneficial for things like password hashing. Another nice thing about RadioGatún: it doesn't have any magic constants in its specification, allowing a useful implementation to fit on my coffee mug.
If someone asked me which hash algorithm to use, I would suggest SHA-256, unless I thought they needed protection from length-extension attacks (so SHA-512/256), or needed an XOF (stream-cipher-like) construction (so SHAKE256).
If performance mattered more than a conservative security margin, BLAKE3 (software performance) or KangarooTwelve (a SHA3 variant; excellent hardware performance) would be good choices. If I were to choose a hash + XOF for use today, I would use KangarooTwelve's variant with a somewhat larger security margin: MarsupilamiFourteen.
Cryptographically strong random numbers in MaraDNS 2.0. The hash nature of RadioGatún allows me to combine multiple entropy sources with varying amounts of randomness to seed it, then use it as a stream cipher to generate good random numbers. This way, the DNS query ID and source port are hard to guess, making blind DNS spoofing harder.
The nice thing about RadioGatún is that it only takes about 2k of compiled code (and can fit in under 600 bytes of source code, as seen in the parent) to pull all this off.
This was the best way to pull it off back in 2007, when RadioGatún was the only secure Extendable-Output Function (XOF) that existed.
Linux also has the ethos of choosing boring technology. SHA2 has been around for so long and is battle-tested. For the majority of us, it is the natural choice. I'm not implying anything negative about SHA3/Blake/Keccak.
The decision was made before the release of Blake3. The article did mention the algorithm is no longer hardcoded (hence the ability to support both SHA1 & SHA256). This means it's possible to transition to Blake3 (or any other) in future, though it won't be trivial.
Of course, processors that use one of the Atom/Celeron/Pentium microarchitectures are not the best choice if you desire maximum speed, but otherwise they are surprisingly interesting processors (IMHO much more interesting than what Intel delivers with the Core series).
At this time, Intel often experiments with or introduces features that are particularly interesting for embedded use first on the Atom, for example the already-mentioned SHA-NI. Another example is the MOVBE instruction (insanely useful if you handle big-endian data, for example in network packets; I am aware that older x86 processors already have the BSWAP instruction) - it was first introduced with Atom.
There are organisations that can only use approved crypto for various certifications and government contracts. It would be bad to drive such users away from git.
Does anyone know of a standard format for a sort of tagged-union hash type, something similar to the crypt format for passwords? It feels like everyone needs to support multiple hash types at some point, and basically has to reinvent that particular wheel again and again.
It isn't too bad to just exhaustively look up provided hashes in all your databases (at least, for Git). You should probably only support 1 primary hash at a time, and 1 additional legacy hash for migration purposes. This makes lookup twice as expensive; for git, this is not usually the slow part (the slow part is 'git status' having to compare the entire local filesystem checkout to the repo).
The above article suggests that a SHA-1 collision is infeasible because the attacker has to come up with code that not only generates the same hash but also benefits him. But can't he just add some malicious code and then some random text in comments to produce the same hash?
"produce same (specific) hash" is a pre-image attack, which is very very hard. So hard, that even MD5 isn't broken for pre-image, and there's only a theoretical pre-image attack against MD4.
We only know of collision attacks, which is "produce 2 files with the same hash, but you can't control which hash". So you can't target any existing repo. You need to use social engineering to get one of your special files into a repo.
Stating the obvious, but the hash is hex, which leaves lots of characters free for a one-character prefix on SHA-256 hashes - the character "s", for instance.
The problem is that if you can make evil code with the same hash as innocuous code, you can poison people who pull from a given repo you have access to. It would allow you to make changes to the history without merging anything or anyone being the wiser.
It makes the distributed aspect of git untrustworthy, as previously you knew if you pulled from anywhere and the hash was good, you’d pulled the correct code. With SHA1 being functionally broken that’s no longer necessarily the case.
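That hash-is-the-identity property is exactly what today's verification commands lean on; neither of these proves anything stronger than "the content matches its SHA-1 name", which is what a practical collision would hollow out (the tag name is a placeholder):

    git fsck --full       # recompute and check the hash of every object in the local repo
    git verify-tag v2.0   # checks a GPG signature, but the signature covers a SHA-1 name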
OP's link says it is "subscription-only content," but it is still publicly available. It says that it has been "made available by an LWN subscriber." How does that work?
Why would they support it? The article clearly states it is nowhere close to being useful yet.
It is untested, unstable code that can only write to repositories and not read them.
"Much of the work to implement the SHA‑256 transition has been done, but it remains in a relatively unstable state and most of it is not even being actively tested yet. In mid-January, carlson posted the first part of this transition code, which clearly only solves part of the problem:
"First, it contains the pieces necessary to set up repositories and write _but not read_ extensions.objectFormat. In other words, you can create a SHA‑256 repository, but will be unable to read it. "
"First, it contains the pieces necessary to set up repositories and write _but not read_ extensions.objectFormat. In other words, you can create a SHA‑256 repository, but will be unable to read it. "
I am not a fan of SHA-256; you are better off with SHA-384 or SHA-512/256, which resist length-extension attacks and are actually a little faster on 64-bit machines.
Unless I'm missing something, why not just allow repositories to be upgraded to SHA2 hashes? The only problem is ensuring everyone's tooling supports it.
I don't think it's that unreasonable to release git binaries today with sha256 support, then wait 5 years, then make all new commits use sha256.
Anyone who tries to use a git client more than 5 years old wouldn't be able to pull+push to a new repo. Sounds reasonable to me. Git clients more than a few years old are pretty broken already due to TLS changes.
Keeping around a dual hash system forever sounds like baggage and complexity that outweighs the benefits.
It isn't the easiest article to read, plus they overcomplicate things by talking about details such as truncating SHA2 hashes.
I don't see why changing the hashing algorithm is so problematic, hence my question. Converting a repository to SHA2 should be straightforward (the only issue is everyone's tooling); you could also run the repositories side by side. I'm genuinely interested, as I think Git & BitTorrent are quite elegant solutions to complex problems.
Exactly! If you've ever worked in a corporate environment, you know the fun of having to support 10-year-old versions of your favorite cutting-edge software.
> Thus, unlike some other source-code management systems, Git does not (conceptually, at least) record "deltas" from one revision to the next. It thus forms a sort of blockchain, with each block containing the state of the repository at a given commit.
Color me surprised, dropping the "blockchain" word in the middle of the introduction
It is still a sort of namedropping. In the sense that it is used due to the trendiness of the term.
It is entirely possible and likely that it is used for didactic purposes as many people are familiar with the blockchain structure and its use of hashes.
I thought it was a joke - the whole "blockchain isn't that innovative if you use a strict technical definition, because lots of things were a chain of blocks before Bitcoin was cool" meme.
And honestly, fair enough. The innovative part of Bitcoin is not the blockchain but all the economics & game theory going on to create trust in the system.
I think this is why the parent was criticizing it. "Blockchain" as a term generally means "crypto-magic-stuff on a blockchain", so for the article to use it instead of the more academic Merkle tree (or Merkle DAG, if that exists) sounds a bit like low-effort name dropping.
Again it is not a criticism of the article, but it is not a criticism of the criticism either.