In The Black Swan, Nassim Nicholas Taleb argues that applying game theory to real-life situations is one instance of the ludic fallacy. https://en.wikipedia.org/wiki/Ludic_fallacy
Fair warning: N. Taleb is more or less an anti-intellectual. He hates on most formal models he didn't write himself and most of his books are musings on heuristics.
When you first design your program, you might indeed use a different data structure than a linked list when you anticipate 10k+ elements.
But often programs are used long after they are designed, and it is great if they can degrade gracefully as you move outside their original design parameters.
True, irrational shared a first-hand experience and then proceeded to speculate as if that experience were evidence that smoking is more prevalent in Canada than in the US.
If you have a repo with a lot of GPG signed commits, or you just don't want to change all your commit IDs (because you reference them in other places), then it'd be very valuable to be able to have a repo that's mixed old and new hashes.
Also, your Git binary, if compiled with only the One True Hash™, wouldn't be able to work with older repos at all, because the hashes it computes would now be different.
(Edit: Another benefit of generalizing this is so that if/when, in the future, the new hash algorithm must be abandoned due to weaknesses, Git tooling will have been already introduced to the notion that hashes can be different and should hopefully be a less involved migration the next time around)
> Yes, this is correct. The struct object_id changes don't actually change the hash. What they do, however, is allow us to remove a lot of the hard-coded instances of 20 and 40 (SHA-1 length in bytes and hex, respectively) in the codebase.
No. We're talking about zoren's hypothetical case where git used a typedef from the beginning, instead of littering char[20]s all over the tree.
My prior comment was explaining why jffry's complaint is nonsensical (a typedef does not prevent moving from a single hash model to a multiple simultaneous hash model).
Hmm, not sure where the disagreement is; that's exactly what I'm saying. Obviously it wasn't done from the beginning, but the code is now changing from the char[20]s everywhere to the typedef precisely to be able to support multiple hash functions.
That's exactly what this change is. You mean why wasn't it that way before the change? Maybe because it wasn't ever needed before? Git's been good with only SHA-1 for 12 years. Think about the flip side of your question... what were the alternatives 12 years ago, or 5 years ago? And why would someone write code for alternatives that aren't expected to be used and maybe don't exist?
In my experience, generalizing ahead of need more often than not causes problems, and I've watched over-engineering result in far more effort to fix when the need it was anticipating does arrive than just waiting until the need is there.
> Think about the flip side of your question... what were the alternatives 12 years ago, or 5 years ago?
SHA-2 and RIPEMD.
> And why would someone write code for alternatives that aren't expected to be used and maybe don't exist?
That's the problem: the software industry is still suffering from MD5 getting cracked [0]! Cryptographic agility is a baseline requirement for security primitives.
> In my experience, generalizing ahead of need more often than not causes problems
I agree, and Linus has valid complaints about security recommendations made over the 25-year history of Linux: most of them kill performance and are only partial fixes, so why bother?
But Linus is also engaging in premature optimization: computers are ~30 billion times faster than when he first started programming Linux. Yes, SHA-2 is relatively slow, but they could have at least not hardcoded SHA-1 into the codebase and protocol.
> I've watched over-engineering result in far more effort to fix when the need it was anticipating does arrive than just waiting until the need is there.
You clearly haven't done any safety-related engineering. That's the thing about cryptography: millions of dollars and human lives are at stake. Despite the smartest people in the world working on these problems, cryptographic primitives/protocols are regularly broken. Due to quantum computing, every common cryptographic primitive we use today will need to be replaced or upgraded at some point.
Thankfully, you don't need to worry about the engineering of a given cryptographic primitive as long as you can swap it out with a new one. But when you hardcode a specific hash function and length into your protocol/codebase you are now assuming the role of a cryptographer.
Even totally ignoring that SHA2 was a thing, anybody looking around would have noticed that MD4 was broken, MD5 was broken, and it would be unlikely that the hash of today would stand forever.
Yes, true. Correct. That is still true, and applies to SHA-2 as well. And Linus was aware of exactly what you say, back in 2005.
My point was that the choice that was made was considered good enough for the purposes for which it was intended. In the context of the OP's comment, criticizing git for not making different code design choices doesn't mean that Linus was wrong, it may mean that the OP doesn't know and/or understand all the considerations Linus had. And Linus has said many times that the security of the hash is not the primary consideration in his design.
Git's choice of SHA-1 was not at the time predicated on having the single most unbreakable hash in existence, the hash's use is not for security purposes, and to talk with such incredulity about Linus' choice may be to misunderstand git's design requirements.
Well, this whole mess proves that Linus was wrong. Typing "unsigned char [20]" everywhere is beyond amateurish to me in any case and raises a concern about the overall quality of code in git and linux kernel.
It shows that Linus is not a cryptographer, to be more precise. Though yes SHA-1 chosen prefix attacks are still very expensive at this point. I wonder how many non-cryptographers knew about SHA-2 back in 2003-2004.
No, it's more than that. "unsigned char[20]" already has at least three potential points of failure (and why isn't it uint8_t anyway?). Moreover, it'll be passed around as unsigned char*, which opens another can of worms. And, oh, by the way, have fun searching for all references to sha1 in your source code now that you weren't pro enough to create a type for your object ids / hashes.
I'm guessing it's part lack of skill in design, part bad software development tools (uEMACS and makefiles or something), and part just being against c++ et al.
Linus regularly treats security as a second-class citizen and is famous for his outrageous harassment [0]:
> Of course, I'd also suggest that whoever was the genius who thought it was a good idea to read things ONE FCKING BYTE AT A TIME with system calls for each byte should be retroactively aborted. Who the fck does idiotic things like that? How did they noty die as babies, considering that they were likely too stupid to find a tit to suck on?
He deserves to eat this shit sandwich.
> I wonder how many non-cryptographers knew about SHA-2 back in 2003-2004.
Any systems engineer should have known about SHA-2. SHA-1 only provides 80 bits of collision resistance, so everyone else assumed that it would need to be replaced.
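(The 80-bit figure is just the generic birthday bound: a brute-force collision against an n-bit hash costs roughly 2^(n/2) work, before any structural weakness is even found.)

    n = 160 (SHA-1)   -> 2^(160/2) = 2^80   collision work
    n = 256 (SHA-256) -> 2^(256/2) = 2^128  collision work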
Linus has explained why he picked SHA-1. I'm not Linus, and I'm not defending his choice, but he has said repeatedly that git's hash is primarily for indexing and error correction, and not primarily for security. Clearly he felt like SHA-1 was "good enough". And if you have something that's "good enough" there are reasons not to write code for alternatives you're not going to use.
>but he has said repeatedly that git's hash is primarily for indexing and error correction, and not primarily for security
And he was wrong as openpgp signatures on commits and tags are a thing.
Not sure when that feature was introduced, however; I doubt it existed in the first version of git. That being said, he should have changed the hash function the moment that feature was introduced.
The first attack on SHA-1 was published in 2003. Git showed up in 2005. Not only should git have allowed for something else, it never should have used SHA-1 in the first place.
Backwards compat requires that both old and new hashes work at the same time. A simple typedef is unlikely to handle all the semantics and space needed for such a change...
It is often hard to generalize when N=1. Now that the N=1 use case is established and we are moving towards N=2, it is painfully obvious to all that a better abstraction is needed.
Typedef or no, we would still need a full audit of the code to find spots where people "inlined" the expansion.
IMO, Linus should have done better here -- no crypto hash lasts forever, but this code is far cleaner than useless layers of abstraction.
I read those comments more than a decade ago. They seemed weak but tolerable then. They seem broken now. Git is supposed to guarantee that the code I see is the code the author saw, in a distributed and decentralized environment. This is Git's entire reason for existing.
A secure design is essential for trusting this functionality. My trust in Git has always been tempered by the weakness of SHA1.
A GPG signature is no stronger than its object ref.
Have you seen how many frameworks believe "auto-pull and compile deps by hash from github" is reasonable? They are assuming this isn't a massive attack vector. They are trying to build on a core feature that Git claims to have.
Recent events moved this from probably foolish to provably so.
That's what comes to mind every time someone brings up Linus' comments from way back when. If SHA-1 is insecure, then there is no way to have security. Forge an object, and GPG sign its commit, and you have broken the apparent security GPG signing was meant to bring. If SHA-1 was not meant for security, then security must have been a non-goal of Git.
The comments are usually brought up to explain why Linus didn't think much of it at the time, whereas they actually demonstrate the shift in thinking around what Git is meant to provide. Security is definitely a goal now, and the hash function is the critical piece of security infrastructure.
GPG signatures actually sign the hash digest of the text they're given. Fun fact, which I think (hope) has changed in recent versions of GPG: that hash, by default, is (was?) SHA-1.
Interesting, I didn't know! Although it makes a lot of sense now that you bring it up.
I don't think it changes anything though, because of git's integrity. Stop me if I'm getting this wrong but, if you wanted to attack a signed git commit through the gpg signature's hash, you would have to modify the commit object itself... which changes the commit hash the object must carry to remain valid. You'd have to get absurdly lucky to have a signature collision that contains a (valid) commit hash collision.
When you GPG-sign a git tag, what GPG signs is text of the form:

object $sha1
type commit
tag $name
tagger $user $timestamp $tz

$text

If you wanted to attack a signed git tag through the gpg signature's hash, you would have to do a second preimage attack on that text with a different commit sha1.

OTOH, if you wanted to attack a signed git commit through the git commit sha1, you would have to do a second preimage attack on the commit text, which is of the form:

tree $sha1
parent $sha1
author $user $timestamp $tz
committer $user $timestamp $tz

$text

See where I'm going? It's the same kind of attack.
Another way to attack it would be to do a second pre-image attack on the pointed tree, which is harder because there is not really free-form text available in a tree object.
Yet another way to attack it would be to do a second pre-image attack on one of the blobs pointed to by a tree, where the format is of the form:
blob $length\0$content
I don't think this is significantly easier than any of the second pre-image attacks mentioned above.
So, in fact, in any case, to attack a gpg signed git tag, you need a second pre-image attack on the hash. If git uses something better than SHA-1, but GPG still uses SHA-1, the weakest link becomes, ironically, GPG.
That being said, second pre-image attacks are pretty much impractical for most hashes at the moment, even older ones that have been deemed broken for many years (like MD5 or even MD4 (TTBOMK)).
That is, even if git were using MD4, you couldn't replace an existing commit, tree or blob with something that has the same MD4.
Edit:
In fact, here's a challenge:
Let's assume that git can use any kind of hash instead of SHA1. Let's assume I have a repository with a single commit with a single tree that contains a single source file.
The challenge is this: attack the hypothetical repository using the hash of your choosing[1] ; replace that source with something that is valid C because people using the content of the repository will be compiling the source. Obviously, you'll need the hash to match for "blob $length\0$content" where $length is the length of $content, in bytes, and $content is your replacement C source code.
But for the commit it's different, because the $text in your example affects the hash of the commit itself. And my understanding is that if you sign the commit, you're signing both the contents and the hash of the content. Am I incorrect?
Yes, Linus wrote that SHA1 isn't here for security, but that was a glaring misunderstanding of security on his part. Integrity protection of source code is a security function.
I think it's mainly due to a different threat model. Linus only pulls from his trusted lieutenants, who are unlikely to try to attack the source in that way (it's way easier to simply hide a bad commit in the lot, no need to fiddle with SHA1). They do the same.
The rest of the code is sent through mailing lists as patches, so the hash is irrelevant.
SHA1 here protects against "random" corruption (which is more than some VCS do), but not against an attacker. At no point is anyone in a position to send trusted contributors bad commit objects.
Now, the use people have of git is very different from the kernel (or git) style, so their threat model is different, and SHA1 may become a security function.