The key is that essentially all of the data for both images are in both PDFs, so the PDFs are almost identical except for a ~128 byte block that "selects" the image and provides the necessary bytes to cause a collision.
I didn't get a chance to make this point in that other thread, because the thread [1] of its follow-ups quickly morphed from promising [2] to meandering. The combination of lax formats (PDF and JPEG in this instance) makes this style of collision particularly reductive, and in a sense a cheap shot, though still devastatingly damaging given the ubiquity of PDF and JPEG, both separately and together, in document storage and archival.
This shows the importance of techniques like canonicalization and determinism, which ensure that a given result could only have been arrived at from exactly one input. For general-purpose programming languages like PostScript, from which PDF is derived, this is essentially an unfulfillable requirement: any number of input "source codes" can produce observationally identical results. Constrained formats, and formats where the set of 'essential information' can be canonicalized into a particular representation, should be the norm rather than the exotic exception, especially in situations where minute inessential differences can cascade to drastically alter the result.
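A minimal sketch of the canonicalize-then-hash idea in Python, using a toy record type (the names here are illustrative, nothing PDF-specific):

    import hashlib
    import json

    def canonical_digest(record: dict) -> str:
        # Serialize the 'essential information' deterministically: sorted
        # keys, no incidental whitespace, explicit UTF-8. Equal content
        # always yields equal bytes, and therefore an equal hash.
        canonical = json.dumps(record, sort_keys=True,
                               separators=(",", ":"),
                               ensure_ascii=False).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

    # The same logical content written two different ways hashes identically.
    assert canonical_digest({"title": "report", "pages": 3}) == \
           canonical_digest({"pages": 3, "title": "report"})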
> Constrained formats, and formats where the set of 'essential information' can be canonicalized into a particular representation, should be the norm rather than the exotic exception, especially in situations where minute inessential differences can cascade to drastically alter the result.
That might be very challenging in practice, because a more expressive language directly allows a more compressed/efficient encoding of the same information, but at the cost of making it more difficult (or impossible) to create a canonical representation.
Also, data formats that are purposefully redundant for error tolerance all basically have the property that readers should be tolerant of non-canonical forms. If we want to represent some bytes redundantly in case of data loss, then there must be multiple representations of those bytes that are all acceptable to the reader for this to work.
Video and image formats use multiple encodings to give encoders the room to make time-space trade-offs.
I agree: for anything that a human is supposed to see with their eyes, there are always different representations that look "the same" enough to be indistinguishable.
People should be aware of that, not believe in a nonexistent world where it isn't so.
Add ELF [1] and Zip [2] to the list. Many common file formats have areas where you can insert an arbitrary chunk of data without significant side effects.
[1] ELF allows for a very flexible layout, and is almost certainly vulnerable to this length-extension-based attack.
[2] Zip allows a comment at the end of the central directory. Since the central directory is at the end of the file, I don't know if it's vulnerable to this exact attack.
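For the Zip case, that trailing comment is even exposed by Python's standard library; a quick sketch (the file name is just a placeholder):

    import zipfile

    # Build a small archive and attach arbitrary bytes as the
    # end-of-central-directory comment; compliant readers ignore it.
    with zipfile.ZipFile("archive.zip", "w") as zf:
        zf.writestr("hello.txt", "hello")
        zf.comment = b"up to 65535 bytes of arbitrary trailing data"

    assert zipfile.ZipFile("archive.zip").comment.startswith(b"up to")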
> This is an identical-prefix collision attack, where a given prefix P is extended with two distinct near-collision block pairs such that they collide for any suffix S
The near-collision block pairs are the difference everyone can see in the image. Whoever created the PDFs did everyone the courtesy of keeping the suffix the same. There are numerous examples already of different PDFs with the same hash.
Excellent! This ought to be the top comment; have you submitted it separately?
I was considering implementing the same thing, since there wasn't any reason for those limitations. The Google researchers purposely designed this collision to be highly reusable, probably to encourage everyone to generate lots of fun examples which will be spread widely and break systems everywhere :)
I'm surprised that some PDF renderers have problems with JPEG streams that contain comments and restarts. Actually, glancing at the JPEG spec, I didn't even realise that restarts would be needed; I thought this was done with comments alone. Is this really "bending" the spec, or is Ghostscript buggy, or is the problem that it assumes restarts don't occur inside comments without escaping?
I did submit it, but I don't think anyone saw it: https://news.ycombinator.com/item?id=13729920. I was trying to finish it in time to catch daytime folks but work got in the way :)
Comments are limited to just under 64 KB each, and the JPEG spec doesn't offer any way to break an image stream up except for progressive mode or restart intervals (otherwise, the image stream must be a single intact entity). The trouble is that it's probably not _technically_ legal to stick comments before restart interval markers (putting comments _after_ the markers just breaks all the readers, since presumably they are expecting image data). So Ghostscript's JPEG reader gets confused when it fails to see a restart marker immediately following an image data chunk, and aborts decoding.
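For reference, the COM segment framing is simple enough to write out, and it also shows where that per-comment size ceiling comes from (a sketch of the framing only, not the collision construction):

    def jpeg_comment_segment(payload: bytes) -> bytes:
        # COM marker (0xFF 0xFE) followed by a 16-bit big-endian length
        # that counts the two length bytes plus the payload, which caps
        # each comment's payload at 65533 bytes.
        if len(payload) > 0xFFFF - 2:
            raise ValueError("COM payload limited to 65533 bytes")
        return b"\xff\xfe" + (len(payload) + 2).to_bytes(2, "big") + payload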
Wow, it works! I thought this was supposed to require Google-level resources and months of processing time. Did the initial collision somehow enable more similar ones?
Yeah, I think that's pretty much the case. The first 320 bytes of the two PDFs released by Google result in the same SHA-1 state. Once you're at that point, as long as you append identical data to each of the files, you're going to get identical hashes. This is just taking those same 320 bytes and appending the combined images of your choice.
edit: as versteegen points out it's 320 bytes, not 304.
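You can check that claim directly with Python's hashlib, assuming shattered-1.pdf and shattered-2.pdf from shattered.io are in the current directory:

    import hashlib

    p1 = open("shattered-1.pdf", "rb").read(320)
    p2 = open("shattered-2.pdf", "rb").read(320)

    assert p1 != p2  # the 320-byte prefixes differ...
    assert hashlib.sha1(p1).digest() == hashlib.sha1(p2).digest()  # ...yet collide

    # The internal SHA-1 states already match after these five 64-byte
    # blocks, so any identical suffix keeps the collision alive.
    suffix = b"any identical suffix you like"
    assert hashlib.sha1(p1 + suffix).digest() == hashlib.sha1(p2 + suffix).digest()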
I learned a lot from it. One thing is that this property holds for any Merkle-Damgård-type hash whose internal state is the same size as its digest. That is true of SHA-1 and of several other famous and widely-used hashes, but not of every hash; some of the most recent designs, like several SHA-3 candidates and SHA-3 itself, don't have it. In a hash without this property, you can have a collision H(X)=H(Y) (with len(X)=len(Y)) yet typically H(X+a)≠H(Y+a).
Edit: len(X)=len(Y) is also necessary because Merkle-Damgård hashes encode the message length into internal padding, so two colliding inputs of different lengths will generally not produce a collision when the same string is appended to each.
This is really good to be aware of, even if there were no collisions. I could imagine someone making, for example, a signed cookie scheme that is value,SHA1(secret,value). Someone could then change it to value+foo,SHA1(secret,value+foo) without knowing the secret (strictly, the appended foo has to begin with the padding of the original secret-plus-value message), and it would verify as a valid signed cookie.
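A sketch of the two constructions side by side (the secret here is made up; HMAC is the standard fix mentioned below):

    import hashlib
    import hmac

    SECRET = b"server-side secret"  # hypothetical key, for illustration only

    def sign_naive(value: bytes) -> str:
        # The scheme described above: SHA1(secret || value). Because SHA-1
        # is Merkle-Damgard, an attacker who knows this tag can compute a
        # valid tag for value + padding + extension without the secret.
        return hashlib.sha1(SECRET + value).hexdigest()

    def sign_hmac(value: bytes) -> str:
        # HMAC-SHA1 is not subject to length extension (though moving off
        # SHA-1 altogether is still the sensible choice).
        return hmac.new(SECRET, value, hashlib.sha1).hexdigest()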
People sometimes overstate the impact of length extension attacks. If your format has a length prefix (really common) then you may well be "vulnerable" in the sense that appending arbitrary data is "valid", but a canonical form without the appended data is trivial to construct; and indeed most software would likely completely ignore that extra data.
HMAC is a neat trick to avoid length extension attacks (and other issues) in a generalized fashion, but that doesn't mean those risks actually apply in practice. (Some googling finds e.g. this paper: https://www.iacr.org/archive/fse2009/56650374/56650374.pdf which proposes an attack on length-and-key prefixed messages, using some sha1 weaknesses and merely over 2^84 memory and 2^154 queries; color me impressed, but not scared.) Edit: just to be clear, I'm not suggesting anyone actually use LPMAC-sha1 given the current state of sha1.
For another example: in general it's unsafe to truncate a "secure" hash. Hashes can be constructed that satisfy most security requirements yet are not safe when truncated (e.g. sha3 prepended by zeros is still safe, but obviously not if you truncate the sha3-provided bits off). But I don't know of any mainstream hash where this theoretical risk actually applies (e.g. no Merkle-Damgård hash suffers from such a risk); nobody intentionally constructs hashes with more bits than entropy.
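The "sha3 prepended by zeros" construction from that parenthetical, spelled out as a toy sketch:

    import hashlib

    def padded_sha3(msg: bytes) -> bytes:
        # A contrived 48-byte hash: 16 constant bytes then SHA3-256.
        # Collision resistance is inherited from SHA3-256, yet truncating
        # the output to its first 16 bytes keeps only the constant.
        return b"\x00" * 16 + hashlib.sha3_256(msg).digest()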
It's probably still wise to stick with known-good constructions, but the risks seem overstated, and the difficulty is also overstated - assuming the primitives used aren't too flawed. Sure, it's cool that HMAC can use even flawed things like MD5 and retain safety, but typically nobody is forcing you to stick with md5. I guess the world is more complicated if you need to pick a protocol and then you're unable to change it, but most applications can (with some effort) be changed. You need something safe now, not for all eternity.
So, I think the rule is simpler: this has little to do with crypto per se; just don't be unnecessarily clever, in general. Crypto makes the consequences particularly nasty, often. But that's about it.
This was a good meme that served its function well when it was needed: early enthusiasm for reusable cryptographic primitives and a failure to recognise the foot-shooting potential led to many easily broken schemes.
Now, however, "don't roll your own crypto" is dogma, and if anything we have the opposite problem of monoculture and slow progress. I think a more nuanced view is required, one that encourages experimentation when the stakes are low and more competing implementations when the stakes are high (or perhaps we should call them "complementing" - a standard ought to have multiple implementations).
As Wikipedia puts it, "Mathematical analysis of [security] protocols is, at the time of this writing, not mature... Protocol design is an art requiring deep knowledge and much practice; even then mistakes are common." How are programmers to practice, if they are not allowed to fail?
That's not quite what he's saying. He's saying it uses the same research, but that you can't reuse a hash collision like that, since the contents of the two images you give it are unknown beforehand.
You can reuse a collision by appending identical data to both pieces of colliding data; the result still collides. The collision Google found exists in the PDF before the JPEGs, and both JPEGs are appended after the collision. The difference in the earlier data is in the instructions that say whether to display the first or the second JPEG.
You can reuse the PDF from Google because it's not particularly reliant on the contents of the images. You can replace the images that Google used with your own (indeed, that's what the service does) and it will output two files that themselves have the same hash (though that hash will be different from the hash of the originally provided PDFs).
Interesting. I need to read more into this because I didn't think that kind of thing was possible. I thought you'd have to recalculate it from scratch because of the two images being different but apparently I didn't understand how this worked.
If an attacker can hash a block that means "goto showImage1" and a block that means "goto showImage2" to the same hash, then you can see that the contents of those images don't matter, as long as the data for those images occurs after the logic for choosing between them.
(And no, that's not literally what this attack does; it's just a logical framework for understanding why the images don't matter.)
Is this the first hash function that went from "secure" to collision-as-a-service in a matter of days? Was SHA-1 particularly weak, or the published research particularly strong? Or was it something else?
Absolutely, but knowing something smells a little off vs. having a documented collision feels like the point where we go from "No longer advised for use" to "It's broken".
Usually it's when we hit "It's broken" that the average Joe developer/operator/etc. starts to form a migration plan, and usually they have some time before the attack becomes reasonable to execute outside of pure research or "nation state" attacks. It seems that block of time was just a few days (a day?!) for SHA-1.
And it's that, even granting that PDF is apparently an easy format to abuse, which makes me question what just happened. Was SHA-1 secretly terrible? Was the research just that good? Or was there some other factor that allowed this to happen?
Also, sure, the major browsers have a retirement date for SHA-1 as a TLS signature hash. But SHA-1 is used for a whole pile of other stuff with absolutely no transition plan! January 1st was never going to be the last day people used SHA-1.
As other people explained, and just to summarize, this service is a way of reusing Google's specific collision (as a prefix to new collisions), and isn't a way of making arbitrary colliding files or file formats. You can't, for example, use this to make something other than PDFs that collide; for that, you'll have to redo a computation on the scale of Google's!
I thought I heard that some file formats have "headers" that go at the end of the file. I think a demo of this was a file that was both a zip and a PNG or something. If I remember right, you should be able to pull off a similar hack here.
If the only headers are at the end, then yes, that's a really neat idea. I'm not sure of the constraints for zip files. Maybe it would be interesting to brainstorm with some people with a lot of file format knowledge to find additional such formats. But if the formats have any constraints at all on the first few hundred bytes, those generally won't be satisfied by the prefix here.
Why wouldn't you be able to? Most file formats have ways of including comment data, including JPEG, GIF, HTML, C code, and most others. You could potentially create a piece of benign C code and a piece of malicious C code that collide. (However, creating a piece of malicious code that matches an existing codebase that doesn't contain a comment would be hard, if I understand correctly.)
It's a length extension on the beginning part of the PDF. There are two headers that have the same hash. As long as you append the same suffix to both of those headers, the hash will be the same. In this case the headers happen to contain a switch that selects between one of two images. So the length extension is adding both images to both of the headers. Since the headers have the same hash and the suffix is identical, the overall documents have the same hash. But because of the switch in the header you see two completely different contents.
So basically H1 and H2 have the same SHA-1 hash. By adding the same suffix I1I2 to both, you get H1I1I2 and H2I1I2, which again share a hash. That's the length extension.
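Here's a toy model of that in Python (a deliberately weakened iterated hash with a 16-bit state, nothing like real SHA-1) just to show why matching internal state plus an identical suffix gives matching digests:

    import hashlib
    from itertools import count

    STATE_BYTES = 2  # tiny on purpose, so a state collision is easy to find

    def compress(state: bytes, block: bytes) -> bytes:
        # Toy compression function: truncated SHA-256 of state || block.
        return hashlib.sha256(state + block).digest()[:STATE_BYTES]

    def toy_md_hash(msg: bytes) -> bytes:
        # Toy Merkle-Damgard: iterate compress() over 8-byte blocks, then
        # fold in the message length (a stand-in for real MD padding).
        state = b"\x00" * STATE_BYTES
        for i in range(0, len(msg), 8):
            state = compress(state, msg[i:i + 8].ljust(8, b"\x00"))
        return compress(state, len(msg).to_bytes(8, "big"))

    # Brute-force two different one-block "headers" that reach the same state.
    seen = {}
    for i in count():
        block = i.to_bytes(8, "big")
        s = compress(b"\x00" * STATE_BYTES, block)
        if s in seen:
            h1, h2 = seen[s], block
            break
        seen[s] = block

    # Same internal state, same suffix, same digest: H1+S collides with H2+S.
    suffix = b"both images go here, identical in both files"
    assert h1 != h2
    assert toy_md_hash(h1 + suffix) == toy_md_hash(h2 + suffix)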
If you got two different messages to get into the same internal state with SHA-3, the same thing would apply, wouldn't it? Would that be a length extension attack on SHA-3?
This is really a terminology question. I had a clear understanding of "length extension attacks", but it seems people on this comment page are now using the term differently. I've been looking over crypto.stackexchange and Twitter to see if I missed the memo, but this looks like a new usage.
It's a different usage of the term than the normal case, but it takes advantage of the same vulnerability to length extension. It doesn't necessarily apply to SHA-3 and other sponge functions, because there is a difference there between the internal state of the hash function and the actual hash itself. The internal state is much larger than the hash output for a sponge function like SHA-3, which means you can get two messages that have the same hash without having the same internal state, and therefore appending the same suffix to both would most likely leave you without a hash collision.
> It's a different usage of the term than the normal case
Is it? How? It's a simple case of length extension, just that here, since we have two independent starting points sharing the same state, we start with a collision and we extend to a collision.
In other words, these are two length extensions on independent prefixes. It just happens that these prefixes share the same state/hash, hence the surprising result (at first glance).
Normally when talking about length extension attacks the original plaintext is unknown, but we can compute the hash of the plaintext plus an extension if we know the hash of the plaintext. In this case we know what the plaintext is, and we happen to have two different texts that produce the same hash, which we can extend to generate many collisions. It's the same property but it's a different scenario than what is commonly referred to as length extension.
Both SHA-3 and BLAKE2 are not susceptible to length extension. Keyed BLAKE2 is actually just a prefix MAC (which would be completely insecure with SHA-1/2 and classic Merkle-Damgård).
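A quick sketch with Python's hashlib, which exposes keyed BLAKE2 directly (key and message are placeholders):

    import hashlib

    key = b"shared secret"             # illustration only
    msg = b"message to authenticate"

    # Keyed BLAKE2 is a MAC by design; no HMAC wrapper needed.
    tag_blake2 = hashlib.blake2b(msg, key=key, digest_size=32).hexdigest()

    # SHA-3 tolerates the simple prefix construction too, since sponge
    # finalization blocks length extension (NIST also defines KMAC).
    tag_sha3 = hashlib.sha3_256(key + msg).hexdigest()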
SHA1 has been deprecated for a while. It certainly wasn't considered "secure" on the day the attack was announced. For example CAs have been moving away from SHA1 signatures for a few years.
> identical-prefix collision attack, where a given prefix P is extended with two distinct near-collision block pairs such that they collide for any suffix S
They have already precomputed the prefix (the PDF header) and the blocks (which I'm guessing are the part that tells the PDF reader to show one image or the other), and all you have to do is populate the rest of the suffix with identical data (both images).
Yep. With any hash function that takes input as blocks, if you were to ever get two messages to generate the same internal state, you could add the same arbitrary data to both and get those new messages to have the same hash.
As sp332 and JoachimSchipper mentioned, the novelty here is that it contains specially crafted code in order to conditionally display either picture based on previous data (the diff). I can't grok PDF so I still can't find the condition though. Can PDFs reference byte offsets? This is really clever.
Edit #2: I misunderstood the original Google attack. This is just an extension of it.
Yes, it's a length extension. Both images are in both outputs. Both outputs also contain a conditional switch to choose which image to show based on the previous data, where the collision lives.
It seems so. I can add the same arbitrary data at the end of two pdfs generated by this tool, and they are still a collision. I didn't know SHA-1 is so susceptible to length extension. Is there no internal state in the algorithm that would be different even if the hash output is identical?
If you were to somehow get two messages that put SHA-3 into the same internal state, you could keep on appending the same data to both and they would keep colliding. But SHA-3 is explicitly not vulnerable to length extension attacks.
The length extension attack leverages the weakness that people think HASH(secret + message) is a signature only they can create as long as only they know "secret".
I don't understand the whole collision thing. I mean, SHA-1 is 160 bits, so if you are hashing information longer than that, collisions are a fact of life; being able to forge a piece of information under constraints is the challenge, and even then, with enough power you end up being able to try all the combinations. What I understand from the reported collision is that they use the PDF format, which can have as much data inserted into it as comments/junk as you want, so all you need is enough processing power to find the right padding/junk to insert to get the collision. Am I missing something here?
If by "secured by SHA1" you mean "someone generated a hash using SHA-1 and we use the validity of that hash to guarantee we have the right document," that's still okay. We're a long way from being able to make documents with a given SHA-1.
(Edit: Any newly signed documents, or documents signed recently, are not safe, because a nasty person could have made two, one to get signed by the system, another to do evil with.)
SHA-1 is officially deprecated for certificates, because of the example that OP shows. You can create two certificates, have the decent one get signed by a CA, and then use the evil one to intercept traffic.
Thanks for the info. Good point, I suppose anyone relying on SHA1 in 2017 has had ample warning about its weaknesses.
It seems that there is also a very strong incentive for anyone receiving anything whose authenticity is verified by SHA1 to request an improved hashing algorithm.
Practical question: does this generate a PDF that is "harmful" to a repo system like SVN, i.e., is the flaw in the hashing enough to crash or corrupt such a system?
The OP_2DUP OP_EQUAL OP_NOT OP_VERIFY OP_SHA1 OP_SWAP OP_SHA1 OP_EQUAL script was making sure that only a person who finds two distinct SHA-1-colliding inputs and publishes them can claim the 2.5 BTC bounty.
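Restated in Python terms (a sketch of the condition only, not the byte-for-byte script semantics):

    import hashlib

    def script_would_pay(x: bytes, y: bytes) -> bool:
        # The redeemer must reveal two preimages that differ yet share a
        # SHA-1 digest, which is exactly what the shattered files provide.
        return x != y and hashlib.sha1(x).digest() == hashlib.sha1(y).digest()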
I just tested Backblaze and found that its deduplication is broken by this. If you let it back up the two files generated by this tool, then restore them, it gives you back two identical files instead of two different ones.
I wanted to try this tool, but the upload requirements were so stringent (must be jpeg, must be same aspect ratio, must be less than 64KB) that I gave up. Would be nice if sample jpegs were provided on the page.
(not OP) I understand why the requirements are strict, so I can see why s/he was being downvoted, but I do agree that samples would be nice. Do I really have to go hunt for images meeting these requirements just to see whether it works (given that this was supposed to take $100k)? Someone has since mentioned in the comments that it works, so I guess I'll trust them.
$ sips -s format jpeg -Z 500 screenshot.png --out bar.jpg
Generating some workable source images is trivial. You're not interested in making a 'perfect hack pdf' for some nefarious purpose, just seeing if the service does what it says it does.
> Generating some workable source images is trivial
Everything is trivial if you know how to do it off the top of your head. Generating white noise samples is trivial. UV-mapping a texture to a mesh is trivial. Soldering resistors to a PCB is trivial. Generating an ARM toolchain with Yocto Project is trivial.
Is it a crime that I didn't feel like researching a bunch of CLI tools I've never heard of to try to use an app I was only mildly curious about?
This overstates the difficulty of creating a JPEG. If you have an image editor on hand that isn't MS Paint (or you're using macOS, which has Preview), you merely need to export a JPEG and choose a low quality setting.
For a Windows user with no decent image editor or viewer installed, though, it could certainly be a hassle.
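As an alternative to an image editor, if Python with Pillow happens to be installed, two small same-aspect-ratio JPEGs take a few lines (sizes and colours here are arbitrary):

    from PIL import Image

    # Two 4:3 images, comfortably under the tool's 64 KB limit at this size.
    Image.new("RGB", (400, 300), (200, 60, 60)).save("a.jpg", "JPEG", quality=50)
    Image.new("RGB", (400, 300), (60, 60, 200)).save("b.jpg", "JPEG", quality=50)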
None of the things I listed are difficult. They just require the right tools and know-how. For someone like me on Xubuntu, I had neither the tools nor the know-how for creating small JPEGs. I didn't feel like wasting 30 minutes researching, and I didn't feel like walking over to a different computer that has MS Paint.
Most people wouldn't have such small JPGs lying around on their computer, and it's even less likely that they have two of them with the same aspect ratio. So it's likely that they'll have to go and create them, which is quite a hassle. I know because I went to the site and, after mucking around in some of my directories, gave up.
The key is that essentially all of the data for both images are in both PDFs, so the PDFs are almost identical except for a ~128 byte block that "selects" the image and provides the necessary bytes to cause a collision.
Here's a diff of the 2 PDFs from when I tried it earlier: https://imgur.com/a/8O58Q
Not to say that there isn't still something exploitable here, but I don't think it means that you can just create collisions from arbitrary PDFs.
edit: Here's a diff of shattered-1.pdf released by Google vs. one of the PDFs from this tool. The first ~550 bytes are identical.
https://imgur.com/a/vVrrQ