Oooh, interesting. How does this work?

akerl_ · on April 4, 2021

Generally, you can consider this an application of steganography, so searching for that should turn up a number of references showing how it can be done. Most file formats have places where arbitrary data can be inserted without affecting the user’s interaction with the file (for example, tweaking the least significant bits of the color values in a losslessly stored image such as a PNG, or of the quantized DCT coefficients in a JPEG, since plain pixel tweaks would not survive JPEG's lossy compression).
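To make the least-significant-bit idea concrete, here is a minimal sketch in Python. It assumes the cover data is a raw, uncompressed run of color bytes (as you would get from a decoded PNG or BMP); the function names `embed_lsb` and `extract_lsb` are made up for illustration and not from any library.

```python
def embed_lsb(pixels: bytes, message: bytes) -> bytes:
    """Hide `message` in the lowest bit of successive bytes of `pixels`."""
    # Flatten the message into bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover data too small for message")
    out = bytearray(pixels)
    for index, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the message bit.
        out[index] = (out[index] & 0xFE) | bit
    return bytes(out)


def extract_lsb(pixels: bytes, length: int) -> bytes:
    """Read back `length` bytes that embed_lsb stored in `pixels`."""
    out = bytearray()
    for i in range(length):
        value = 0
        for byte in pixels[i * 8:(i + 1) * 8]:
            value = (value << 1) | (byte & 1)
        out.append(value)
    return bytes(out)


# Hypothetical cover data standing in for decoded pixel values.
cover = bytes(range(200, 256)) * 2
stego = embed_lsb(cover, b"exikyut")
recovered = extract_lsb(stego, 7)
```

Each cover byte changes by at most 1, which is why the tweak is invisible to the eye. It is also why this naive version is fragile: any re-encoding that touches the low bits destroys the message.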

While in the general case, steganography is discussed in the context of passing secret messages from A to B, watermarking uses steganography to hide a unique identifier that the document’s author can use later to identify whose copy of the document they’re looking at. So in a very basic scenario, the vendor might just shove “exikyut” into a non-visible object on the document, and then when they find the document published online, they check that object and see the username of the person who leaked it. Obviously in the real world, watermarking attempts to obscure the unique identifier’s placement and contents, such that it cannot be easily identified/removed, and so that only the vendor can match a document to its originating user.
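As a rough illustration of "only the vendor can match a document to its originating user": instead of embedding the username itself, a vendor could embed a keyed hash of it. This is only a sketch of one possible approach, not a description of what any real vendor does; `watermark_tag`, `identify`, and the key are hypothetical.

```python
import hashlib
import hmac


def watermark_tag(secret_key: bytes, username: str, size: int = 8) -> bytes:
    """Derive an opaque per-user tag; meaningless without the secret key."""
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256).digest()
    return digest[:size]


def identify(secret_key: bytes, tag: bytes, usernames: list) -> "str | None":
    """Vendor side: recompute each customer's tag and look for a match."""
    for name in usernames:
        candidate = watermark_tag(secret_key, name, len(tag))
        if hmac.compare_digest(candidate, tag):
            return name
    return None


# Hypothetical key and customer list, for illustration only.
key = b"vendor-only-secret"
customers = ["akerl_", "exikyut", "rdpintqogeogsaa"]
tag = watermark_tag(key, "exikyut")
leaker = identify(key, tag, customers)
```

Someone who finds the tag in a leaked copy sees eight opaque bytes. Without the key they cannot tell whose copy it is, although they can of course still try to strip it.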

exikyut · on April 13, 2021

Really late reply, so you'll probably never see this, but thanks for taking the time to explain.

I was actually wondering a) what, exactly, the ISO was specifically doing, which is kind of a stupid question :) and b) how to, uhh, "un" the "exactly", I'll word it that way.

Hiding things in invisible objects would be fairly easy to detect. I was wondering if maybe the document might for example embed two spaces every X characters by way of identifier, or use a seeded RNG to pick from multiple visually-identical layout methodologies, or even maybe reencode the images with a uniquely-seeded JPEG scan script, oh oh or maybe adjust individual control points and Bezier curves in the glyph tables, or...
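The double-space idea is easy to sketch. The following Python is a toy under the assumption that the source text only ever has single spaces between words; `embed_spaces` and `extract_spaces` are invented names, and nothing here is known to match what any publisher actually does.

```python
import re


def embed_spaces(text: str, bits: list) -> str:
    """Encode each bit as one space (0) or two spaces (1) between words."""
    words = text.split(" ")
    if len(bits) > len(words) - 1:
        raise ValueError("not enough word gaps for the identifier")
    parts = [words[0]]
    for index, word in enumerate(words[1:]):
        bit = bits[index] if index < len(bits) else 0
        parts.append(" " * (1 + bit))
        parts.append(word)
    return "".join(parts)


def extract_spaces(text: str, count: int) -> list:
    """Recover the first `count` bits from the widths of the word gaps."""
    gaps = re.findall(r" +", text)
    return [1 if len(gap) > 1 else 0 for gap in gaps[:count]]


original = "the quick brown fox jumps over the lazy dog"
identifier = [1, 0, 1, 1, 0]
marked = embed_spaces(original, identifier)
recovered_bits = extract_spaces(marked, len(identifier))

# Normalizing whitespace removes this particular mark entirely.
scrubbed = " ".join(marked.split())
```

The last line is the "un" part for this one scheme: collapsing whitespace wipes it out, which is presumably why a serious scheme would not rely on it alone.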

I'd probably just do an outline-to-shape or similar type of pass on it. But then I'd start wondering about the statistical probability of recovering glyph offset micro-adjustments, or hinting settings... eep.

Okay, import the whole PDF into a layout engine then re-export it. Hmm, what if... oh you know they might be reordering the paragraphs in the text... hmm, with 100 discrete text permutations, you could tell 10,000 output documents apart if all 100 permutations were left undisturbed and were recoverable. That's... quite a lot of work. They're probably not doing that.
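One way to sanity-check the arithmetic, assuming each marking site is an independent choice: n sites with k variants each can distinguish k^n copies. Under that reading, two sites with 100 variants each give 10,000, while 100 sites with only two variants each would give 2^100. The helper names below are made up for this sketch.

```python
def distinguishable_copies(sites: int, variants_per_site: int) -> int:
    """Number of distinct documents if every site is chosen independently."""
    return variants_per_site ** sites


def sites_needed(copies: int, variants_per_site: int) -> int:
    """Smallest number of sites whose combinations cover `copies` documents."""
    sites, capacity = 0, 1
    while capacity < copies:
        capacity *= variants_per_site
        sites += 1
    return sites


two_big_sites = distinguishable_copies(2, 100)        # 100 * 100
many_binary_sites = distinguishable_copies(100, 2)    # 2 ** 100
binary_sites_for_10k = sites_needed(10_000, 2)
```

So even crude yes/no tweaks add up fast: about 14 of them are enough to tell 10,000 customers apart, and the spare capacity can be spent on redundancy so that the identifier survives if some sites get scrubbed.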

Or are they?

rdpintqogeogsaa · on April 17, 2021

I know there's a little personal identifier printed vertically on the bottom left of every page. Whether there are additional, steganographic watermarks, I don't know. It'd be interesting, but that'd require at least two people to throw enough money at ISO to get a diff between the PDF outputs.
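If two copies were available, a first pass at that diff could be as simple as listing the byte ranges where they disagree. A minimal sketch using Python's standard `difflib`; the two byte strings are stand-ins, not real PDF content, and `differing_ranges` is a hypothetical helper. Real PDFs usually compress their streams and carry per-file IDs and timestamps, so in practice you would likely need to decompress first and expect some harmless differences.

```python
from difflib import SequenceMatcher


def differing_ranges(a: bytes, b: bytes) -> list:
    """Return (a_start, a_end, b_start, b_end) for every non-matching region."""
    # autojunk=False keeps difflib from discarding frequent bytes as "junk".
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (i1, i2, j1, j2)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Stand-in data: two "copies" that differ only in an embedded identifier.
copy_a = b"%PDF-1.7 body id=AAAA end"
copy_b = b"%PDF-1.7 body id=BBBB end"
ranges = differing_ranges(copy_a, copy_b)
```

`SequenceMatcher` is slow on large inputs, so for full-size documents a chunked comparison would be more practical. The point is only that any per-customer mark has to show up somewhere in such a diff.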