I've done the same thing before on a book in college, except my DRM wasn't nice enough to let me print 10 pages at a go. Instead, I spun up a virtual X server with Xdummy with a resolution of 1200x10000. That showed a few dozen pages at a time. Then I automated screenshots (scrot) and PageDown (xdotool). Finally, some PIL magic to look for the thin gray line between pages plus convert and ghostscript and I had a PDF!
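A minimal sketch of what the page-splitting step could look like, not necessarily what the commenter ran. It assumes the tall screenshot has already been converted to grayscale and loaded as a list of pixel rows (Pillow could supply that), and that the separator is a full-width line of roughly one gray value. The names `is_separator` and `split_pages` and the values `gray=128`, `tolerance=12` are made up for illustration.

```python
def is_separator(row, gray=128, tolerance=12):
    """True if every pixel in the row is close to the assumed separator gray."""
    return all(abs(pixel - gray) <= tolerance for pixel in row)


def split_pages(rows, gray=128, tolerance=12):
    """Split a tall grayscale image (list of pixel rows) into pages.

    Rows that look like the separator line are dropped; the runs of rows
    between them are returned as separate pages.
    """
    pages, current = [], []
    for row in rows:
        if is_separator(row, gray, tolerance):
            if current:          # close the page that just ended
                pages.append(current)
                current = []
        else:
            current.append(row)
    if current:                  # last page has no trailing separator
        pages.append(current)
    return pages


# Tiny synthetic "screenshot": 3 white rows, a gray line, 2 white rows.
white, line = [255] * 4, [128] * 4
screenshot = [white] * 3 + [line] + [white] * 2
pages = split_pages(screenshot)
print([len(p) for p in pages])  # [3, 2]
```

A real page could contain a full-width gray row of its own (a table rule, say), so in practice you would probably also check that the pieces come out at a plausible page height.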
Watermarks and such are easy to remove with PDFtk, the Swiss Army knife of PDF files. Convert the input files to a plain-text representation, find the code that implements the watermark (it will be the only code that's identical on every page), delete it, and convert back. Easy as pie. PDFtk will also concatenate the partial files.
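A rough sketch of that workflow, under some assumptions. The three command builders use pdftk's `uncompress`, `compress` and `cat` operations; the helpers `shared_chunks` and `strip_shared` are hypothetical and work on an already-parsed list of content chunks per page, since pulling those chunks out of the uncompressed file is left to a text editor or your own parser.

```python
def pdftk_uncompress_cmd(src, dst):
    """argv for turning a PDF into an editable, uncompressed form."""
    return ["pdftk", src, "output", dst, "uncompress"]


def pdftk_compress_cmd(src, dst):
    """argv for recompressing the edited file."""
    return ["pdftk", src, "output", dst, "compress"]


def pdftk_cat_cmd(parts, dst):
    """argv for concatenating partial files in the given order."""
    return ["pdftk", *parts, "cat", "output", dst]


def shared_chunks(pages):
    """Return the chunks that appear on every page (watermark candidates)."""
    if not pages:
        return set()
    common = set(pages[0])
    for chunks in pages[1:]:
        common &= set(chunks)
    return common


def strip_shared(pages):
    """Drop every chunk that is identical on all pages."""
    common = shared_chunks(pages)
    return [[c for c in chunks if c not in common] for chunks in pages]


# Hypothetical per-page chunks; "WATERMARK" stands in for the repeated code.
pages = [["text A", "WATERMARK"], ["WATERMARK", "text B"]]
print(strip_shared(pages))  # [['text A'], ['text B']]
```

Each command list can be handed to `subprocess.run(cmd, check=True)`. Running headers or footers would also show up as shared chunks, so look at the candidates before deleting anything.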
I agree; it's a fun and inspiring story, and the automation is great, but IMHO getting a PDF of a DRM'd book that consists entirely of bitmap images of each page would be near the bottom of my list of options in the same situation; it's probably only somewhat above the "analogue hole" option of taking a picture of each page with a camera pointed at the monitor. Unless bitmaps are what the downloaded PDF consisted of, it would be far better to work with the PDF directly, since it is intrinsically a vector format. Removing the watermarks from the page content stream (it's a PostScript-like language) would be superior in terms of file size and quality.
A few years ago I was involved in a project that required capturing content from an online document viewer site that used a SWF for each page, and our product basically did a vector-to-vector conversion from SWF to PDF. One of our competitors' products took the "render each page as an image, and combine the images into a PDF" approach, and the output difference was amazing. IIRC it was around 2 orders of magnitude in size, and 4 in speed of generation. (They used Java, and we used C, which might account for some of that too, so it's not a totally fair comparison.)
How does that preserve things like layout or equations?
I've done similar (though usually lower-effort) textbook trickery a few times. The Adobe Inept hack is very handy. Oh, and a recent one was stupidly easy: you could view the ebook in your browser and save excerpts as a PDF, but only 100 pages in total per book. The problem was that it stored how many pages you had saved in a cookie, so "Clear the last 5 minutes of browsing history" and you could get another 100 pages; rinse and repeat for the whole book, then staple the files together with pdftk.
I think the "plain text representation" parent is referring to is the PostScript that defines the page. If the equations are rendered in PS, or are inline images, they might survive the roundtrip conversion?
Yes, it's a jumble of PostScript fragments, base64-encoded images and PDF metadata. Everything that's needed to reconstruct the original PDF, but in a form that's safe to edit in a text editor.
You might want to compress the PDF again when you're done (my understanding is that part of what makes PDFs non-plain-text is the binary compression encodings within the PDF container).
PDFs are typically full of PostScript-like drawing code (unless they're just scanned images), and that code is just text. As long as you keep it valid, you could remove the watermark by just deleting that text.
I didn't know about PDFtk, but Ghostscript can take a PDF and turn it into plain-text PostScript, and it can reverse the process.
Ghostscript has always either rasterized embedded fonts or converted them to individual strokes when I try that, but there are like 50 options for pdf->ps, so perhaps I've got it wrong.
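For reference, a minimal sketch of the Ghostscript round trip being discussed, using the `ps2write` and `pdfwrite` output devices. The helper names are made up, and, as noted above, how fonts survive the trip depends on the file and on options not shown here.

```python
def gs_convert_cmd(src, dst, device):
    """argv for a non-interactive Ghostscript conversion to the given device."""
    return [
        "gs",
        "-dNOPAUSE",             # don't wait for a keypress between pages
        "-dBATCH",               # exit when the input has been processed
        f"-sDEVICE={device}",
        f"-sOutputFile={dst}",
        src,
    ]


def pdf_to_ps_cmd(src, dst):
    """PDF -> PostScript via the ps2write device."""
    return gs_convert_cmd(src, dst, "ps2write")


def ps_to_pdf_cmd(src, dst):
    """PostScript -> PDF via the pdfwrite device."""
    return gs_convert_cmd(src, dst, "pdfwrite")


print(" ".join(pdf_to_ps_cmd("book.pdf", "book.ps")))
```

Pass either list to `subprocess.run(cmd, check=True)` on a machine with `gs` installed.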
PDFtk won't work with some of the more modern PDFs. Adobe added in another layer, kept the same extension and blocked out most of the 3rd party readers. Some tool called LiveCycle I think.
I encountered this in the lower-division general education courses at my university. The most egregious case was the professor who required us to get the current edition of the textbook that he wrote himself (since he would be assigning problems out of the new book, which was just a shuffled version of the previous edition). Fitting that it was an economics course.
Once I got into the CS courses, most if not all of my professors just provided PDFs of either their own material or some open source textbook they were contributing to.
I have this idea that any material that uses DRM should not be covered by copyright simply because it removes itself from things that will end up in the public domain.
For me, one of the wonderful things about copyright is that works always end up available for free to the general public. A DRMed work will never be free in that sense, and should then not be covered by the regular legal protections.
TTIP and similar deals fuck it up for the rest of us as well.
I say a maximum of 25 years of free copyright (I'd rather see something like 5-10 years), and then progressively increasing fees that start becoming crazy after something like ten years.
Doesn't seem like he actually cracked any DRM - he downloaded the book (as he was apparently entitled to do under his license, 10 pages at a time) and used an image editor to remove the watermarks. The digital equivalent of printing it and using whiteout to remove the watermark.
I think it would be hard to prove that he cracked any DRM.
Cracking DRM should not be an offense if you already bought the thing. Except that the DRM lobby got it declared illegal (through a corrupt anti-circumvention law).
There is safety in numbers. The more people do this, the less any particular one will be targeted.
Consider that not so long ago almost everyone talked about pirating something, usually with torrents or some other P2P, and nothing happened to the overwhelming majority of them.
They have his name, and where he studies, so they use the legal process (or Google) to get a list of his classes, and a list of the books used for those classes.
That doesn't seem particularly difficult for them to do. I doubt anyone would actually bother, but still, it's not tricky.
As someone interested in Clojure, it's cool to see stuff like this built with it. Luckily I haven't needed to purchase expensive books since freshman year. Everything I've been required to get, I either rent on Amazon for $30 or find online.