Getting a full PDF from a DRM-encumbered online textbook (vgel.me)
109 points by mr_tyzik on Oct 21, 2015 | 32 comments


I've done the same thing before on a book in college, except my DRM wasn't nice enough to let me print 10 pages at a go. Instead, I spun up a virtual X server with Xdummy at a resolution of 1200x10000, which showed a few dozen pages at a time. Then I automated screenshots (scrot) and PageDown (xdotool). Finally, some PIL magic to find the thin gray line between pages, plus convert and Ghostscript, and I had a PDF!
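A rough Python sketch of that capture loop, assuming the same tools; the iteration count, the delay, and the exact separator gray are all guesses:

    import subprocess, time
    from PIL import Image

    # Screenshot, PageDown, repeat -- each shot holds a few dozen pages.
    # Assumes DISPLAY points at the Xdummy server.
    for i in range(40):                      # enough passes to cover the book
        subprocess.run(["scrot", f"shot_{i:03d}.png"], check=True)
        subprocess.run(["xdotool", "key", "Page_Down"], check=True)
        time.sleep(1)                        # let the viewer re-render

    # Find the thin gray separator rows so pages can be cropped apart.
    def separator_rows(path, gray=(204, 204, 204)):  # gray value is a guess
        img = Image.open(path).convert("RGB")
        w, h = img.size
        px = img.load()
        return [y for y in range(h)
                if all(px[x, y] == gray for x in range(w))]

From there, convert and Ghostscript turn the cropped page images into a PDF.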


Watermarks and such are easy to remove with PDFtk, the Swiss Army knife of PDF files. Convert the input files to a plain-text representation, find the code that implements the watermark (it will be the only code that's identical on every page), delete it, and convert back. Easy as pie. It will also concatenate the partial files.
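A sketch of that workflow with pdftk driven from Python; WATERMARK_SNIPPET is a placeholder for whatever operator sequence draws the watermark, and overwriting it with spaces (rather than deleting it) keeps the stream /Length entries valid:

    import subprocess

    # Decompress the streams so the page content is editable text.
    subprocess.run(["pdftk", "in.pdf", "output", "plain.pdf", "uncompress"],
                   check=True)

    # Blank out the watermark drawing code (identical on every page).
    snippet = b"WATERMARK_SNIPPET"           # placeholder bytes
    data = open("plain.pdf", "rb").read()
    data = data.replace(snippet, b" " * len(snippet))
    open("edited.pdf", "wb").write(data)

    # Recompress into a normal PDF.
    subprocess.run(["pdftk", "edited.pdf", "output", "out.pdf", "compress"],
                   check=True)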


I agree; it's a fun and inspiring story, and the automation is great, but IMHO a PDF of a DRM'd book that consists entirely of bitmap images of each page would be near the bottom of my list of options if I were in the same situation; it's probably only somewhat above the "analogue hole" option of pointing a camera at the monitor and photographing each page. Unless that's what the downloaded PDF consisted of anyway, it would be far better to work with the PDF directly: it is intrinsically a vector format, and removing the watermarks from the page content stream (it's a PostScript-like language) would be superior in terms of file size and quality.

A few years ago I was involved in a project that required capturing content from an online document viewer site that used a SWF for each page, and our product basically did a vector-to-vector conversion from SWF to PDF. A competitor's product took the "render each page as an image, and combine the images into a PDF" approach, and the output difference was amazing: IIRC around 2 orders of magnitude in size, and 4 in speed of generation. (They used Java and we used C, which might account for some of that too, so it's not a totally fair comparison.)


How does that preserve things like layout or equations?

I've done similar (though usually lower-effort) textbook trickery a few times. The Adobe Inept hack is very handy. Oh, and a recent one was stupidly easy: you could view the ebook in your browser and save excerpts as a PDF, but only 100 pages in total per book. Problem was, it stored how many pages you had saved in a cookie, so "Clear the last 5 minutes of browsing history" got you another 100 pages; rinse and repeat for the whole book, then staple the files together with pdftk.
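The stapling step is a one-liner; assuming the excerpts were saved as part01.pdf, part02.pdf, and so on:

    import subprocess
    from glob import glob

    # pdftk concatenates the partial PDFs in filename order.
    parts = sorted(glob("part*.pdf"))
    subprocess.run(["pdftk", *parts, "cat", "output", "book.pdf"], check=True)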


I think the "plain text representation" parent is referring to is the PostScript that defines the page. If the equations are rendered in PS, or are inline images, they might survive the roundtrip conversion?


Yes, it's a jumble of PostScript fragments, base64-encoded images and PDF metadata. Everything that's needed to reconstruct the original PDF, but in a form that's safe to edit in a text editor.


You might want to compress the PDF again when you're done (my understanding is that part of what makes PDFs non-plain-text is binary compression encodings within the PDF container).


Yeah, that was what I was thinking.

PDFs typically are full of PostScript (except if they're just scanned images), which is just a text rendering language. As long as you keep the PostScript valid, you can remove the watermark by just deleting that text.

I didn't know about PDFtk, but Ghostscript can take a PDF and turn it into textual PostScript, and it can reverse the process.
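Something like this round trip, using Ghostscript's ps2write device (with the caveat from the reply below that embedded fonts may not survive intact):

    import subprocess

    # PDF -> editable PostScript.
    subprocess.run(["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=ps2write",
                    "-sOutputFile=book.ps", "book.pdf"], check=True)

    # ... edit book.ps in a text editor ...

    # PostScript -> PDF again (ps2pdf is a thin wrapper around gs).
    subprocess.run(["ps2pdf", "book.ps", "book-clean.pdf"], check=True)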


Ghostscript has always either rasterized embedded fonts or converted them to individual strokes when I try that, but there are like 50 options for pdf->ps, so perhaps I've got it wrong.


PDFtk won't work with some of the more modern PDFs. Adobe added another layer, kept the same extension, and blocked out most of the third-party readers. Some tool called LiveCycle, I think.

But for normal PDFs PDFtk is incredibly useful!


LiveCycle is only really used for interactive forms, and it's not part of the PDF spec.


DRM-laden online books are a red flag that the college you are attending is a thinly veiled profit center, not an education provider.

Has anyone made an index of which colleges require DRM textbook purchases in their courses?


I encountered this in the lower-division general education courses at my university. The most egregious case was the professor who required us to get the current edition of the textbook he wrote himself, since he would be assigning problems out of the new book (which was just a shuffled version of the previous edition). Fitting that it was an economics course.

Once I got into the CS courses, most if not all of my professors just provided PDFs of either their own material or some open source textbook they were contributing to.


Going to guess 95%+ make use of online access codes in freshman classes. That saves the TAs for upper-level classes.


> DRM-laden online books are a red flag that the college you are attending is a thinly veiled profit center, not an education provider.

So then most US colleges? Like spectralblu said, this is common in lower-level classes, especially math classes for some reason.


I have this idea that any material that uses DRM should not be covered by copyright, simply because it removes itself from the pool of things that will end up in the public domain.

For me, one of the wonderful things about copyright is that works always end up available for free to the general public. A DRMed work will never be free in that sense, and so should not be covered by the regular legal protections.


As long as Disney is around, nothing will ever enter the public domain again.


TTIP and similar deals fuck it up for the rest of us as well.

I say a maximum of 25 years of free copyright (I'd rather see something like 5-10 years), and then progressively increasing fees that start becoming crazy after something like ten years.

Then use that money to finance culture.


I'm glad he did it, and published the result. But isn't cracking DRM a criminal offence? Is it wise to confess to it in a public forum?


Doesn't seem like he actually cracked any DRM - he downloaded the book (as he was apparently entitled to do under his license, 10 pages at a time) and used an image editor to remove the watermarks. The digital equivalent of printing it and using whiteout to remove the watermark.

I think it would be hard to prove that he cracked any DRM.


Did you read the same article I did?

He didn't download the book 10 pages at a time, and he didn't use an image editor to remove the watermarks.

He wrote a script that simulated navigating through the book with a mouse and keyboard in a browser, and generated a bitmap image of every page.


Cracking DRM should not be an offense if you already bought the thing. Except that the DRM lobby got it declared illegal (via corrupted anti-circumvention laws).



Indeed, that's why DRM has been broken and will keep being broken, despite any attempts by the DRM lobby to proliferate it.


> Is it wise to confess to it in a public forum?

There is safety in numbers. The more people do this, the less any particular one will be targeted.

Consider it was not so long ago that almost everyone talked about pirating something, usually with torrents or some other P2P, and nothing happened to the overwhelming majority of them.


They would have to prove which book he actually cracked.


That doesn't seem particularly difficult; prosecutors could just get a warrant to search his computer for it.


And which specific book would they be looking for?


They have his name, and where he studies, so they use the legal process (or Google) to get a list of his classes, and a list of the books used for those classes.

That doesn't seem particularly difficult for them to do. I doubt anyone would actually bother, but still, it's not tricky.


As someone interested in Clojure, it's cool to see stuff like this built with it. Luckily I haven't needed to purchase expensive books since freshman year. Everything I've been required to get, I either rent on Amazon for $30 or find online.


I notice he said the final product was around 700 megabytes, which is a bit absurd. What could he have done to make the final size more reasonable?


The PNGs are insanely high resolution, I think so that the OCR works better; it doesn't say, but I assume the OCR'd book could use a lower resolution.
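If the final PDF keeps the full-resolution PNGs, Ghostscript can re-distill it with downsampled images; a sketch using the /ebook preset, which targets roughly 150 dpi:

    import subprocess

    # Rewrite the PDF with images downsampled per the /ebook preset.
    subprocess.run(["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
                    "-dPDFSETTINGS=/ebook",
                    "-sOutputFile=smaller.pdf", "huge.pdf"], check=True)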



