Borb – A Python library to read, write, and edit PDF files

Syzygies · on Sept 18, 2021

I am a math professor with a scanned exam grading workflow that I hacked together as Bash scripts using various open source command line tools. I feed all the exams through a sheet-fed scanner, decode bar codes to identify problems and students, add radio buttons for entering and tracking scores (0-6 per problem), and create PDF "books" per problem for grading and annotating.

Having grad students help grade paper is a consistency nightmare: It's look once, never look back. Instead, after each of several provisional passes I recreate the PDF "book" for that problem, with a chapter for each score, and students randomized within each chapter. In the same spirit as "checking your work lets you work three times faster," this is actually both more consistent and faster than a single pass over paper. Almost all of my attention is on the math, which I'm good at, rather than locating problems and finding again the ones I know I misgraded, which I'm not good at.
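A minimal sketch of the chapter-per-score idea, assuming provisional scores are held in a plain dictionary (the names `build_problem_book` and `provisional` are hypothetical and not part of the original Bash workflow). It only computes the ordering; assembling the actual PDF pages is left to whatever tool does the merging.

```python
import random
from collections import defaultdict


def build_problem_book(scores, seed=None):
    """Return [(score, [student, ...]), ...] with one chapter per score.

    `scores` maps a student id to that student's provisional score.
    Students are shuffled within each chapter so that the grader does
    not meet them in the same order on every pass.
    """
    rng = random.Random(seed)
    chapters = defaultdict(list)
    for student, score in scores.items():
        chapters[score].append(student)

    book = []
    for score in sorted(chapters):
        students = sorted(chapters[score])  # fixed starting order, then shuffle
        rng.shuffle(students)
        book.append((score, students))
    return book


# Hypothetical provisional scores for one problem (0-6 scale).
provisional = {"ana": 6, "ben": 4, "cy": 6, "dee": 0, "eli": 4}
book = build_problem_book(provisional, seed=1)
for score, students in book:
    print(score, students)
```

Rebuilding the book after every pass means a regraded student simply moves to a different chapter the next time around.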

Then each student's exam needs to be extracted from these problem PDFs, scores recorded, and annotations frozen.

There are cloud services for grading. They're hopelessly primitive, with cloud lag. Like a gamer, I used to reject wireless mice because of the lag. I reject these services. I can grade everything myself faster than using a team of grad students, with the right local tools.

The PDF format is a morass. My hat's off to anyone who will work with it. There are many evolutionary layers and no formal specification or verification; one tests a PDF by seeing if most programs accept it.

It's time for me to rewrite my grading system in a modern scripting language, so others could use it. I prefer Ruby, but that's mainly to stave off boredom when I'm not using Haskell. I can use Python. This would permit a more robust workflow, such as adding late exams in mid-grading without losing grading in progress.

I can't find documentation for Borb, to check off the list of features I'd need. I suspect from this being a one-person project that I might need to continue to patch together external tools.

Svetlitski · on Sept 18, 2021

You should consider looking into Gradescope (Gradescope.com). As a former TA, I can attest to it making grading much more pleasant and streamlined than it would be otherwise.

Evidlo · on Sept 18, 2021

> There are many evolutionary layers and no formal specification or verification;

There is a specification, but it's very complicated.

hyperpallium2 · on Sept 18, 2021

PDF uses postscript, which is Turing equivalent. It's a document format with the halting problem.

mkl · on Sept 18, 2021

PDF does not use Postscript, and is not Turing complete. Its drawing model is based on Postscript's (with additions), but its instruction set is focused on drawing, and can't do programming. Here is the instruction set, with equivalent Postscript commands where applicable: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PD.... Most parts of the Postscript language have no PDF equivalent.

Other things like JavaScript and Flash can be embedded, but they are extras on top of the document.

dunham · on Sept 18, 2021

It's based on postscript, like json is based on javascript. I don't believe there are any control flow (or even general arithmetic?) instructions in the content streams, so I don't see how it could be Turing complete. It's just a sequence of drawing and transformation commands, like SVG.

I presume the extensions to embed javascript are Turing complete, and IMO do not belong in a PDF file.

I've also heard that some of the embedded font formats have features that are Turing complete, but I don't know the details on that.

If there is a way to implement a Turing machine in PDF, outside of the fonts and javascript, I'd love to see the details. (I know somebody managed it with just the macro expansion bit of TeX.)

maxerickson · on Sept 18, 2021

Postscript files are often rendered into PDF, but PDF doesn't use postscript.

Syzygies · on Sept 18, 2021

See dunham's answer. Anyone who has seen a PDF file in text form would swear they're looking at Postscript. There's a common intersection that's identical.

Very roughly speaking (this is a semantic debate where everyone is wrong from someone else's perspective), a PDF file is a restricted subset of Postscript, with added indexes so one can render pages in the middle without having to process the code from the beginning.

The hardship in generating PDFs from scratch is getting those indexes right. It's far easier to convert a Postscript file using standard tools.
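To make the point about indexes concrete, here is a rough sketch of writing a tiny PDF by hand in Python. The function name `build_pdf` and the three-object document are illustrative assumptions; real files also carry fonts, content streams, and often compressed cross-reference streams, none of which is handled here. The fiddly part is that every entry in the cross-reference table must hold the exact byte offset of its object.

```python
def build_pdf(objects):
    """Serialize object bodies plus an xref table with exact byte offsets.

    `objects` is a list of bytes; objects[i] becomes object number i + 1.
    Object 1 is assumed to be the document catalog.
    """
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 heads the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out), offsets


# A single blank US Letter page: catalog -> page tree -> page.
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
]
pdf, offsets = build_pdf(objects)
```

Editing any object body by hand after the fact shifts every later offset, which is why conversion tools that recompute the table are so much more comfortable.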

maxerickson · on Sept 19, 2021

That's fair. I was waffling about using stronger language, "pdf isn't postscript", but didn't get there. It would have been more correct.

kaba0 · on Sept 19, 2021

In the same way that JSON is a subset of JS, being a subset gives it radically different properties, so comparisons to Postscript are not really meaningful.

einpoklum · on Sept 18, 2021

That is not a realistic concern for GP. The scanned exams won't even involve any Postscript.

Syzygies · on Sept 18, 2021

I do insert the score radio buttons using the "pdfmark" mechanism via Postscript.

selfhoster11 · on Sept 18, 2021

HTML uses JavaScript which is also Turing complete. Lack of the halting problem is nice, but having it is not a show stopper.

Tijdreiziger · on Sept 18, 2021

Perhaps you might be interested in the Zesje project: https://gitlab.kwant-project.org/zesje/zesje

mkl · on Sept 18, 2021

More information: https://sandbox.grading.quantumtinkerer.tudelft.nl/, https://zesje.tudelft.nl/about/

It sounds like it's still mostly a prototype?

Tijdreiziger · on Sept 19, 2021

Sorry, I should have elaborated in my initial comment.

I was briefly involved as a developer several years ago (as part of my bachelor's thesis). At that time, it was mostly beta-quality, but it was already in use by multiple professors for grading. I haven't been involved with the project since, so I'm not sure about the current status.

I think the homepage [1], which you linked to and where it mentions that it's still a prototype, is at least somewhat outdated; it has a screenshot of a very old version of the software. At least the 'support' section still looks accurate, though.

If you're interested in using it, I would advise getting in touch via the Mattermost channel or mailing list (both linked to from the homepage [1]) and asking about the current state of the project. Tell them Jamy sent you :)

[1] https://sandbox.grading.quantumtinkerer.tudelft.nl/

jhgb · on Sept 18, 2021

> Having grad students help grade paper is a consistency nightmare: It's look once, never look back.

Captcha-ize them, with several of them grading the same result, and with checking their responses against each other?
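One way this cross-checking could look, as a small sketch: each answer is scored independently by several graders, and any answer whose scores disagree is sent back for another look. The data layout and the name `flag_disagreements` are assumptions made for illustration.

```python
def flag_disagreements(grades, tolerance=0):
    """Return the exam ids whose independent scores differ too much.

    `grades` maps exam id -> {grader name: score}. An exam is flagged
    when it has at least two scores and the spread between the highest
    and lowest exceeds `tolerance`.
    """
    flagged = []
    for exam_id, by_grader in grades.items():
        scores = list(by_grader.values())
        if len(scores) >= 2 and max(scores) - min(scores) > tolerance:
            flagged.append(exam_id)
    return sorted(flagged)


# Hypothetical scores for one problem on a 0-6 scale.
grades = {
    "exam-01": {"ta_a": 5, "ta_b": 5},
    "exam-02": {"ta_a": 2, "ta_b": 4},
    "exam-03": {"ta_a": 6, "ta_b": 5},
}
needs_review = flag_disagreements(grades, tolerance=1)
```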

gettalong · on Sept 20, 2021

If you want to do something like this in Ruby, have a look at HexaPDF - https://hexapdf.gettalong.org/ - which provides a full-blown implementation for reading and writing PDFs and is quite mature already (n.b. I'm the author).

It is licensed AGPL+Commercial but if you just use it for yourself, this does not matter as you can use the AGPL.

strzibny · on Sept 18, 2021

Would be cool to know the differences to Ruby's HexaPDF. One is certainly the license.

senorsmile · on Sept 19, 2021

Would you be willing to list all of the command line tools you're using?

spapas82 · on Sept 18, 2021

Haven't tested this lib, however be careful before including it in your project because of its license (it is dual licensed AGPL/commercial). This means that you can use it only if your project is also released under the AGPL (or a compatible license); otherwise you need a commercial license.

On the other hand, the reportlab pdf generation library (which is what I actually use) offers a permissive license in its open source version (and a commercial reportlab plus version), so it can be included in all kinds of projects.

mkl · on Sept 18, 2021

Confusingly, while the README says AGPL/commercial, the LICENSE file (https://github.com/jorisschellekens/borb/blob/master/LICENSE) says it's GPL.

mdaniel · on Sept 18, 2021

I'd guess it's just a copy-pasta error, because all the source files contain the AGPL header text: https://github.com/jorisschellekens/borb/blob/v2.0.9.1/borb/... and https://github.com/jorisschellekens/borb/blob/v2.0.9.1/borb/... for example

But the plot thickens! It seems the top-level LICENSE file was actually changed 13 days ago _away_ from AGPL https://github.com/jorisschellekens/borb/blame/master/LICENS...

So, yeah, confusingly for sure

sigg3 · on Sept 18, 2021

An ambiguous license situation is a red flag.

Was the issue raised with the author?

xjlin0 · on Sept 18, 2021

ReportLab is indeed a great library. Another great one is WeasyPrint. How are these compared with Borb?

wodenokoto · on Sept 18, 2021

Some books start counting pages after the table of contents, so the PDF page number and the book page number are not in sync.

I’ve seen some PDFs have the first few pages counted in Roman numerals and then “normal” numbers for the main content.

How do you edit an existing pdf to do that?

shakna · on Sept 18, 2021

Page numbers are part of the PageInfo object within a PDF, but it looks like currently this is generated naively [0], so I don't think multiple numbering schemes are currently supported by this library.

[0] https://github.com/jorisschellekens/borb/blob/master/borb/pd...
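For context, the PDF specification expresses mixed numbering through the optional /PageLabels entry of the document catalog: a list of ranges, each starting at a page index and carrying a style (for example /r for lowercase Roman or /D for decimal) and an optional start value. The sketch below does not use borb; it only shows, in plain Python with hypothetical helper names, how such ranges turn into the labels a viewer displays.

```python
def to_roman(n):
    """Lowercase Roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_labels(page_count, ranges):
    """Compute the display label of every page.

    `ranges` maps the zero-based index of the first page of a range to
    (style, start), where style is "r" (Roman) or "D" (decimal). A range
    must start at index 0, and each range runs until the next one begins.
    """
    labels = []
    for index in range(page_count):
        if index in ranges:
            style, start = ranges[index]
            first = index
        number = start + (index - first)
        labels.append(to_roman(number) if style == "r" else str(number))
    return labels


# Four pages of front matter in Roman numerals, then the main text from 1.
labels = page_labels(6, {0: ("r", 1), 4: ("D", 1)})
```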

alephu5 · on Sept 18, 2021

Amazing, I've been yearning for something like this for years but have always been told it's impossible. Can't wait to try it

eurasiantiger · on Sept 18, 2021

It is literally impossible for all PDFs, since some of them may not have any kind of semantic structure and consist of a set of glyph bitmaps laid out at specific coordinates to make up blocks of text.

anonymouse008 · on Sept 18, 2021

The best PDF bug is when the linker between the Adobe character value and the font/language is broken and you get random Unicode-like values with no way to connect the two.

Infuriating.

dunham · on Sept 18, 2021

Some PDF files intentionally include a bad character mapping table (and reorder the font) as a form of DRM.

mkl · on Sept 18, 2021

It's not very effective against anyone determined though. You can OCR easily. You can also rebuild the character mapping from the shapes of the glyphs, and in most languages there are few enough that you can even do it by hand.

mcswell · on Sept 18, 2021

You can OCR if you're using a Latin script with few if any accent marks. Depending on your OCR engine, I suppose you could do Cyrillic too. Other scripts, not so much. (And yes, I'm a computational linguist, so we deal with non-Roman scripts all the time, particularly Arabic script. But I suppose that's not a problem for most people here :-).)

There might be some Latin script fonts that cause problems, but I haven't looked into that very much--I do recall we had problems with an italic font.

dunham · on Sept 18, 2021

When I came across this I already had a pristine copy of the font, so I just compared the program for each character to determine the mapping. (I was automating the decoding.) I agree that there is little to no security there.

But the point that I was not so clearly trying to make was that sometimes the messed up encoding is intentional and not a bug.
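The comparison described above can be sketched roughly as follows, assuming the glyph programs have already been extracted as raw bytes from both the scrambled embedded font and a pristine copy (the extraction itself is not shown, and `rebuild_mapping` is a hypothetical name). Codes whose outline matches no reference glyph, or more than one, are reported rather than guessed.

```python
def rebuild_mapping(scrambled_glyphs, reference_glyphs):
    """Recover code -> character by matching identical glyph programs.

    scrambled_glyphs: {character code in the PDF: glyph program bytes}
    reference_glyphs: {real character: glyph program bytes}
    Returns (mapping, unresolved_codes).
    """
    by_outline = {}
    for char, outline in reference_glyphs.items():
        by_outline.setdefault(outline, []).append(char)

    mapping, unresolved = {}, []
    for code, outline in scrambled_glyphs.items():
        candidates = by_outline.get(outline, [])
        if len(candidates) == 1:
            mapping[code] = candidates[0]
        else:
            unresolved.append(code)  # no match, or an ambiguous one
    return mapping, sorted(unresolved)


# Toy data: the byte strings stand in for real glyph programs.
reference = {"a": b"\x01\x02", "b": b"\x03\x04", "c": b"\x05\x06"}
scrambled = {17: b"\x03\x04", 42: b"\x01\x02", 99: b"\xff"}
mapping, unresolved = rebuild_mapping(scrambled, reference)
decoded = "".join(mapping[code] for code in [42, 17, 17, 42])
```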

rstuart4133 · on Sept 19, 2021

There are a few PDF Python libraries and open source programs out there, but as far as I can tell all lack one feature: signing. If anyone knows of an open source toolkit or library that can sign, I'd be most appreciative.

jl6 · on Sept 18, 2021

Always good to see more open source tools for the PDF ecosystem.

I couldn’t see any support for PDF/A (the good version of PDF) in borb though.

pixelmonkey · on Sept 18, 2021

The README in the GitHub repo for borb is a bit of a better explainer than this landing page (especially for Python programmers).

https://github.com/jorisschellekens/borb/blob/master/README....

avnigo · on Sept 18, 2021

And the example repository shows off its capabilities better too:

https://github.com/jorisschellekens/borb-examples

mrweasel · on Sept 18, 2021

The examples are pretty useful, seeing as there apparently isn't any documentation yet.

It's a really amazing project. One of those that makes you go: "Wait, we didn't have this before?"

nickspain · on Sept 18, 2021

This is awesome! I've been looking for something that could be a link between PDFs and Instapaper[0] for a while. This looks like it'll be perfect to build such a tool with.

[0] https://www.instapaper.com

chrismorgan · on Sept 18, 2021

I have a PDF of a hymn book that I want to convert to 2-up so I can use it that way on my reMarkable which only supports single-page display and not two-page spreads; but I also want all the internal hyperlinks (to and from a table of contents) to keep working. I haven’t found any software that seems capable of doing this (though I’ve only looked at FOSS; wouldn’t surprise me if something Adobe could do it). The closest I seem to have found is qpdf which might be able to do it with some programming effort.

Is that sort of thing going to be in scope for this library’s editing capabilities? (“Editing PDFs” is such a broad, open-ended thing.)
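Independent of any particular library, the geometry of a 2-up layout is simple enough to sketch. The helpers below are hypothetical and only compute which source pages share a sheet and the scale and offsets at which to place them. Presumably each link rectangle would need the same scale and offset applied, and each link target remapped to its new sheet, for the table of contents to keep working; that part is an assumption and is not shown.

```python
def two_up_placements(page_w, page_h, sheet_w, sheet_h):
    """Scale and lower-left offsets for two pages side by side on a sheet.

    Each source page is scaled uniformly to fit one half of the sheet
    and centered inside that half. Units are PDF points.
    Returns (scale, [(x_left, y), (x_right, y)]).
    """
    cell_w = sheet_w / 2
    scale = min(cell_w / page_w, sheet_h / page_h)
    w, h = page_w * scale, page_h * scale
    y = (sheet_h - h) / 2
    x_left = (cell_w - w) / 2
    x_right = cell_w + (cell_w - w) / 2
    return scale, [(x_left, y), (x_right, y)]


def pair_pages(page_count):
    """Group zero-based page indices into (left, right) pairs.

    The last pair has None on the right when the page count is odd.
    """
    return [(i, i + 1 if i + 1 < page_count else None)
            for i in range(0, page_count, 2)]


scale, slots = two_up_placements(100, 200, 400, 200)
pairs = pair_pages(5)
```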

cycomanic · on Sept 18, 2021

So how does this compare to the Python bindings of MuPDF, which IMO is the most featureful module for manipulating PDFs in Python? (I'm a bit baffled by all the comments that something like this didn't exist before.)

jacob019 · on Sept 18, 2021

Borb is pure python. MuPDF is C with bindings. Reportlab is pure python.

I use reportlab combined with PyPDF2 and pdf-redactor. It would be nice to see a comparison with the existing tools.

einpoklum · on Sept 18, 2021

So, are there decent lower-level (e.g. C++ or even C) libraries for doing this, which this library wraps? Or does this actually do the nitty-gritty PDF innards itself?

As for myself, I've not had to automate work on PDFs, luckily; for manual manipulation and annotation I've found Xournal++ sort of useful (https://xournalpp.github.io/). Inkscape can also be used with some questionable PDFs.

hansvm · on Sept 18, 2021

> or does this actually do the nitty-gritty PDF innards itself?

> with borb, a pure python library

sneak · on Sept 19, 2021

I was going to be upset if the project logo were not a fat birb. I was not disappointed. :)