Automated PDF Reports with Python Notebooks

apwheele · on June 29, 2022

FYI for folks if you want to convert jupyter notebooks to html you can use nbconvert:

    jupyter nbconvert --execute example_report.ipynb --no-input --to html

I have never been able to get a nice looking report in both html/PDF though via the tools to print html to PDF (as opposed to building a nice looking PDF doc using LaTeX directly). So default just worry about making the html look nice.

pplonski86 · on June 29, 2022

The Mercury uses `nbconvert` internally to convert notebooks, so you don't need to execute them manually. It also adds additional features to make it more user-friendly:

- you can add widgets (parameters) to the notebook, without rewriting the notebook,

- you can easily schedule the notebook,

- you can convert notebook to PDF with mouse click,

- you can serve interactive HTML notebooks with a Django-based server.

anyfactor · on June 29, 2022

That is super interesting. Outside of Jupyter notebooks, I have been generating PDF reports with Jinja and Pandoc.

- Convert dataframe and other stuff to HTML render them to usable HTML using Jinja templates.

- Generate PDF from html using Pandoc with Xlatex

annowiki · on June 29, 2022

I've been doing this but without the jinja. It's trivial to store a latex file or markdown file with something like %PLACEHOLDER% and then just run

    with open('template.tex', 'rt') as f_in:
        template = f_in.read()
    template = template.replace('%PLACEHOLDER%', data)
    fname = f'rendered-{dt.now().isoformat()}'
    with open(fname, 'wt') as f_out:
        f_out.write(template)
    fileout = 'outdir'
    subprocess.call(['pdflatex',  '--output-directory', fileout, fname])

pplonski86 · on June 29, 2022

There are many ways to build PDF reports with Python. The notebook approach gives some WYSIWYG while report building. Additionally, when used with Mercury framework you get scheduling and email notifications.

tpoacher · on June 29, 2022

Same. I hate jupyter notebooks with a passion.

I'm generating beautiful html/pdf reports with a simple bashscript, and minimal markup added directly to the source code. Works like a charm.

pea · on June 29, 2022

You might find the Python framework we've been working on helpful for that use-case: https://github.com/datapane/datapane. It allows you to create interactive HTML reports comprised of pandas DataFrames, plots, and UI elements (e.g. dropdowns, selects, pages).

Standalone HTML files provide a really nice alternative to PDF as they maintain interactively: you can host them static sites, allow people to download data, use plots interactively, click through pages, similar to a statically generated website. That said, there is still a definite blocker in non-technical people receiving a .HTML file over email and immediately thinking it's suspicious or a virus (doesn't help that gmail has such poor support for them.) It's a shame, because PDFs have so many warts and HTML can be used as a really nice distributable file format, especially as you can make them fully standalone by baking in datasets, plots, libraries, etc. so they can be used without network.

IMO Jupyter is great at what it is - a REPL - but, outside of sharing a step-by-step "here are the steps I took to come to this answer", isn't the ideal format for sharing insights, as there is no reason a report would follow the same narrative as the analysis itself.

pplonski86 · on June 29, 2022

What do you hate the most about jupyter notebooks?

tpoacher · on June 29, 2022

There is no single thing, it's more like death by a thousand papercuts.

But if I did have to pick "the one" (I see you trying to pull a Scott Adams, good stuff) it would be that it has the semantics of Literate Programming completely backwards.

Literate programming is about 'code' first. It should be that you have a codebase which honours code semantics first, and allows you to create a meaningful report from that code second. This guarantees that you have a codebase structure that is easy to traverse and read, while still following good software engineering practices and usable as code in itself, as well as being able to trivially create a report from it.

Jupyter notebooks is an app (not even just markup, but an app, and a clunky one at that) for writing reports first, in the form of snippets from which you may or may not be able to generate useful code or outputs later on. And ultimately, even if you do, that code is crappy and unusable in any context other than the notebook it was written for, because of the way jupyter notebooks are designed. They force the programmer to create a monolithic spaghetti structure, preventing modularity or meaningful code hierarchies to take place in the codebase, promoting imports and definitions appearing just before the point of their use rather than in reasonable scopes, promoting modifiability instead of extensibility, while preventing programmers from using a vast majority of appropriate software tools for versioning, testing, etc.

And for some bizzare reason they are becoming the de facto standard for data scientists sharing code. It's like we're encouraging people to return to "Academic Matlab" code-quality standards all over again after years and years of trying to teach academics proper software engineering practices.

aldanor · on June 30, 2022

No, Jupyter is not an app for writing reports and it's definitely not its primary use across DS.

When it may take 5 minutes to simply load the data, you can no longer rerun some Python scripts to mess around and experiment with stuff. You need that kernel with data and everything else to have 100% uptime.

Similarly, you may want your temporary experiment results that may have taken a while to compute to stay within the kernel even if you're already working on something else.

That, plus an ease of inline visualisation, displaying tables and all that.

agoose77 · on June 29, 2022

Eh, I think this misses the point of why Jupyter Notebooks are useful, and who is using them.

I agree that in terms of literate programming as Knuth defined it, Notebooks are not great. There are tools to improve that story; I wrote https://github.com/agoose77/literary which at least lets you do a bit more "tangling and weaving" than you can out of the box. It doesn't let you define functions in arbitrary order, or implement fragments of a code block, but it does let you "boil down" a literate representation into something that is zero-cost at runtime and imports. There's also nbdev, although it's not my cup of tea.

The real point, though, is that most data-scientists aren't using (imo) notebooks to write and share libraries of code. Instead, they're using notebooks as semi-reproducible reports. I'm a physicist, and that's what I've been using Jupyter for. For me, Jupyter Notebooks are fantastic - the cell mechanism lends itself to rich-outputs that augment the narrative, and present the information in-line with the code that wrote it.

For me, the biggest gap here is writing _libraries_ that are leveraged in these notebooks. That's why I wrote Literary - to try and resolve some of the pain points that currently require you to use two tools (Jupyter Lab & e.g. PyCharm). I'm not saying it will work for everyone, or solve all of the problems, but for me it's enough to write my analysis as a package, so that's a limited success in my book.

agoose77 · on June 30, 2022

aldanor also mentions another use case which I only really allude to despite it being an important part of the process: exploratory work. Having a live kernel that maintains kernel state with the benefits of rich outputs is a mainstay of research.

cinntaile · on June 29, 2022

I don't know about the grandparent but I hate the browser based Jupyter notebooks, luckily I found that you can run Jupyter notebooks inside of Visual Studio. It's a more fluent experience overall.

pplonski86 · on June 29, 2022

Hey! Author here. I'm working on open-source framework called Mercury. I'm building it to make notebooks sharing easy, especially with non-technical users. The Mercury can turn Python notebook to web application, dashboard, presentation, REST API or report. It has option to easily hide the code, schedule automatic execution, convert to PDF, and send email notifications. The Github repo: https://github.com/mljar/mercury

osullivj · on June 29, 2022

Do you have a Mercury vs Voila comparison?

pplonski86 · on June 29, 2022

Mercury has a different architecture. The Voila keeps the live kernel and connection to UI (using Tornado framework). To use Voila you need to add widgets (with ipywidgets) to the notebook (mix UI code with analytics code). The Mercury generates UI based on the YAML header (very simple, no need to mix UI with analytics code). When user tweaks widgets values, the whole notebook is executed with new parameters and converted with nbconvert (using Django + Celery). Mercury can serve multiple notebooks to multiple users on one server. It has option to export notebook to PDF or HTML. You can schedule the notebook execution with crontab string (for example `schedule: '30 8 * * 1-5'`) and add email notifications. You can easily add authentication to notebooks. It was designed to make notebook sharing fast and easy.

What is more, I'm thinking about including Voila into Mercury. So you will be able to serve Voila apps with Mercury. The end-goal is to make notebooks sharing easy.

monkeydust · on June 29, 2022

A lot of B2B enterprise software follows this scenario:

- Ingest some client data.

- Process client data - so the bit that adds the value for client, normally this will involve some cleaning, normalization, algos, external enrichment.

- Produce human readable output as a pdf (tables, graphs, charts etc)

- Repeat this on a set schedule.

This certainty looks like a solid foundation to use to solve for these types of scenarios

MilStdJunkie · on June 29, 2022

I wonder why they didn't pick up Weasy on the PDF build side? That's my go-to if I have access to a Python environment.

Probably too resource hungry, or some flippery making Jupiter work with it. I've seen Jupiter all over the place but haven't played with it.

https://github.com/Kozea/WeasyPrint

https://weasyprint.org/

ropeladder · on June 29, 2022

This looks pretty cool, is there a way to do column layout (or any kind of positioning) when making a PDF, or is it just flowing the notebook html?

pplonski86 · on June 29, 2022

Right now it is flowing the notebook html. It should be possible to add some layout library (Python library that will do HTML+CSS to get nice layout).

Recently I did similar for displaying numbers in the notebook as a good looking boxes. I created a small Python package that takes the number and creates a box with borders with HTML+CSS. It is pretty handy for building dashboards in Python. The package name is Bloxs https://github.com/mljar/bloxs