It sounds like you are using the tool wrong. Jupyter notebooks are strictly superior to anything else (namely: code only, spreadsheets, matlab/octave) at their primary use case, which is interactive data science (writing code to manipulate some data, while actively revising the code, or sharing the results of that code with others).
Nothing even comes close. There's a reason it's dominant in the data science field.
Your workflow works for you but the jupyter workflow works for millions of students, data scientists, and even developers. Heck I even know all the ways to avoid jupyter, and I still use it often, because it's so convenient.
Yeah but jupyter notebooks suck at providing reproducible data science. I encourage my teams not to use Jupyter for data science.
Our preferred toolchain is based on make to build data science pipelines. Every step is scripted, and make ensures that upstream data or script changes trigger the downstream steps to rerun, ending with charting in gnuplot or similar. Our output charts are all not only timestamped but also carry a git commit id. And our source repositories contain a data manifest, so we have commit IDs reaching right back through the ETL stages into the DB.
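To give a flavour of the commit-id stamping (hypothetical file names; the real charting step is gnuplot, but a Python sketch shows the idea):

    # stamp_chart.py -- hypothetical charting step; the real pipeline drives gnuplot from make
    import datetime
    import subprocess

    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt
    import pandas as pd

    # record exactly which commit produced this chart
    commit = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    stamp = datetime.datetime.now().isoformat(timespec="seconds")

    df = pd.read_csv("results.csv")  # produced by an upstream make target
    ax = df.plot(x="date", y="value")
    ax.figure.text(0.99, 0.01, f"{stamp}  {commit}", ha="right", va="bottom", fontsize=6)
    ax.figure.savefig("results.png", dpi=150)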
End result is that in a couple of months, when the CxO asks about some piece of work and pulls out a chart, we can trace the entire data pipeline used to create it, and reproduce it if required. That saves so much hassle!
> Yeah but jupyter notebooks suck at providing reproducible data science.
That depends on how you use the notebooks.
With just a tiny bit of discipline, you can integrate notebook users into your sane workflow. For example, encourage people to restart the kernel and run all cells a few times per day (and definitely before sharing anything). Meaningful output artifacts can be saved into files that are later read by the notebook and displayed.
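For instance (made-up names), a scripted step can write its result to disk, and the notebook just reloads and displays it:

    import pandas as pd

    # scripted step: compute the artifact and save it to a file
    df = pd.read_csv("data/sales.csv")            # hypothetical input
    summary = df.groupby("region")["sales"].sum()
    summary.to_csv("summary.csv")

    # notebook cell: read the saved artifact back and display it
    summary = pd.read_csv("summary.csv", index_col="region")
    summary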
Then, when users are satisfied with their notebook, they save it as a Python file using jupytext and commit it to git.
This workflow integrates well with your makefile setup: to reproduce the notebook and obtain its results you simply run it as a script. If you want a pdf or a static html that shows the notebook as-is, you can nbconvert it from your makefile.
For example, if your makefile has lines like these:
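    # a minimal sketch -- exact jupytext/nbconvert options may need adjusting
    %.ipynb: %.py
    	jupytext --to notebook $<

    %.html: %.ipynb
    	jupyter nbconvert --to html --execute $<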
Then you run "make foo.html" and it will convert "foo.py" to "foo.ipynb", run all the cells, and produce a static visualization "foo.html". Since the intermediate notebook is not marked as a precious file, it is deleted automatically by make.
Notice that you can simply run "python foo.py" as well, to produce the valuable output artifacts.
In the end, jupyter becomes just an editor of python files. A fancy editor, that allows interactive execution of pieces of code, which is great.
> Yeah but jupyter notebooks suck at providing reproducible data science.
Why? I have no problem with reproducibility when I use a little bit of discipline.
Your workflow does indeed sound nice but also sounds like it involves way more tooling and institutional knowledge. Anywhere I can learn more about it or see the scripts you use?
I completely agree that a bit of discipline does wonders. But it is not just your discipline; it is the team's discipline.
What I need to ensure is that anyone picking up a piece of analysis 3 months later can reproduce exactly what was done. I've been burnt in the past by having to go back to the original analyst and be told "oh you run this bit of this notebook, then paste the results in over here, then run that". By insisting that everything is scripted and that there are no manual steps, we get a reproducible analytics pipeline.
The starting point for our methodology is the book "Guerrilla Analytics" by Enda Ridge. It's worth reading.
I agree with the OP. VS Code using the Jupyter protocol is superior to notebooks in almost every respect in my experience. It gives you an excellent debugger, lets you track changes in Git without any extra tooling, and you can also run the file as a regular Python script.
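For anyone who hasn't tried it, a plain .py file with "# %%" cell markers (a made-up example below) is all it takes: the Jupyter extension runs each cell against a kernel, yet the file diffs cleanly in Git and still runs as an ordinary script.

    # %% load the data (hypothetical file name)
    import pandas as pd
    df = pd.read_csv("data/sales.csv")

    # %% explore interactively, cell by cell, with the debugger available
    print(df.describe())

    # %% write an output artifact; the same file also works as "python analysis.py"
    df.groupby("region")["sales"].sum().to_csv("summary.csv")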
Jupyter offers nothing that Mathcad and Mathematica didn't in the 80s. We should be using open source, git-friendly file formats so we can edit them collaboratively in our editor of choice; e.g., our IDEs. We are not using it wrong; Jupyter notebooks reflect an archaic product philosophy and way of working. Kill it with fire.
> It sounds like you are using the tool wrong. Jupyter notebooks are strictly superior to anything else (namely: code only, spreadsheets, matlab/octave) at their primary use case, which is interactive data science (writing code to manipulate some data, while actively revising the code, or sharing the results of that code with others).
> Nothing even comes close. There's a reason it's dominant in the data science field.
> Your workflow works for you but the jupyter workflow works for millions of students, data scientists, and even developers. Heck I even know all the ways to avoid jupyter, and I still use it often, because it's so convenient.
Copy pasting your comment here so when you eventually delete it people can still see the ignorance.
You have absolutely no clue what you're talking about. Worse, it seems like you didn't read what you're responding to.