
I have never understood the appeal of this. You can generate good-looking presentations, but that is all.

Is any real science done with this, or is it the PowerPoint for PyCon talks?


I was extremely stubborn when I started out in Python. Built a script for everything. Jupyter is messy. But once I started using it, I never went back for data analysis tasks.

Say you have a large file you want to read into memory. That's step 1; it takes a long time to parse that big JSON file they sent you. Then you want to check something about that data, maybe sum one of the columns. That's step 2. Then you realize you want to average another column. Step 3.

If you write a basic Python script, you have to run steps 1 and 2 sequentially, then once you realize you want step 3 as well, you need to run 1, 2 and 3 sequentially. It quickly becomes much more convenient to have the file in memory.
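Roughly, as notebook cells (using pandas here; the file name and column are just placeholders):

  # Cell 1: the slow part -- run once, the parsed frame stays in kernel memory
  import pandas as pd
  df = pd.read_json("data.json")

  # Cell 2: first question
  df["amount"].sum()

  # Cell 3: added later, answered without re-parsing the file
  df["amount"].mean()

In a script you pay the parse cost again for every new question; in the notebook, cells 2 and 3 just reuse the DataFrame already sitting in memory.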


I like to imagine it's like a very advanced REPL that's somewhat reproducible (if you run everything from the beginning). If you don't see the appeal of being able to mutate state live for experimentation, then it isn't for you.


I didn't "get" Jupyter the first time I used it. A year later it clicked. A notebook keeps state while you write it. This is different from IDEs, where programs lose state while you are writing code. Now I use it all the time, next to an open IDE, as a playground to quickly test ideas and algorithms.


I also use it in tandem with my IDE day to day as a data engineer, for basically the same reasons. I really like being able to interactively explore and transform a dataframe while developing new pipelines, or debugging existing ones (all of which are implemented as modules in a Python package, not as Notebooks).


This is critically important when one step of your analysis takes 10+ minutes to run. You want to be able to explore the output and not worry about rerunning all the calculations like a file/IDE based tool would.

Reactive notebooks are nice, but I've accidentally rerun slow SQL queries because I updated an upstream cell, and that's painful (using Pluto, ObservableHQ, and HexComputing).

In practice I never see or create notebooks that don't run when you push the run-all button; it's a well-understood and easily avoidable issue. It's probably a local optimum, but I'm happy with them.


IDEs can 100% do this, too. The art of connecting to a running program using the debugger is just something folks stopped caring about.

This has led a lot of programming environments to a place where batch loading of the code is basically required. But "image-based" workflows are a very old concept and work great with practice. Some older languages were built around this idea (Smalltalk being the main one that pushed in this direction; Common Lisp also has good support for interacting with the running system).

It is a shame, as many folks assume everything has to be REPL-driven, when that is only a part of what made image-based workflows work.


Curious what other approach you would take to do exploratory data analysis? It's so natural to me I can't think of another way that would be practical to achieve the same workflow.


Handcrafted machine code on punched cards.

An interactive environment without compile nonsense is just too new for folks.


Emacs Org mode can do this, but it is not tied to just Python. Anyway, something like this works:

  #+BEGIN_SRC python :results file
  import matplotlib
  matplotlib.use('Agg')
  import matplotlib.pyplot as plt
  fn = 'my_fig.png'
  plt.plot([1, 2, 3, 2.5, 2.8])
  plt.savefig('my_fig.png', dpi=50)
  return fn
  #+END_SRC

  #+RESULTS:
  [[file:my_fig.png]]


In a true notebook you would maybe want to do the following:

  import matplotlib
  matplotlib.use('Agg')
  import matplotlib.pyplot as plt
  plt.plot([1, 2, 3, 2.5, 2.8])

  Alright, saving the figure at 50 dpi first
  plt.savefig('my_fig.png', dpi=50)

  Trying a bit more DPI to see if that makes a difference
  plt.savefig('my_fig2.png', dpi=150)

  Oh, wrong numbers, forgot that the fourth datapoint was going to signify 100, going back to 50 dpi as well
  plt.plot([1, 2, 3, 100, 2.3])
  plt.savefig('my_fig4.png', dpi=50)

It seems like your example misses the interactivity.


We have a lot of scientists using RStudio. It's not quite the same, but you can do it. It lets you view your data frames like a spreadsheet and generate graphs. It's R, and I get that Jupyter supports R, but it always has some issue with some dependency.


Ew.

R.

No thank you.


I used to think like that. Programmers hate R. But I took a biostatistics class and it really is the best tool for that job. Plus the graphics output can't be beat (ggplot2), and easy-to-install packages make it quite a valuable tool.


> it really is the best tool for that job.

Besides the ecosystem, what makes R better than Python or Julia for biostats?


Can't speak to Julia.

The statistics built in are great. They're just there, with less need to find a package (general stats, t-test, chi-squared test...). We tend to use the "tidyverse" packages [1] https://r4ds.hadley.nz/. Biopython is amazing for manipulating biodata, but once the data is extracted and you need statistics, our scientists seem to use R. I really don't love R's syntax, but I get why they use it. I use Python all the time for data wrangling (right now I'm pulling sequences from a FASTA file to inject into a table).
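For comparison, the same tests on the Python side go through scipy rather than the language itself (the sample numbers below are made up):

  from scipy import stats

  control = [5.1, 4.8, 5.4, 5.0, 4.9]
  treated = [5.6, 5.9, 5.3, 6.1, 5.8]

  # Two-sample t-test
  t_stat, p_value = stats.ttest_ind(control, treated)
  print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

  # Chi-squared test on a 2x2 contingency table
  chi2, p, dof, expected = stats.chi2_contingency([[10, 20], [30, 25]])
  print(f"chi-squared = {chi2:.2f}, p = {p:.4f}")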

RStudio is like an IDE for your data. You can view the data tables, graph different things, etc. If you try the first chapter of the R for Data Science book, you can see how quickly you can get up and running, graphing and analyzing: https://r4ds.hadley.nz/data-visualize.html

Though at this point both Python and R are necessary, depending on what package/algorithm you want to use.

There are some good packages for single cell analysis: We use "Seurat".

https://satijalab.org/seurat/articles/get_started_v5.html

Jupyter supports R now with an add-in, so it's less of an issue.


Yes, tons of science is done with it. I have been a co-author on two studies where the ML and DL models were in notebooks. Saying that all you can generate is good presentations is wrong, and I don't understand what compels you to make these sweeping claims when, it seems, you aren't in the target group.


Notebooks are chiefly used for scientific exploration and experiments. The “literate programming” environment provides convenient artifacts for distilling research or analytics.

Nowadays they can even be used for running models/analytics in prod with tools like Sagemaker (though I’m not advocating that they should).

Maybe you’re mistaking Jupyter for a different tool like quarto or nbconvert but your dismissive comment misses the mark by miles.


Not sure about "real science", but it's very convenient for our students. We usually set up a notebook per group for ML-related group projects on our GPU server, and also set up notebooks for thesis work, etc.

Advantages: no setup on the students' side (plus they get reliable compute remotely), and we can prepare notebooks highlighting certain concepts. Text cells are useful for explaining stuff, so they can work through some notebooks by themselves. Students can also easily share notebooks with us if they have any questions/issues.

I also use notebooks for data exploration, training initial test models etc. etc. Very useful. I'd say >50% of my ML related work gets done in notebooks.


I'm a "real scientist". Notebooks are widely used to run analyses in my field (bioinformatics), where you explore data interactively.

I personally prefer when people share code as notebooks because you have code alongside the results. It’s really a good practice to use Jupyter.


I have found these notebooks very useful in 2 ways besides presentations: as a final exploratory data analysis front end that loads data from a larger modeling and data reduction system, and as a playground to mature workflows into utilities or modules that will later be integrated into a back end reduction or analysis system.

The models run on a small cluster and/or a supercomputer, and the data reductions of these model runs are done in python code that dumps files of metrics (kind of a GBs -> MBs reduction process). The notebook is at the very tail end of the pipeline, allowing me to make ad hoc graphics to interpret the results.


I performed all the data preparation, computation, and image generation for an interactive data visualization website in Jupyter:

https://income-inequality.info/

All the processing is documented with Jupyter notebooks, allowing anyone to spot mistakes, or replicate the visualizations with newer data in the future:

https://github.com/whyboris/Global-Income-Distribution


I use it all the time for software development. E.g. when I write DSP code for audio, it acts as a mixture of documentation and the actual math, with graphs to visualize what I do.

That is why JupyterLab is not the wrong name; it is a bit like a lab. Not meant for production use, but very good for exploring solutions.


So far, Jupyter has been the tool that gives me the best chance of coming back a week or a year later and figuring out what I did. Also, doing "restart kernel and run all cells" before going home for the day is a great reassurance that something is likely to be reproducible.
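If you want that same check headlessly, something like nbclient (a separate package; the notebook name here is made up) can run the whole thing top to bottom and fail loudly if any cell errors:

  import nbformat
  from nbclient import NotebookClient

  # Execute every cell in order, as a fresh kernel would
  nb = nbformat.read("analysis.ipynb", as_version=4)
  NotebookClient(nb).execute()  # raises CellExecutionError if a cell fails
  nbformat.write(nb, "analysis-executed.ipynb")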


A heck of a lot of science gets done with this. Something like it is basically mandatory for interactive analysis of datasets large enough that they take a decent amount of time to load into memory and process, and Jupyter is the best and most common option (you can kind of bodge it with the vanilla Python REPL, and there are other options with a similar-ish workflow).


If it were the same thing but in Lisp, with horribly mapped keys and used by nobody, you guys would be all over it.


Good for developing ideas: you can add small code fragments gradually and see results immediately. And if it gets big enough, chances are you have a good idea that makes it worth the time to refactor your notebook into production code.


I refactor my code into functions as I go.

Then I can easily put them into a Python file and import them from the notebook.

Easy peasy and very nice for iterative development.
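A sketch of that split (cleaning.py and the column name are made up); the autoreload magics let the notebook pick up edits to the file without restarting the kernel:

  # cleaning.py -- functions that grew out of notebook cells
  import pandas as pd

  def drop_bad_rows(df: pd.DataFrame) -> pd.DataFrame:
      """Remove rows with a missing 'amount' value."""
      return df.dropna(subset=["amount"])

  # --- in the notebook ---
  %load_ext autoreload
  %autoreload 2

  import cleaning
  df = cleaning.drop_bad_rows(df)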


Just started down this path; it's such a nice workflow. I find that my notebook ends up being a great overview of my codebase without going into the details of every function.


Like everything else in the Python ecosystem, it's half-baked and not composable.

People use it for two reasons: a) because they need to get those graphs on the screen and this is the only way, and b) to run ML code on a remote, beefier server.


Neither of those use cases is exclusive to Jupyter. You can run scripts on remote machines quite easily, and matplotlib will happily pop up a window for your charts.

The real reason is that it's a much better workflow for data exploration and manipulation: you don't always know exactly what code to write before you do it, so having the data in memory is really useful.


X forwarding through a terminal session to view that matplotlib plot is a bit more work than most want to deal with. Sure, you can use ranger and set up image previews with ueberzug or something, and set up kitty with icat, but that doesn't work with your tmux, so you have to have a separate ssh window that's not tmux'd, which is annoying and clunky, just for viewing images. You also have to save the images and then switch to view them, which is incredibly clunky.

Or you just use JupyterLab and the problem is fixed.
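The figure just renders inline in the browser, e.g.:

  # In a notebook served from the remote machine -- no X forwarding,
  # saved files, or terminal image viewers needed.
  %matplotlib inline
  import matplotlib.pyplot as plt
  plt.plot([1, 2, 3, 2.5, 2.8])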


kitty icat works with tmux as of kitty 0.28.0, just FYI.


You are underestimating how useful the combination of elements is for exploratory tasks: markdown/code cells + a runtime kernel that keeps state + persistent results, all usable from your browser.

Jupyter notebooks are neither the first nor the only implementation of such a literate approach.

If some code is stable enough for reuse, you can make it as composable as any other code: put it into a module, create a CLI/web API, etc., whatever is most appropriate in your case.


> People use it for two reasons: a) because they need to get those graphs on the screen and this is the only way, and b) to run ML code on a remote, beefier server.

Do you have a source for this, or is it something you dreamed up? Weird claim, as neither of those is my use case.


Well, it is the closest many folks will get to what it meant to use a Lisp Machine or the Smalltalk development experience.


What would a "composable" experience look like?


You would have to formalize the inputs and outputs of your notebook, perhaps with a preamble of imports and so on. Then other notebooks could use yours, kind of like importing (perhaps exactly by importing?).

As it is now, you typically wind up "programizing" your notebook once it does what it should, so you can run it in batch and so on.
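One common way to do that last step without a full rewrite is to parameterize the notebook and run it headlessly with something like papermill (a separate tool; the file names and parameter below are made up):

  import papermill as pm

  # Run the notebook in batch, injecting values into its "parameters" cell
  pm.execute_notebook(
      "exploration.ipynb",
      "exploration-run.ipynb",
      parameters={"n_samples": 10_000},
  )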


Co-locating code and outputs is handy.

