A Roadmap for Rich Scientific Data Structures in Python

aklein · on July 21, 2011

Scientific computing is dominated by low-level or system languages like Fortran, C, and C++; and by domain-specific languages like Matlab and R.

There is no good high-level, GENERAL computing language that can do scientific computing well. This leads to a huge gap between "research" and "production" implementations.

Python stands to shine in this space if things are done right. pandas is a gem of a library.

epistasis · on July 21, 2011

It's great that somebody is looking to R for inspiration. I've tried to like NumPy and SciPy on a couple of occasions, but find them lacking. That said, there are a ton of ways to improve R, mostly to do with cleaning up and homogenizing style.

I do hope that they avoid the BioConductor style of opaque objects and storing data in hidden attributes.

monk_the_dog · on July 21, 2011

I was happy to see "Hierarchical columns" on the features wish list. My app uses an R data-frame-like structure (a table of heterogeneous columns with possible missing data). One "cute" thing I implemented was hierarchical columns. Super useful for what I'm doing, but hard to expose to other languages. The plan is to flatten the columns to 'parent.child' strings when using Python.

I'm not quite at the point where I'll be wrapping this into python. When I do I'll take a close look at pandas. Any other recommendations? I took a quick look at pytables, and pandas looks better for my app.
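
A minimal sketch of the 'parent.child' flattening described above, assuming the hierarchical columns are held as a nested dict. The function name, the separator argument, and the sample table are hypothetical and only illustrate the idea; they are not taken from the app or from pandas.

```python
def flatten_columns(columns, sep="."):
    """Flatten a nested mapping of columns into 'parent.child' keys."""
    flat = {}
    for name, value in columns.items():
        if isinstance(value, dict):
            # Recurse into the child columns and prefix each with the parent name.
            for child_name, child_value in flatten_columns(value, sep).items():
                flat[f"{name}{sep}{child_name}"] = child_value
        else:
            # A leaf column: keep its data as-is (None marks a missing value).
            flat[name] = value
    return flat


# Hypothetical table: one plain column and one hierarchical column.
table = {
    "id": [1, 2, 3],
    "price": {"open": [10.0, None, 12.5], "close": [10.5, 11.0, None]},
}
flat = flatten_columns(table)
print(list(flat))
```

One caveat with this approach: a column name that already contains the separator becomes ambiguous after flattening, so the separator would need to be chosen with the real column names in mind.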

wesm · on July 21, 2011

To be honest I haven't really broached the hierarchical columns issue inside pandas. If someone can look and suggest an implementation strategy that doesn't interfere with the rest of the API I would be all for it.

If you have heterogeneous columns with possibly missing data, pandas is basically the only game in town (did you see my diagram?? :) ). It's possible to get a NumPy MaskedArray with structured dtype to function like you want, but it's relatively tricky to do.

changhiskhan · on July 21, 2011

I've been using the pandas library for a long time for financial applications. The data alignment and missing data handling features in pandas are far above and beyond anything else I've used in similar applications. I think from a data structure point of view it's already better than R (and FAR better than Matlab). R/Matlab should fear the day that the pandas statistical packages gain roughly equivalent features.

rcthompson · on July 21, 2011

Data frames (and the associated functionality for reading and writing CSV files) are one of the main features that make me use R over Python. Whenever I need to manipulate, merge, slice, and dice table-like data, I turn to R. I would love to have something feature-equivalent in Python. Although maybe rpy2 and rnumpy are sufficient for now.

wesm · on July 21, 2011

It is my project, but I think pandas.DataFrame is already at 90+% feature-equivalency (especially if you're using the current git version... new release forthcoming). You don't have the integration with a million CRAN libraries. On the other hand, pandas.DataFrame actually does a lot more for you in many places than data.frame does; for example, data alignment is deeply intrinsic, whereas it's very much a DIY affair in R.
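
A small sketch of what intrinsic data alignment means in practice. It uses the present-day pandas API, which may differ in detail from the 2011 release discussed here, and the labels and values are made up for illustration.

```python
import pandas as pd

# Two series whose labels only partly overlap.
left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0], index=["b", "c"])

# Arithmetic matches values by label, not by position.
# Labels present on only one side ("a" here) come out as missing (NaN).
total = left + right
print(total)
```

Doing the same by hand would mean computing the union of the labels, reindexing both inputs, and filling the gaps before adding, which is the DIY work the comment refers to.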

rcthompson · on Aug 3, 2011

Yes, I do intend to try out pandas at some point.

ameasure · on July 22, 2011

Wow, fantastic work and what an important project. I can't believe I haven't used these libraries before.

rch · on July 21, 2011

Without looking too closely, would h5/pytables suffice?

wesm · on July 21, 2011

HDF5/PyTables is fantastic as a binary IO format. This is really all about in-memory data manipulation / computation and how data and metadata get passed around to functions that can take advantage of it.

pnathan · on July 21, 2011

WebSense blocks this at work.

Is there a nice mirror somewhere? :-/