A Roadmap for Rich Scientific Data Structures in Python (wesmckinney.com)
70 points by wesm on July 21, 2011 | 12 comments



Scientific computing is dominated by low-level or system languages like Fortran, C, and C++; and by domain-specific languages like Matlab and R.

There is no good high-level, GENERAL computing language that can do scientific computing well. This leads to a huge gap between "research" and "production" implementations.

Python stands to shine in this space if things are done right. pandas is a gem of a library.


It's great that somebody is looking to R for inspiration. I've tried to like NumPy and SciPy on a couple of occasions, but find them lacking. That said, there are a ton of ways to improve R, mostly to do with cleaning up and homogenizing style.

I do hope that they avoid the BioConductor style of opaque objects and storing data in hidden attributes.


I was happy to see "Hierarchical columns" on the features wish list. My app uses an R data-frame-like structure (a table of heterogeneous columns with possibly missing data). One "cute" thing I implemented was hierarchical columns. Super useful for what I'm doing, but hard to expose to other languages. The plan is to flatten the columns to 'parent.child' strings when using Python.
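A minimal sketch of the 'parent.child' flattening idea, assuming the hierarchical columns arrive in Python as nested dicts of lists. The helper name `flatten_columns`, the `sep` parameter, and the sample data are hypothetical and only illustrate the naming scheme; they are not the app's actual representation.

```python
def flatten_columns(columns, sep="."):
    """Flatten nested column dicts into a single level of 'parent.child' keys.

    `columns` maps a column name either to a list of values (a leaf column)
    or to another dict of columns (a hierarchical parent). Hypothetical
    layout, chosen only for illustration.
    """
    flat = {}
    for name, value in columns.items():
        if isinstance(value, dict):
            # Recurse into the parent and prefix every child with its name.
            for child_name, child_values in flatten_columns(value, sep).items():
                flat[f"{name}{sep}{child_name}"] = child_values
        else:
            flat[name] = value
    return flat


# Made-up sample data; None stands in for a missing value.
nested = {
    "id": [1, 2, 3],
    "position": {"x": [0.0, 1.5, None], "y": [2.0, None, 4.0]},
}
flat = flatten_columns(nested)
```

One caveat with this scheme: a child name that already contains the separator becomes ambiguous after flattening, so the separator may need to be something the column names never use.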

I'm not quite at the point where I'll be wrapping this into python. When I do I'll take a close look at pandas. Any other recommendations? I took a quick look at pytables, and pandas looks better for my app.


To be honest I haven't really broached the hierarchical columns issue inside pandas. If someone can look and suggest an implementation strategy that doesn't interfere with the rest of the API I would be all for it.

If you have heterogeneous columns with possibly missing data basically pandas is the only game in town (did you see my diagram?? :) ). It's possible to get a numpy MaskedArray with structured dtype to function like you want but it's relatively tricky to do.
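For readers curious what the MaskedArray route looks like, here is a rough sketch, assuming a current NumPy. The field names and values are made up for illustration, and this is only one way to set it up, not a recommendation from the comment above.

```python
import numpy as np
import numpy.ma as ma

# Heterogeneous "columns" as fields of a structured dtype (hypothetical names).
dtype = [("name", "U10"), ("price", "f8"), ("volume", "i8")]

# One mask flag per field per row; True marks a missing entry.
table = ma.array(
    [("AAPL", 10.5, 100), ("GOOG", 0.0, 200), ("MSFT", 25.0, 0)],
    mask=[(False, False, False), (False, True, False), (False, False, True)],
    dtype=dtype,
)

# Selecting a field gives a masked column; reductions skip masked entries.
mean_price = table["price"].mean()
total_volume = table["volume"].sum()
```

The tricky part is everything beyond this: there are no row labels, no alignment between tables, and the mask has to be carried along by hand through most operations.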


I've been using the pandas library for a long time for financial applications. The data alignment and missing data handling features in pandas are far above and beyond anything else I've used in similar applications. I think from a data structure point of view it's already better than R (and FAR better than Matlab). R/Matlab should fear the day that the pandas statistical packages gain roughly equivalent features.


Data frames (and the associated functionality for reading and writing CSV files) are one of the main features that make me use R over Python. Whenever I need to manipulate, merge, slice, and dice table-like data, I turn to R. I would love to have something feature-equivalent in Python. Although maybe rpy2 and rnumpy are sufficient for now.


It is my project, but I think pandas.DataFrame is already at 90+% feature-equivalency (especially if you're using the current git version...new release forthcoming). You don't have the integration with a million CRAN libraries. pandas.DataFrame actually does a lot more for you in many places than data.frame does; for example, data alignment is deeply intrinsic, whereas it's very much a DIY affair in R.
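To make "data alignment is deeply intrinsic" concrete, here is a small sketch using the current pandas API (which may differ in detail from the 2011 version discussed here). The labels and values are made up.

```python
import pandas as pd

# Two series with overlapping but different labels, in different orders.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Arithmetic matches values by label, not by position. Labels present in
# only one operand come out as NaN (missing) in the result.
total = a + b

# Treat a label missing from one side as 0.0 instead of propagating NaN.
total_filled = a.add(b, fill_value=0.0)
```

Here `total["y"]` is 12.0 and `total["x"]` is NaN, with no explicit merge or reindex step needed beforehand.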


Yes, I do intend to try out pandas at some point.


Wow, fantastic work and what an important project. I can't believe I haven't used these libraries before.


Without looking too closely, would h5/pytables suffice?


HDF5/PyTables is fantastic as a binary IO format. This is really all about in-memory data manipulation / computation and how data and metadata get passed around to functions that can take advantage of it.


WebSense blocks this at work.

Is there a nice mirror somewhere? :-/



