I work in pandas 95% of my day, doing data automation tasks, manipulating SQL queries, and data-munging for machine learning. It is literally life-changing for someone like me.
I used to program solely in R, but after discovering pandas I really have no need to go back to R. My project workflow consists of several IPython notebooks+pandas+sklearn.
It works extremely well in production too, as in, on a Flask web server.
For the particular tasks of "data automation tasks, manipulating SQL queries, and data-munging for machine learning," Python is indeed better than R, especially with sklearn.
For other applications (especially charting with ggplot2 and data manipulation with dplyr), R has an edge.
I've used that library, and I really don't like it at all. They tried to bring the R syntax to Python, which ends up looking awful and misses the point of the Grammar of Graphics. In the same way that every language has its own way of expressing control flow, every language should have its own way to express the Grammar of Graphics. We don't need R's ggplot2 in Python; we need a Pythonic way to express the Grammar of Graphics.
If I had stronger python-fu I would love to build "GGPy".
Maybe. The plots look a lot like ggplot2 plots, and the syntax looks like Python, but I haven't dug into it to see if it builds plots using the Grammar of Graphics under the hood.
We've shifted to plot.ly for visualization and moved away from expensive BI tools. We set up Python/pandas scripts on cron to output real-time data to our local plot.ly web server, which makes a local copy of the data and updates the chart. You simply embed the chart as an iframe wherever you want internally, and BOOM, you've got a real-time chart (beautiful, I might add).
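A rough sketch of that kind of cron job (the query, connection string, and chart filename here are invented, and the old plotly.plotly cloud API is assumed):

    # refresh_chart.py -- run from cron, e.g.: */5 * * * * python refresh_chart.py
    import pandas as pd
    import plotly.plotly as py
    import plotly.graph_objs as go
    from sqlalchemy import create_engine

    # Hypothetical connection string and query.
    engine = create_engine("postgresql://user:pass@localhost/ops")
    df = pd.read_sql("SELECT ts, value FROM metrics ORDER BY ts", engine)

    fig = go.Figure(data=[go.Scatter(x=df["ts"], y=df["value"], mode="lines")])

    # Re-plotting under the same filename updates the hosted chart in place,
    # so the embedded iframe URL stays stable.
    py.plot(fig, filename="realtime-metric", auto_open=False)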
For those cases where I have to dip into R for specific functionality, I'll use rpy2. Pandas has great support for translating pandas data frames to R data frames!
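For example, a minimal sketch using rpy2's pandas2ri converter (API details vary across rpy2 versions):

    import pandas as pd
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri

    pandas2ri.activate()  # register automatic pandas <-> R data.frame conversion

    df = pd.DataFrame({"x": [1, 2, 3], "y": [2.0, 4.1, 6.2]})

    ro.globalenv["df"] = df  # hands the pandas frame to R as a data.frame
    print(ro.r("summary(lm(y ~ x, data = df))"))  # fit an R linear model on it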
It seems quite quick at working on large datasets; I've found the documentation lacking, however.
There are no links in the docs to the types being referenced; some types have no documentation; some documentation is just the function header with no other info, i.e. no documentation at all; and functions that take string formatting info, e.g. '5min', don't have their argument possibilities documented anywhere I can find.
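For context, strings like '5min' are pandas offset aliases, the frequency strings accepted by resample, date_range, and friends; a quick illustration using the current resample API:

    import numpy as np
    import pandas as pd

    idx = pd.date_range("2014-01-01", periods=60, freq="min")  # one point per minute
    s = pd.Series(np.arange(60), index=idx)

    # '5min' is an offset alias: downsample into 5-minute bins.
    print(s.resample("5min").mean())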
Documentation is mostly organized around trying to explain how to use some specific feature, which is usually not the best format for me, but it may be for others.
The argument possibilities have always been an issue for me. In general, I have found that if you have non-homogeneous data, pandas is your best bet due to how general it is, even if it is sometimes frustrating when it forces a generalization on your data (e.g., try doing type conversions on numpy.datetime64; it lacks any sort of intuition).
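A small illustration of the datetime64 pain in plain numpy:

    import numpy as np
    import pandas as pd

    ts = np.datetime64("2014-10-18T12:30")  # unit is inferred as minutes here

    # Conversions depend on the (often invisible) unit:
    print(ts.astype("datetime64[D]"))  # silently truncates to the day
    print(ts.astype("int64"))          # integer *minutes* since the epoch

    # pandas wraps this in something friendlier:
    print(pd.Timestamp(ts))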
The distinction between a Series and a DataFrame is something I've always found pretty silly/frustrating, and I wonder if it was the result of an early implementation issue rather than a logical simplification.
Maybe I'm particularly ignorant of some issues, since my use case is perhaps more straightforward than others', but I've built an entire labelled-data library for my team, and it is easier for us to operate under primitive beliefs similar to the numpy ndarray's: it's always an ndarray; adding a column (or dimension) does not change the type of the object and its associated methods; and indexing one particular way vs. another (df['a'] vs df[['a']]) does not change the type of your object.
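For reference, the indexing asymmetry in question:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

    print(type(df["a"]))    # pandas.core.series.Series    (drops to 1-D)
    print(type(df[["a"]]))  # pandas.core.frame.DataFrame  (stays 2-D)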
If I'm missing the point of Series, I would love to see them justified or a use case referenced.
In my eyes, you both make a valid point: yes, there's plenty of documentation, but sometimes I just can't find what I'm looking for. I like how the book[1] is structured; it really helped me, but it isn't complete.
Don't get me wrong, I'm grateful for all the work, and I know I haven't contributed much, but I think the online docs could be improved with more examples and recipes.
Edit: There really is no excuse; getting started is easy[2].
Correct me if I'm wrong, but perhaps you had the same issue as me. The documentation is plentiful, with lots of good examples, and the book, similarly, ramps up nicely in complexity from basic "how do I select a (cell | row | column) ..." to full-blown how do I do time series analysis on a data series pulled in from a remote source.
The issue I had was not the documentation, but that the language of pandas mirrors the language used in R (I think this is something Wes McKinney did intentionally), and it's the burden of all that new verbiage that makes the documentation harder to sift through. Some choice examples: "melt", "stack/unstack" and "reindex". Necessary, I grant you, so that functions can be aptly named and in turn encapsulate vectorised procedures that are composable.
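For anyone new to that vocabulary, a minimal illustration of melt, stack/unstack, and reindex:

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": [30, 40]})

    # melt: reshape wide -> long
    print(pd.melt(df, id_vars="id"))

    # stack/unstack: pivot a column level into the row index and back
    stacked = df.set_index("id").stack()
    print(stacked.unstack())

    # reindex: conform to a new index, filling missing labels with NaN
    print(df.set_index("id").reindex([1, 2, 3]))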
I found the documentation harder to search because I lacked the domain language, and the documentation, for better or worse, doesn't dawdle with educating the reader about the verbiage; worked examples often provide an easier route. It reads like a mathematical proof rather than prose. I used to think the documentation was too terse, but now I appreciate that it's probably just succinct.
I find it excellent in production, and it's one of the backbones of Python as a 'data science' language. Being able to leverage dataframes in the same environment where you build the webserver that serves the results is a really powerful thing. There have been times when the documentation has been difficult (mostly for operations that are already difficult to search for, though).
Use the Anaconda Python distribution (https://store.continuum.io/cshop/anaconda/). It comes bundled with pandas, numpy, scipy (and much more), does not require admin privileges, and updates and installations do not require a working compiler (binary packages for your platform are downloaded).
Sometimes, even when installing to a local virtual environment, if there are dependencies that need, for example, a Fortran compiler, you'll need to install that on the system.
Categorical is awesome, and equivalent to R's factor() iirc. This should make plots easier in terms of automatically deciding whether to facet something, showing legends nicely, etc.
The memory usage feature is super neat.
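A minimal sketch of both features on a low-cardinality string column:

    import pandas as pd

    df = pd.DataFrame({"grade": ["low", "high", "low", "medium"] * 1000})

    # Categorical stores each distinct string once plus small integer codes
    # per row, much like R's factor().
    df["grade_cat"] = df["grade"].astype("category")

    # memory_usage(deep=True) counts the actual string payloads too.
    print(df.memory_usage(deep=True))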
Also, for those of us stuck with Stata, to_stata() and read_stata() just got much better.
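A simple round trip looks like this:

    import pandas as pd

    df = pd.DataFrame({"x": [1.0, 2.0], "g": ["a", "b"]})
    df.to_stata("out.dta", write_index=False)  # write a Stata .dta file
    print(pd.read_stata("out.dta"))            # and read it back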
I'm eagerly awaiting a numpy-native NA value instead of np.NaN.
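The underlying pain, for anyone who hasn't hit it: np.NaN is a float, so marking a value as missing silently upcasts an integer column:

    import numpy as np
    import pandas as pd

    s = pd.Series([1, 2, 3])
    print(s.dtype)   # int64

    s[1] = np.nan    # np.nan is a float, so the whole column upcasts
    print(s.dtype)   # float64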