I work in pandas 95% of my day, doing data automation tasks, manipulating SQL queries, and data-munging for machine learning. It is literally life-changing for someone like me.
I used to program solely in R, but after discovering pandas I really have no need to go back to R. My project workflow consists of several IPython notebooks+pandas+sklearn.
It works extremely well in production too, as in, on a Flask web server.
For the particular tasks of "data automation tasks, manipulating SQL queries, and data-munging for machine learning," Python is indeed better than R, especially with sklearn.
For other applications (especially charting with ggplot2 and data manipulation with dplyr), R has an edge.
I've used that library, and I really don't like it at all. They tried to bring the R syntax to Python, which ends up looking awful and misses the point of the Grammar of Graphics. In the same way that every language has its own way of expressing control flow, every language should have its own way to express the Grammar of Graphics. We don't need R's ggplot2 in Python; we need a Pythonic way to express the Grammar of Graphics.
If I had stronger python-fu I would love to build "GGPy".
Maybe. The plots look a lot like ggplot2 plots, and the syntax looks like Python, but I haven't dug into it to see if it builds plots using the Grammar of Graphics under the hood.
We've shifted to plot.ly for visualization and moved away from expensive BI tools. We set up Python/pandas scripts on cron to output real-time data to our local plot.ly web server, which makes a local copy of the data and updates the chart. You simply embed the chart as an iframe wherever you want internally, and BOOM, you've got a real-time chart (beautiful, I might add).
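A rough sketch of that kind of cron job (the query, connection string, and chart filename here are invented, and the old plotly.plotly cloud API is assumed):

    # refresh_chart.py -- run from cron, e.g.: */5 * * * * python refresh_chart.py
    import pandas as pd
    import plotly.plotly as py
    import plotly.graph_objs as go
    from sqlalchemy import create_engine

    # Hypothetical connection string and query.
    engine = create_engine("postgresql://user:pass@localhost/ops")
    df = pd.read_sql("SELECT ts, value FROM metrics ORDER BY ts", engine)

    fig = go.Figure(data=[go.Scatter(x=df["ts"], y=df["value"], mode="lines")])

    # Re-plotting under the same filename updates the hosted chart in place,
    # so the embedded iframe URL stays stable.
    py.plot(fig, filename="realtime-metric", auto_open=False)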
For those cases where I have to dip into R for specific functionality, I'll use rpy2. Pandas has great support for translating pandas data frames to R data frames!
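For example, a minimal sketch using rpy2's pandas2ri converter (API details vary across rpy2 versions):

    import pandas as pd
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri

    pandas2ri.activate()  # register automatic pandas <-> R data.frame conversion

    df = pd.DataFrame({"x": [1, 2, 3], "y": [2.0, 4.1, 6.2]})

    ro.globalenv["df"] = df  # hands the pandas frame to R as a data.frame
    print(ro.r("summary(lm(y ~ x, data = df))"))  # fit an R linear model on it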
It seems quite quick at working on large datasets; I've found the documentation lacking, however.
There are no links in the docs to the types being referenced; some types have no documentation; some documentation is just the function header with no other info, i.e. no documentation at all; and functions that take string formatting info, e.g. '5min', don't have their argument possibilities documented anywhere I can find.
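For context, strings like '5min' are pandas offset aliases, the frequency strings accepted by resample, date_range, and friends; a quick illustration using the current resample API:

    import numpy as np
    import pandas as pd

    idx = pd.date_range("2014-01-01", periods=60, freq="min")  # one point per minute
    s = pd.Series(np.arange(60), index=idx)

    # '5min' is an offset alias: downsample into 5-minute bins.
    print(s.resample("5min").mean())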
Documentation is mostly organized around trying to explain how to use some specific feature, which is usually not the best format for me, but it may be for others.
The argument possibilities have always been an issue for me. In general, I have found that if you have non-homogeneous data, pandas is your best bet due to how general it is, even if it is sometimes frustrating when it forces a generalization on your data (e.g., try doing type conversions on numpy.datetime64; it lacks any sort of intuition).
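A small illustration of the datetime64 pain in plain numpy:

    import numpy as np
    import pandas as pd

    ts = np.datetime64("2014-10-18T12:30")  # unit is inferred as minutes here

    # Conversions depend on the (often invisible) unit:
    print(ts.astype("datetime64[D]"))  # silently truncates to the day
    print(ts.astype("int64"))          # integer *minutes* since the epoch

    # pandas wraps this in something friendlier:
    print(pd.Timestamp(ts))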
The distinction between a Series and a DataFrame is something I've always found pretty silly/frustrating, and I wonder if it was the result of an early implementation issue rather than a logical simplification.
Maybe I'm particularly ignorant of some issues, since my use case is perhaps more straightforward than others', but I've built an entire labelled-data library for my team, and it is easier for us to operate under primitive beliefs similar to the numpy ndarray's: it's always an ndarray; adding a column (or dimension) does not change the type of the object and its associated methods; and indexing one particular way vs. another (df['a'] vs df[['a']]) does not change the type of your object.
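For reference, the indexing asymmetry in question:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

    print(type(df["a"]))    # pandas.core.series.Series    (drops to 1-D)
    print(type(df[["a"]]))  # pandas.core.frame.DataFrame  (stays 2-D)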
If I'm missing the point of Series, I would love to see them justified or a use case referenced.
In my eyes, you both make a valid point: yes, there's plenty of documentation, but sometimes I just can't find what I'm looking for. I like how the book[1] is structured; it really helped me, but it isn't complete.
Don't get me wrong, I'm grateful for all the work, and I know I haven't contributed much, but I think the online docs could be improved with more examples and recipes.
Edit: There really is no excuse; getting started is easy[2].
Correct me if I'm wrong, but perhaps you had the same issue as me. The documentation is plentiful, with lots of good examples, and the book, similarly, ramps up nicely in complexity from basic "how do I select a (cell | row | column) ..." to full-blown how do I do time series analysis on a data series pulled in from a remote source.
The issue I had was not the documentation, but that the language of pandas mirrors the language used in R (I think this is something Wes McKinney did intentionally), and it's the burden of all that new verbiage that makes the documentation harder to sift through. Some choice examples: "melt", "stack/unstack" and "reindex". Necessary, I grant you, so that functions can be aptly named and in turn encapsulate vectorised procedures that are composable.
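For anyone new to that vocabulary, a minimal illustration of melt, stack/unstack, and reindex:

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": [30, 40]})

    # melt: reshape wide -> long
    print(pd.melt(df, id_vars="id"))

    # stack/unstack: pivot a column level into the row index and back
    stacked = df.set_index("id").stack()
    print(stacked.unstack())

    # reindex: conform to a new index, filling missing labels with NaN
    print(df.set_index("id").reindex([1, 2, 3]))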
I found the documentation harder to search because I lacked the domain language, and the documentation, for better or worse, doesn't dawdle with educating the reader about the verbiage; worked examples often provide an easier route. It reads like a mathematical proof rather than prose. I used to think the documentation was too terse, but now I appreciate that it's probably just succinct.
I find it excellent in production, and it's one of the backbones of Python as a 'data science' language. Being able to leverage dataframes in the same environment where you build the webserver that serves the results is a really powerful thing. There have been times when the documentation has been difficult (mostly for operations that are already difficult to search for, though).
Use the Anaconda Python distribution (https://store.continuum.io/cshop/anaconda/). It comes bundled with pandas, numpy, scipy (and much more), does not require admin privileges, and updates and installations do not require a working compiler (binary packages for your platform are downloaded).
Sometimes, even when installing to a local virtual environment, if there are dependencies that need, for example, a Fortran compiler, you'll need to install that on the system.
Categorical is awesome, and equivalent to R's factor() iirc. This should make plots easier in terms of automatically deciding whether to facet something, showing legends nicely, etc.
The memory usage feature is super neat.
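A minimal sketch of both features on a low-cardinality string column:

    import pandas as pd

    df = pd.DataFrame({"grade": ["low", "high", "low", "medium"] * 1000})

    # Categorical stores each distinct string once plus small integer codes
    # per row, much like R's factor().
    df["grade_cat"] = df["grade"].astype("category")

    # memory_usage(deep=True) counts the actual string payloads too.
    print(df.memory_usage(deep=True))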
Also, for those of us stuck with Stata, to_stata() and read_stata() just got much better.
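A simple round trip looks like this:

    import pandas as pd

    df = pd.DataFrame({"x": [1.0, 2.0], "g": ["a", "b"]})
    df.to_stata("out.dta", write_index=False)  # write a Stata .dta file
    print(pd.read_stata("out.dta"))            # and read it back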
I'm eagerly awaiting a numpy-native NA value instead of np.NaN.
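The underlying pain, for anyone who hasn't hit it: np.NaN is a float, so marking a value as missing silently upcasts an integer column:

    import numpy as np
    import pandas as pd

    s = pd.Series([1, 2, 3])
    print(s.dtype)   # int64

    s[1] = np.nan    # np.nan is a float, so the whole column upcasts
    print(s.dtype)   # float64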