I don't mean to disparage pandas, which is a library that does a lot of things fairly well. But as an API for data manipulation I find it very verbose and it doesn't mesh with a "functional" way of thinking about applying transformations.

Generally, I've even preferred Spark to pandas, though it's hardly less verbose. Coming from R, pandas is much slower than data.table and nowhere near as slick and discoverable as dplyr. Its system of indices is a pain I'd rather not deal with at all (and, indeed, I can't think of another data frame library that relies on them). I hate finding CSVs that other data scientists have created from pandas, because they invariably include the index ...
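
For what it's worth, the fix on the writing side is a single keyword; a minimal sketch, with a made-up frame and file name:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})
    df.to_csv("out.csv", index=False)  # no index column in the output file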

Handles time series really well, though.

Recently I've been using polars (https://github.com/pola-rs/polars). As an API I much, much prefer it to pandas, and it's a lot faster. Comes at the cost of not using numpy under the hood, so you can't just toss a polars data frame into a sklearn model.
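
The workaround is an explicit conversion via .to_numpy(); a rough sketch, with toy data, using sklearn just as an example consumer:

    import polars as pl
    from sklearn.linear_model import LogisticRegression

    df = pl.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [0, 0, 1, 1]})
    X = df.select("x").to_numpy()  # materialize a numpy array for sklearn
    y = df["y"].to_numpy()
    model = LogisticRegression().fit(X, y)
    print(model.predict(X))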



Agreed on your major points.

That being said:

> I hate finding CSVs that other data scientists have created from pandas, because they invariably include the index ...

This is also the default in R, which writes out row numbers (as if I have ever needed them). To be fair, it's gotten better since people stopped putting important information in rownames.

Polars looks interesting, thanks for the recommendation!


> I hate finding CSVs that other data scientists

Ideally you should be using the Parquet format, which is binary and preserves column types and indexes: df.to_parquet(<file>) to write, df = pd.read_parquet(<file>) to read.

You can sidestep a lot of problems by simply avoiding text files.
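
A minimal round trip, assuming pyarrow (or fastparquet) is installed as the Parquet engine:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]}, index=pd.Index([10, 20, 30], name="id"))
    df.to_parquet("data.parquet")              # binary, typed, keeps the index
    restored = pd.read_parquet("data.parquet")
    assert restored.index.name == "id"         # index survives, unlike with CSV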



