I also base this on my own experience. I typically work with 2-3 million row datasets, and I found that certain data operations were quite slow in plyr but a lot faster in data.table. It's possible that if I had spent time reordering my plyr pipelines and filtering out unneeded columns or rows, they would have performed better. However, data.table doesn't require that kind of planning ahead about which columns/rows to send to the next operation in a pipeline, because multiple operations can be executed in a single data.table call, and the underlying C library can make optimized decisions (like dropping columns not requested in the query), similar to an in-memory SQL database. So between dealing with slow code during interactive analysis and having to spend time hand-optimizing dplyr pipelines, I found data.table to be a significant improvement in productivity (other than the one-time effort of rewriting a few internal packages/scripts to use data.table instead of dplyr).
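As a rough sketch (hypothetical columns and sizes, not my actual data), here's the same filter/group/summarise done as a dplyr pipeline versus a single data.table call; in the data.table version only the columns referenced in j/by ever get touched:

    library(data.table)
    library(dplyr)

    set.seed(1)
    dt <- data.table(
      id    = sample(1e3, 2e6, replace = TRUE),
      group = sample(letters, 2e6, replace = TRUE),
      value = rnorm(2e6),
      extra = runif(2e6)      # a column the query never needs
    )
    df <- as.data.frame(dt)

    # dplyr: each verb is a separate step; dropping unneeded rows/columns
    # early is up to you
    res_dplyr <- df %>%
      filter(group == "a") %>%
      select(id, value) %>%
      group_by(id) %>%
      summarise(mean_value = mean(value))

    # data.table: i (filter), j (compute), by (group) in one call; columns
    # not referenced in the query (like 'extra') are simply ignored
    res_dt <- dt[group == "a", .(mean_value = mean(value)), by = id]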
Thanks for the reference. Why don't you keep your data in a DB? I load almost anything that isn't a small atomic data frame into an RDBMS.
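For concreteness, a minimal sketch of that workflow with DBI + RSQLite (dbplyr translates the dplyr verbs to SQL; the file and table names here are made up):

    library(DBI)
    library(dplyr)

    con <- dbConnect(RSQLite::SQLite(), "analysis.sqlite")   # hypothetical file

    # write once, then query the DB instead of holding everything in memory
    dat <- data.frame(id = rep(1:3, each = 4),
                      group = rep(c("a", "b"), 6),
                      value = rnorm(12))
    dbWriteTable(con, "measurements", dat, overwrite = TRUE)

    measurements <- tbl(con, "measurements")    # lazy reference, nothing pulled yet
    res <- measurements %>%
      filter(group == "a") %>%
      group_by(id) %>%
      summarise(mean_value = mean(value, na.rm = TRUE)) %>%
      collect()                                 # only the result comes back to R

    dbDisconnect(con)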
BTW, one thing that always made me avoid DT (I even preferred sqldf before dplyr was created) was its (IMHO) weird syntax. I always found the syntax of (d)plyr much more convenient. ATM it seems to me that dplyr has won the contest of alternative data-management libraries. I can't remember when I last read a blog post, article, or book that preferred DT over dplyr. I'm old enough to have learned that, wrt libraries, it's wise to follow the crowd.
About that article: I assume that DT uses an index for that column while dplyr does a full scan. If that's really the case, the result wouldn't be much of a surprise.
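Roughly what that difference would look like, with made-up data: setkey() physically sorts the table and marks the column as the key, so a keyed lookup is a binary search, while the plain data frame version scans the whole column:

    library(data.table)
    library(dplyr)

    set.seed(1)
    n  <- 5e6
    dt <- data.table(id = sample(1e6, n, replace = TRUE), value = rnorm(n))
    df <- as.data.frame(dt)

    setkey(dt, id)    # sort by id and mark it as the key (the "index")

    system.time(dt[.(12345L)])              # keyed subset: binary search on id
    system.time(filter(df, id == 12345L))   # full scan of the id column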