
Answering questions in a rapid, interactive way (while using C under the hood to be efficient enough that one can run it on millions of rows):

  # Given a dataset that looks like this…
  > head(dt, 3)
      mpg cyl disp  hp drat    wt  qsec vs am gear carb          name
  1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     Mazda RX4
  2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 Mazda RX4 Wag
  3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    Datsun 710
  
  # What's the mean hp and wt by number of carburettors?
  > dt[, list(mean(hp), mean(wt)), by=carb]
     carb    V1     V2
  1:    4 187.0 3.8974
  2:    1  86.0 2.4900
  3:    2 117.2 2.8628
  4:    3 180.0 3.8600
  5:    6 175.0 2.7700
  6:    8 335.0 3.5700
  
  # How many Mercs are there and what's their median hp?
  
  > dt[grepl('Merc', name), list(.N, median(hp))]
     N  V2
  1: 7 123

  # Non-Mercs?
  > dt[!grepl('Merc', name), list(.N, median(hp))]
      N  V2
  1: 25 113

  # N observations and avg hp and wt per {num. cylinders and num. carburettors}

  > dcast(dt, cyl + carb ~ ., value.var=c("hp", "wt"), fun.aggregate=list(mean, length))
     cyl carb hp_mean  wt_mean hp_length wt_length
  1:   4    1    77.4 2.151000         5         5
  2:   4    2    87.0 2.398000         6         6
  3:   6    1   107.5 3.337500         2         2
  4:   6    4   116.5 3.093750         4         4
  5:   6    6   175.0 2.770000         1         1
  6:   8    2   162.5 3.560000         4         4
  7:   8    3   180.0 3.860000         3         3
  8:   8    4   234.0 4.433167         6         6
  9:   8    8   335.0 3.570000         1         1


I used slightly verbose syntax so that it is (hopefully) clear even to non-R users.

You can see that the interactivity is great at helping you compose answers step by step, molding the data as you go, especially when you combine it with tools like plot.ly to visualize the results.
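A minimal sketch of that workflow, assuming the plotly R package is installed and rebuilding `dt` from the built-in mtcars data (the `keep.rownames` column name is an assumption about how the table above was constructed):

```r
library(data.table)
library(plotly)

# Rebuild the example table: mtcars with the row names as a "name" column
dt <- as.data.table(mtcars, keep.rownames = "name")

# Aggregate interactively, then hand the result straight to a plot
agg <- dt[, .(mean_hp = mean(hp)), by = carb]
plot_ly(agg, x = ~carb, y = ~mean_hp, type = "bar")
```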



What a lot of people don't get is that this kind of code is what R is optimized for, not general-purpose programming (even though it can totally do that too). While I don't use R myself, I did work on R tooling and saw plenty of real-world scripts - and most of them looked like what you posted, just with a lot more lines and (if you're lucky) comments, but very little structure.

I still think R has an atrocious design as a programming language (although it also has its beautiful side - like when you discover that literally everything in the language is a function call, even all the control structures and function definitions!). A language could be optimized for this sort of thing while still having a more regular syntax and fewer gotchas. The problem is that in its niche, R is already "good enough", and it is entrenched through libraries and existing code - so any contender can't just be better, it has to be much better.
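To illustrate that "everything is a function call" claim: in R, control structures, assignment, and even braces are ordinary functions that can be invoked with backtick syntax (a quick sketch, runnable in any R session):

```r
# `if` is a function: `if`(condition, then-branch, else-branch)
`if`(TRUE, "yes", "no")   # same as: if (TRUE) "yes" else "no"  -> "yes"

# Assignment is a function call too
`<-`(x, 10)               # same as: x <- 10

# Even braces: `{` evaluates its arguments and returns the last one
`{`(1, 2, x + 5)          # -> 15
```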


Completely agree. dplyr is nice enough but the verbose style gets old fast when you're trying to use it in an interactive fashion. imo data.table is the fastest way to explore data across any language, period.
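For comparison, here is the same group-by as in the grandparent comment in both styles (a sketch, assuming dplyr and data.table are installed; `dt` is rebuilt from the built-in mtcars data):

```r
library(data.table)
library(dplyr)

dt <- as.data.table(mtcars, keep.rownames = "name")

# data.table: terse, one bracket call
dt[, .(hp = mean(hp), wt = mean(wt)), by = carb]

# dplyr: more verbose, but reads left to right
dt %>%
  group_by(carb) %>%
  summarise(hp = mean(hp), wt = mean(wt))
```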


I strongly agree, having worked quite a bit in several languages, including Python/NumPy/Pandas, MATLAB, C, C++, C#, and even Perl. I am not sure about Julia, but last time I looked at it, the language designers seemed to be coming from a MATLAB-type domain (number crunching) rather than an R-type domain (data crunching): Julia had a solid matrix/vector type system and syntax, but was missing a data.table-style type system and syntax.


Julia v0.7-alpha dropped and it has a new system for missing data handling. JuliaDB and DataFrames are two tabular data stores (the first of which is parallel and allows out-of-core for big data). This has changed pretty dramatically over the last year.



