For anyone interested in R without a background in statistics: I would highly recommend learning the two in parallel (if not statistics first).
R is first and foremost a language for statistical computing. You really aren't going to get much out of it without working on some interesting data/stats problems. Plus for most hacker types I think being able to play with the statistics you're learning about with R can be a great learning aid.
However not only is it beneficial to learn stats with R, it is imho dangerous to learn R without some stats. There's already too much research being published where 'p-value' means "the thing that t.test() outputs that I was told needs to be in the paper".
Because R lets you play so freely with stats I find it a great tool to gain greater intuition about certain mathematical principles, but there is a temptation to let the tool do the work and the thinking for you.
This is a very, very good point. Though, many of the functions that R provides just won't make any sense at all if you don't have an intuition for the statistics behind them. I have found myself reading the papers published about specific functions in order to understand the results.
Do you have any resources that you suggest for beginners in statistics looking to learn on their own?
Here are the resources I've used and found helpful (plus notes) for getting at least a basic grounding in statistics.
Khan Academy + "Head First Statistics"
-No matter what direction you want to go with your learning and before you start getting into street fights over Bayesian vs Frequentist philosophies, "stats 101" is the common base of 'statistics' for: Physicists, Engineers, Sociologists, Doctors, Economists, Journalists and more. Yes there are many "truths" in 'stats 101' that can be questioned, but this is the common language you'll want when talking to any of these people.
"What is a p-value anyway?"
-A very quick, non-math heavy read on avoiding magical thinking in classical statistics.
"Data Analysis with Open Source Tools"
-I really love this book because it steps way out of "stats 101" land and gives a lot of really great real world advice and touches on many mathematical tools that can help you understand data. The chapter "What you really need to know about classical statistics" gives a refreshing overview of all the basic stats you've covered and explains many things that seem strange. Also the bibliography/recommended reading list from this book is fantastic.
The biggest thing I've noticed as I've learned more is that statistics is a gigantic field, and can mean extremely different things depending on what field you're in. Obviously learn the advanced statistics relevant to your interests, but when you get a chance listen to how other people think about stats, and keep learning.
I'm taking this right now. Can somebody provide some perspective on how thoroughly the course prepares you for using R? Does it give you a good enough grounding in statistics so that you won't shoot yourself in the foot with R?
How about not-beginners looking to refresh / deepen their intuitions?
I've recently been working with the Python toolset in this space -- pandas, numpy, matplotlib -- and run smack dab into my rusty regression analysis. In particular I need to better understand the distribution assumptions underlying the error distributions and the variances around the coefficient and intercept values.
Any suggestions for some deeper study / refresher?
Depends on the data sets you want to work with. For straight-up linear regression, with a heavy emphasis on observational data appropriate for microeconometrics, "Introductory Econometrics: A Modern Approach" by Jeff Wooldridge is absolutely phenomenal (an old edition is fine). (This is usually assigned for advanced undergraduate econ majors or non-advanced masters students; I don't know what the equivalent would be for undergraduate stats majors).
For more "intuition about working with data, especially if you're a visual person," Howard Wainer's books are wonderful; one example is "Graphic Discovery: A Trout in the Milk and Other Visual Adventures." They're non-technical, short chapters, discussions of different data sets.
Bill Cleveland's "Visualizing Data" and "Elements of Graphing Data" cover the same material -- graphing data -- at a more technical level. I don't know whether Cleveland's books would help with the issues you asked about, but... they are amazing books and if you're interested in the subject at all I can't recommend them highly enough.
I don't have any free recommendations, unfortunately.
Seconding Wooldridge. It's the only economics text I'm keeping from university - I'm getting rid of all the rest. It really digs into the material and highlights the pitfalls and incorrect assumptions in regression and forecasting. I'm planning to consult it when I start working on analytics in current/future projects.
Honestly, the only way to understand regression is to study something like McCullagh & Nelder's book. Anything else and you are going to have a very hard time really being useful without misinterpreting the results. There are some real subtleties to interpreting regression coefficients and, more importantly, to structuring your data in such a way that you will answer the questions you want.
It's not an easy book, but if you've gone through to at least third year level in statistics it's approachable and you will understand it to a deep level.
Gelman and Hill is a wonderful and under-rated book. My guess is that its clumsy title (Data Analysis Using Regression and Multilevel/Hierarchical Models) hides the fact that it's an introductory textbook that takes the reader from knowing nothing to eventually constructing complex Bayesian models. Plus, it's a pretty good tutorial on R and BUGS.
While it is predominantly a statistics language there is also a huge wealth of data manipulation capabilities in functions like plyr, aggregate, *apply, ave, subset, etc.
Just in terms of organizing data sets, ignoring any statistical analysis, R is fantastic.
I've found Python + Pandas much better in this regard than R. Maybe it's just me, but for grouping, indexing, and manipulating tabular data, Python syntax just makes more sense.
That said, R is better for stats and matrix operations.
They might have borrowed from R.
Wes McKinney admits to being influenced by R, especially data frames... but it makes data analysis all the easier when I can do everything I want within the Python environment.
Pandas is proving to be a bit of a longer learning curve, I must admit, but then the Python environment and native Matplotlib support made life oh so much simpler.
The split-apply-combine framework dealing with group by tasks (http://www.jstatsoft.org/v40/i01/paper, not that there aren't other precedents) for one. But generally, Wes has used R to figure out what people want to do, and then ported an elegant interface to python.
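For anyone unfamiliar with the framework, a minimal sketch of split-apply-combine in R with plyr (the data frame and column names here are made up for illustration):
library(plyr)
# hypothetical data: one row per measurement, grouped by species
d <- data.frame(species = c("a", "a", "b", "b"), height = c(10, 12, 20, 22))
# split by species, apply mean() to each piece, combine the results into a data frame
ddply(d, "species", summarise, mean_height = mean(height))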
Can you elaborate on data.table being 'a game changer'? I am inclined to agree, but I am just starting to get a handle on it. I am still hesitant and switching between sqldf, reshape2, base::merge and data.table more than I would like. Do you think it could become a dominant method for data preparation?
Python has PyTables which complements Pandas nicely and seems to offer the same sort of features as data.table (note, I've not actually used data.table)
I am using R to analyse and document (knitr and latex) epidemiologic data which does not involve parsing a lot of text to extract my analysis data set. Data preparation for this type of research involves more combining data from different source tables, restructuring repeated measures, etc. I only know how to do that using R. Can Python be incorporated into the knitr literate programming framework and is it worth learning another language?
This is great. I've nearly completed a class at UC Berkeley which was almost entirely in R and I can say with certainty that it is a marvelous language. It is powerful, concise, and has an incredibly robust community. I've experimented with many programming languages, but I have not used one which allows you to experiment as rapidly as R.
I'm currently going through the Codeschool lessons to see if there is anything that I may have missed in my class. So far so good!
Edit: The most important thing that I didn't see covered in the course was RStudio. Considering that R is more of a scripting language than a programming language, I've found that RStudio is instrumental in using the language to its full capacity. While it's certainly possible, and in some cases optimal, to use R from the command line, my experience is that the GUI features of RStudio are incredibly powerful. The ability to browse data frames and have graphs show up in the context of your work has been very useful for exploring and understanding data. Otherwise, the course does a pretty decent job introducing readers to the language and its data structures.
I'm just going through this and while I love the concept, there is one criticism from me. The course is very comfortable to walk through, but it doesn't make use of the benefits of the online environment.
It is just "this is how it goes, now type it" kind of teaching. I am almost done with the second session, and most likely have completely forgotten most of the content from the first. If anybody else is going to try teaching stuff in a similar way, please let me try to play/try out stuff as early as possible. Even if the exercises are completely pointless, please make them a bit harder than just exchanging an "+" for an "-". It feels so pointless having such a great learning environment and not using it to make it feel less of a brain-dump process.
Same here. And sometimes after being introduced to a concept or function, I will wonder, "hmm, what if I do this.." but the embedded REPL doesn't allow you to tinker.
Thanks. I was sorely disappointed yesterday when I found out that Coursera classes follow a strict schedule (and that I couldn't look at the material right then) and that I wouldn't be able to try the Data Analysis with R course on their site.
I feel like it starts off a little bit on the wrong foot by introducing basic types as scalar variables. In reality R has no scalar variables; everything is a vector or list, and scalars are immediately coerced into a vector, e.g.:
> is.vector("a")
[1] TRUE
This might seem like nitpicking but it leads to a world of confusion when programmers used to languages with scalars start trying to use R that way, and it took me several months of confusion and weird bugs before it finally clicked and I started understanding R better.
Can you give an example where the confusion between a scalar and a length one vector is important? I'm trying to figure out how to better teach R to people familiar with other languages and understanding your stumbling blocks would be v. helpful.
For a strong conceptual grasp of how the language works, I think it is fundamental that students learning R (especially those with a history in other programming languages) understand that there are no scalars in the language. The main argument that I would make for this is that nearly all R functions can operate on vectors with a length greater than one. By understanding that when you send a "scalar" to a function you are actually sending a vector, I believe it is much more conceptually clear that you can, and should, send larger vectors to functions and can receive the expected results. This is in comparison to most other programming languages where it would be necessary to iterate over a list or array in order to operate on each individual element.
For example, if you told somebody familiar with, say, PHP to add 2 to each element in a vector, they would likely break out the oh-so-familiar for loop to iterate over each element and apply the transform. This is completely suboptimal in R, as you could just do vector + 2 and receive the exact same thing.
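A quick illustration of that point (made-up numbers):
> v <- c(1, 5, 9)
> v + 2        # the whole vector is transformed at once; no loop needed
[1]  3  7 11
> sqrt(v)      # most base functions are vectorized the same way
[1] 1.000000 2.236068 3.000000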
This isn't quite what you asked for, but I've run into problems where the distinction between scalars and 1x1 matrices was important. iirc, something like
crossprod(y, crossprod(A, y)) * V
where A was nxn, y was nx1, and V was an arbitrary matrix throws an error while
YES. I had to relearn misconceptions about R four, five years down the line but the truth is not widely advertised (that everything in R is actually a vector).
I really like this, but one big complaint is that the auto-scroll after completing a little task isn't correct. So each time after I "pass" a particular section, I have to manually scroll down with my trackpad.
Gregg here from Code School. We actually tried making it autoscroll down at first. It was too distracting and annoying. Scrolling down when you're ready to continue just felt more natural.
I found another rare, but annoying thing - on German keyboards the "~" is hidden away on the combination "Alt Gr and +", when I try to type that into the prompt under Firefox 17.0 nothing happens. Workaround is to copy-paste. Under Chrome 23.0.1271.95 the same key-combination inserts single quotation marks, weirdly enough, same workaround.
As far as I can tell, it's not possible to go through the lesson using just a keyboard; a mouse is required to scroll or to change focus from the embedded interpreter back to the page itself so that the arrow keys will scroll.
R doesn't seem to get much frontpage love on HN, or even if it does and I haven't seen, what would people suggest is the technology for statistics going forward? I really hoped it would be around Clojure (e.g. Incanter[1]) and not Python, for entirely selfish reasons.
The old joke is that the reason why R is awesome is that it was created by statisticians, and the reason why R sucks is that it was created by statisticians. As an every-day user of R, I can't help but think that description is perfect.
It also means that R isn't going away. It is getting more popular, and there is a ton of work on improving the runtime, which will only mitigate people's itches to move away from it. But most importantly, it has the network effects to its advantage.
Even as a Clojure lover, I can't see R ever being substituted. I see more hope for the Renjin project than I see for alternatives like Pandas/SciPy/NumPy, Julia, or Incanter.
And that $300 price tag goes up to $1000 if you want a license that lets you use Mathematica in any sort of commercial or professional context (it only goes up to $500 if you only want to use it an academic setting).
I highly recommend supplementing a course like this (where you learn about the language's ecosystem) with the R Cookbook from O'Reilly. It's been a lifesaver for me, and helped me learn R over the course of a few months of needing it at a new job.
Now I find that I need to learn something else for data munging- R is terrible at data manipulation and querying.
The querying bit is solvable with the incredibly useful sqldf package. The package allows you to use SQL syntax to query your data.frames (by creating, populating, querying and deleting a temporary SQLite table in the background).
Example: I have a dataframe named dfrm with columns named "id" "height" "name"
If I want the heights of all people whose names start with D, I would need to use:
> dfrm$height[which(substr(dfrm$name,1,1)=='D')]
Terse, but painful. Compare to:
> sqldf("select height from dfrm where name = 'D%'")
I actually find base R excellent for data munging and manipulation, even without using additional packages. Here is a reproducible example that very easily accomplishes what you were trying to do (first two lines just set up a sample data frame)
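Something along these lines (column values are made up for illustration; the columns match your example):
# first two lines just set up a sample data frame
dfrm <- data.frame(id = 1:4, height = c(170, 182, 165, 178),
                   name = c("Dave", "Anna", "Dana", "Bob"), stringsAsFactors = FALSE)
# heights of everyone whose name starts with "D", no which() gymnastics needed
subset(dfrm, substr(name, 1, 1) == "D", select = height)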
Basic R functions like subset, transform, with(in), reshape, aggregate, {a,m,t,s,v}apply, match, grep(l), by, split, table, etc. allow you to accomplish just about any data frame munging you might want. Add on the plyr, reshape2, data.table, xts/zoo packages and you're ready to tackle just about anything.
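A couple of those in action on a hypothetical repeated-measures data frame (names and values made up):
d <- data.frame(subject = rep(1:3, each = 2), visit = rep(1:2, 3),
                score = c(5, 7, 6, 8, 4, 9))
aggregate(score ~ subject, data = d, FUN = mean)   # mean score per subject
with(d, tapply(score, visit, max))                 # max score per visit
reshape(d, idvar = "subject", timevar = "visit", direction = "wide")  # long to wide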
I'm not a big fan of sqldf because imo R is not supposed to act like SQL. Using sqldf in practice would require a lot of query string manipulation and takes away from the nice functional features of R.
Nevertheless, it is very easy to write incomprehensible R code. The best way to avoid this is to take one of the existing style guides (Google, Hadley Wickham's) and adopt it seriously.
One drawback with R is that in computations like this, several intermediate data structures with one dimension equal in length to nrow(dfrm) are allocated. Traversing an iterable of tuples is a simple way to think about it, is efficient, and ties in with other technologies e.g. relational databases. R is often people's first language (e.g. science graduates) and those people would be better off learning how to iterate over tuples than learning the obscure bestiary of data structure manipulators you point out.
I've been using R extensively for the past 12 months and have achieved a high level of comfort with the language. Now I find myself at a wall because of my lack of math and statistics background. I've taken R as far as I can, or put more properly, R has taken me as far as I can go without learning more math.
With that said, I have little reason to use R right now except for its excellent plotting ability with ggplot2. Otherwise for data munging, wrangling, connecting to databases, doing unit testing, etc - R is a giant PITA. Better to stick to Python for that. And as I learn D3, I think I'll use R even less for visualization.
Therefore R will only be valuable to me once I can harness its power for data mining and machine learning, which is its killer feature, IMHO.
Do you think the love it/hate it dichotomy over R for data 'munging' stems from different ways of thinking about data? I'm slowly getting comfortable in R since returning to work in a sort of freelance arrangement that makes me highly motivated to use free or affordable tools. I started out, however, in clinical epidemiology data analysis using MS Access and SAS. I still think of data in terms of rectangular data sets, RDBMS and sql. I have a hard time with vector and matrix related terminology. I think I'm going to end up using reshape2 and data.table a lot since sqldf is noticeably slower even with my small data sets (compared with web analytics, finance, etc). The problem with sqldf and variable names containing a dot is a real drag as I try to adopt good coding style. I am missing the clarity and familiarity of sql statements, though, as I try to find my new workflow in R. I hope a more unified approach to data munging emerges soon.
BTW, I totally espouse the reproducible research (RR) method of documenting study design, analysis, interpretation... I am loving knitr and latex for RR so I can no longer imagine using different tools for data munging and analysis.
I'm constantly impressed by the high quality content that Gregg and the rest of the Code School folks put out on a regular basis and Try R is no exception. Really excellent work! Looking forward to getting through the rest of it.
So, I've started using R for some stuff I'm doing at work. I have to say that I'm basically treating it as a non visual spreadsheet. Seems everything I've used it for so far, I could have done with excel. Am I doing it wrong?
Not doing it wrong, but only using a subset of R. For instance, R has powerful data manipulation abilities that can get your data into shape for the subset of R that does what Excel does. R also has a huge library of packages that go way beyond what Excel can do, especially for statistics. Sure, you can do an OLS regression in Excel, but can you do a complicated machine learning model?
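For instance, a random forest is a couple of lines with the randomForest package (one arbitrary example among many on CRAN; this sketch just uses the built-in iris data):
library(randomForest)               # install.packages("randomForest") first if needed
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rf)                           # out-of-bag error estimate and confusion matrix
predict(rf, newdata = iris[1:5, ])  # predicted species for the first five rows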
I should have been clearer. I've never been "good" at using Excel. So far, I've really only done things that I would think you could do in Excel. Might be a touch overkill, but I'm not sure. I am curious which data manipulation items you are referring to. Any good pointers?
I am going through the Machine Learning for Hackers book, though. So far it has been interesting. I guess I never realized that machine learning is essentially statistics. (Or am I looking at that incorrectly, too?)
The biggest draw to me is actually the non-visual part of it. I don't get hung up on silly visual things such as column width. Or, where to put the plots I make.
Granted, this has its own downsides. But so far I'm loving it. And yes, being able to essentially save off just what I did so that I can rerun the same tasks again later on a new data set is really really nice.
I am currently reading "The Art of R Programming" by Norman Matloff and it is a good book for R beginners. Some familiarity with maths and stats basics is obviously required, though.
Was keen on trying this out. Crashed on me on the variables bit on the first page; it just spun and spun. Happens a lot with these online interpreter things.
Oh man I really wish I had this at the beginning of the semester. I'm towards the end of a grueling stats course - difficult, and not the best professor. Each homework assignment I feel like I barely scrape by without really learning. This is the first time I've ever felt this way about school.
I was very interested in the course syllabus for 'Statistics One' by Prof. Andrew Conway. I missed the course on Coursera and now I'm unable to view the course archive. Does anyone know where I can find the lectures?
(Yes, I've googled some.)
I took that course when it was running on Coursera, and I honestly can't recommend it (in its current state, at least) to anyone looking to learn basic statistics.
It covered a lot of material, but the quality and order of coverage was very inconsistent. The first couple weeks were fine, but it felt really odd to jump from correlations and scatterplots into regression, then come back to t-tests and AOV afterwards. There were also some errors in the R code on the slides, which led to a lot of confusion on the discussion forums during the class. As a student, it didn't feel like the class's pedagogical approach was very good, and I'm now finding myself using other resources to fill in the gaps.
If you'd like to hear more about those other resources I'd gladly post a list, but they're mostly Python-centric. One that I can whole-heartedly recommend even if you stick with Prof. Conway's class is the set of lectures from Roger Peng's "Computing for Data Analysis" class on Coursera. The course itself isn't available at the moment, but the videos are on his Youtube channel[1]. It teaches R from a programming perspective, and you'll find the content invaluable once you start writing R code that's more complex than a couple stats functions and a plot.
Hi, thanks a ton for the detailed response. Luckily I don't really need a Stats 101, so I don't think I'll mind him jumping around. If, of course, it does get a bother I know which course is right for me. Till then I'm also doing a bit of Thrun's Udacity Stats course on the side.
I would actually appreciate a list of resources in Python, that's what I like using most! I have downloaded a copy of "Think Stats", but haven't gone through it yet.
Sorry for the late response, I completely forgot about this post!
Looks like you're on the right track though, "Think Stats" and Udacity's stats class were the main things I was going to recommend. I'd also recommend checking out IPython's web notebook for inline charts and general awesomeness, and the Pandas library for an R-style data frame built on top of NumPy. The best resources for learning about IPython are probably screencasts, and the author of Pandas has a book out named "Python for Data Analysis" that covers IPython, NumPy, Pandas and some matplotlib.
And no I don't think so, as they claim "Amara gives individuals, communities, and larger organizations the power to overcome accessibility and language barriers for online video."
You just helped me watch a course that was posted for free online. Isn't that the point of this? :)
I use R almost exclusively and I absolutely love it. But people can get a little carried away with picking and recommending the "best" language for particular task.
If you feel very comfortable and skilled in Matlab, and you aren't finding that there are regular situations where Matlab can't do what you want, I wouldn't really advocate switching. Same goes for someone working primarily in Python.
The biggest reason I would recommend switching away from a language you're already very comfortable in would be the availability of statistical methods that aren't present in your language of choice. Statisticians tend to work in R, so much of the cutting edge work ends up in packages on CRAN.
So I wouldn't dump Matlab if you're happy with it, unless you're just looking to learn something new, out of intellectual curiosity, which is always fun.
I think your response here and to the other poster above is quite sound. I also program regularly in Matlab, R, and Python, but when it comes to data analysis, I do find that R is just much more concise than the other two (the data manipulation and statistical analysis tools are more high-level). Learning R and going through its tutorials (the MASS book and actually the S-PLUS Statistics Guides) and learning about functions available to me made me learn a lot more about stats. I sometimes use Matlab for image processing and optimization, or some simple simulation but less and less these days (trying to replace it with SciPy since I use Python a lot in my workflow).
But I also agree that if you're already proficient with Matlab and happy, then maybe you don't need to learn R (though often you can be blissfully ignorant of your possibilities if you are unaware of the vast libraries that another language/environment offers).
m <- lm(y ~ wage + degreeAttainment + jobTenure, data = mydataset)  # fit a linear model
summary(m)                  # coefficient estimates, standard errors, R-squared
plot(m)                     # diagnostic plots: residuals, Q-Q, leverage, etc.
predict(m, myotherdataset)  # predictions on a new data set
Things like heteroskedasticity-robust hypothesis testing, GLMs, and time-series models are supported without purchasing extra packages or coding them yourself. The graphics are much better and lattice/trellis plots are supported natively (I don't think there's any good way to do them in Matlab).
Basically, the stats and graphics are easy enough to type that you can do all of the exploratory data analysis that you (should) feel guilty for skipping when you use matlab. :)
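For instance, robust standard errors for the model above are a one-liner once you grab the sandwich and lmtest packages (a sketch reusing m from above; the binary outcome in the glm() line is hypothetical):
library(sandwich)  # heteroskedasticity-consistent covariance estimators
library(lmtest)    # coeftest() to redo the hypothesis tests
coeftest(m, vcov = vcovHC(m, type = "HC1"))  # White/robust standard errors
g <- glm(employed ~ wage + jobTenure, data = mydataset, family = binomial)  # a GLM (logit); outcome is hypothetical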
I've used both extensively (though not for statistics). R's syntax is a little more C-like and consistent than MATLAB's - however the biggest difference is documentation.
R's, like most open source documentation, is rather terse and often very unsatisfactory. This gets especially apparent once you get into 3rd party libraries and use things like Bioconductor. You’ll have no idea how things are designed to be used, and without a guru at hand to walk you through you’ll be in a world of pain. Googling for solutions is also very difficult (even using something like RSeek). There are some archaic boards that sometimes have what you need, but often you'll get stuck and not know what to do. What’s nice about MATLAB is that all the libraries are made by a competent team of engineers and they put in the money/resources to have good documentation. Even the more abstract rarely used libraries have decent documentation. In R, if you try using non standard libraries, you’re gunna get screwed.
The IDEs for R are also worse. RStudio is quite nice, but it's really bare-bones compared to MATLAB's IDE. The one really neat thing about it is that you can host it on a server and then remotely work on your work by just going to a URL.
Also I think there are legacy issues in R (though MATLAB has those too). So there are, for instance, matrices, data frames and lists (which are technically vectors, but don't behave like them). Why there are these three formats that fundamentally do the same thing is beyond comprehension. (Maybe someone can give some insight.) Functions will randomly return one type or another. I always find myself fighting to keep the types consistent and R keeps trying to mess with me. In MATLAB everything is a matrix, so that makes things a lot easier.
Fundamentally the issue is that MATLAB has a much larger user-base than R, so you'll just have a much easier time working with it.
If R's documentation and community were on the same level as MATLAB's then I would maybe consider recommending it. If you work in genetics and you need to use something like Bioconductor, then R is a must I guess. For most other libraries, fundamentally it's just some syntax differences.
The expression "You get what you pay for" is really pertinent here.
Note: I personally still use R for plotting, because I’m personally more familiar with it. Otherwise I try not to touch it. Code organization for me always gets messy, but I guess that’s cus I’m used to writing in OO languages.
This pretty much perfectly illustrates my comment below, pointing out that these sorts of recommendations are entirely subjective and useless.
Many of your points are quite subjective. I could do the same thing with Matlab. For instance, I find it mind boggling that anyone could get anything done when you have to devote a separate file to every single function. That seems incomprehensible to me. And yet, I realize that that's probably a mostly subjective thing that you get used to.
Personally, I find R's documentation excellent. When people complain about it, it's usually because they have mistaken it for a tutorial. It's not. It's documentation.
Without any data, I seriously doubt your claim that Matlab has a much larger user base. (There is considerably more activity in R on StackOverflow than in Matlab.)
Your complaint about matrices, lists and data frames is similar. Data frames exist for the same reason that there's a mean() function: a columnar data structure that holds different data types in each column comes up so often and is considered so useful that it is built in.
pandas in Python was developed in a way that went out of its way to specifically _mimic_ these data structures because data frames are considered such a vital aspect of R.
And keep in mind that these criticisms are all coming from someone who _also_ recommended against switching...!
You make good points, however I have to take issue with the documentation.
> When people complain about it, it's usually because they have mistaken it for a tutorial. It's not. It's documentation.
I don't really see the distinction. Documentation is supposed to explain to you how to use the code. You can call it whatever you want. If it's through a tutorial, then why not. R - and especially the non-standard packages you download through CRAN - has very terse documentation that barely explains how each function works on its own, and much less how it works in the context of the rest of the package. You can't just tell the user what goes into the black box and what comes out and expect people to be able to use your software.
Sure, there are vignettes (I think that's the term), but they're really inadequate b/c they only scratch the surface of how the package is meant to be used.
Anyways, that's my 2 cents. I've spent soooo many hours fighting with R documentation trying to figure out how to get what I needed done. Sometimes months later I would find out there is a much better way to do something that simply was not explained anywhere. I'm OK at R now, but I went through a lot of pain to get to where I am now. I'd never wish it on anyone else.
My experience with MATLAB on the other hand has always been very pleasant. I spent like 3 hours going over the tutorial on how to use it (much better than R's "Introduction to R") and I hit the ground running. When I needed something, a quick search through the help or online always turned up results.
From my memory, MATLAB's documentation not only discusses the implementation, but also discusses the statistical/engineering methodology. It's overkill and can be pretty annoying (paging back and forth between different parts of the help can be somewhat time consuming) when you actually know the statistics but just want to understand the implementation. Hence the distinction between "documentation" and "a tutorial".
I don't know whether it's an explicit or implicit design choice or just a happy accident, but I'm grateful that the R documentation doesn't try to hold anyone's hand and guide them through data analysis beyond their training.
A vector is a 1d container in which each element is the same type. It's like a numpy array in python. Matrices are essentially numerical vectors with a dimension attribute. They are for numerical calculations, in particular matrix calculations. The only similarity between Data frames and matrices is that they're 2D. Think of a data frame as a spreadsheet: one column can contain dates, another floating point numbers, another a categorical variable (factor). A list is a 1d container in which each element can be pretty much anything. Suppose you had a list in which each element was itself a vector, and all those vectors have the same length. That's what a data frame is.
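A quick way to see the distinctions (a tiny sketch):
v <- c(1.5, 2.5, 3.5)              # atomic vector: every element the same type
m <- matrix(1:6, nrow = 2)         # matrix: a numeric vector with a dim attribute
l <- list(1, "a", TRUE)            # list: elements can be anything
df <- data.frame(x = c(1.2, 3.4),
                 grp = factor(c("a", "b")))  # data frame: equal-length columns of different types
is.list(df)                        # TRUE -- a data frame really is a list underneath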
I came here to say this as well - Don't. This is of course my own opinion, and I can only promise this comment will get progressively more subjective the farther you read.
R's documentation is terse and unorganized. Anyone who says different is obviously someone with more experience with the language (which is not how it should be). When people complain about the documentation it's because they're trying to learn, and if the documentation isn't together learning is painful. Learning R is painful.
Next- Learning is painful, and the documentation is horrible because: the actual built in functions or extensions are a nightmare of arguments that [sometimes do/sometimes don't work]. If you plot x and you don't want the key to show up then set auto.key=FALSE, if you plot class(x) = [something else] and don't want the key to show up set colorkey=FALSE, if you want to layer plots: load this library and format your data to a new S3/S4 (object) type, add them together, then plot them (without the library you can't add these objects)... In case you didn't catch that: I wanted to layer the plots so I had to load a function that made the objects I wanted to plot add, not load a library that would change the plotting driver.
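To make the kind of inconsistency I mean concrete, a minimal sketch with lattice and built-in data sets:
library(lattice)
xyplot(mpg ~ wt, data = mtcars, groups = cyl, auto.key = FALSE)  # scatterplot: the key is controlled by auto.key
levelplot(volcano, colorkey = FALSE)                             # level plot: the key is colorkey instead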
The community around R is disenchanted. If you get into the right place on the web and ask about a feature that doesn't exist: more often than not I've seen the typical guru response of the type "It doesn't work that way - that's a feature, not a problem." Change is bad, and in my experience generally discouraged.
Finally (and perhaps this really doesn't need to be said) the more I've been working with it - the more I feel like the reason it is so popular is because you can load it up read your data set, google some of the more central functions, and poof - get that set of characteristic statistical answers everyone needs to have in their [presentation/HW assignment]. If you look at the answers in this thread you get 2 types A) it's impossible to use, B) it does standard statistical analysis really easily. In my narrow world view it appeals to people who don't work with software a lot - professionals who need some automated numbers on the side without hiring someone to do it right.
That all said, I still do use R. You can do some nifty statistical analysis pretty easily and push it to a vector plot, pop it open with your favorite editor and post edit it super pretty (in a few days).
Base graphics is powerful but definitely full of crazy. However, ggplot2 is well designed as one integrated system and makes total sense. The model is a bit more complicated to understand than literally specifying what goes where as you do in most graphics APIs, but the documentation is very good and there are lots of people who know it well and respond helpfully on mailing lists and StackOverflow.
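To give a flavour of that model (data, then aesthetic mappings, then layers), using the built-in mtcars data:
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +  # map columns to aesthetics
  geom_point() +                                              # add a layer of points
  facet_wrap(~ cyl)                                           # one panel per cylinder count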
The main reason I would put forward is that it's completely free and open source. That means that all your work is transferable and usable by anybody, not just people who have the ability to use Matlab (including possibly your future self).
I must admit, that sounds plausible and I'm not an expert with Matlab or Octave, but I understand the compatibility is reasonable but nowhere near complete. There seem to be a pretty good number of problems discussed here:
But really I just feel like (certainly in my field) the open source community around R is huge and thriving and it's a better bet than Octave in terms of that.
I've used Matlab for about 15 years and R for about 3. Matlab burned me pretty badly during my graduate career; two weeks after deciding to learn R, I'd recreated a data analysis/modeling/graphing problem I'd previously done in MATLAB, using about one third the amount of code in R. Perhaps R and its available libraries better suit the way I think about problems, but most people in my department agree I code circles around them in Matlab, and I've since repeated the experiment with similar results.
So, some reasons a Matlab user might want to consider learning R:
-- R is currently the lingua franca for academic statisticians. New methods papers, textbooks, and toolkits are much more likely to ship with R libraries and implementations than Matlab or anything else.
-- Speaking of statistics, the MathWorks' most recent revamping of the Statistics Toolbox is an obvious imitation and pale shadow of base R, giving you a more verbose way to do half of what base R does for statistics, and then R has everything available on CRAN to add to that.
-- R is a smaller, yet more expressive language. To be fair, it has about the same density of WTF and non-orthogonality as Matlab (which see, Patrick Burns' "The R Inferno") but makes up for it by being much better at functional programming, and by having more of the Lisp nature in general (R is homoiconic; you can write R code that manipulates R code). If you want object-oriented programming you're about equally screwed in both languages though. R has syntactic support for named/optional arguments to functions, as opposed to Matlab's horrible InputParser/nargin hacks.
-- and in general the idea of giving names to things (rows and columns of a matrix, individual elements, function arguments) is pervasively supported in R. Matlab doesn't even have a decent approximation of a hashtable.
-- R is not quite as insistent on being its own universe. For example, you can write R scripts invokable directly from the shell, without jumping through awful "expect" style hoops and waiting 30 seconds for system startup / license server failure every invocation. For reproducible analysis you really want a build tool ( http://archive.nodalpoint.org/2007/03/18/a_pipeline_is_a_mak... ); so having your analysis scripts being callable by Make or SCons is a no-brainer.
-- Speaking of reproducibility, much of the "reproducible research" movement (which basically says, "hey, maybe scientific data analyses and papers should be done with version control, build automation, and maybe even testing, like software people have been doing since forever") is centered around R. I'm currently doing a project with "knitr," an R library that helps writing reproducible reports; if I want to talk about a particular graph or cite a p-value, I don't manually copypasta the data into a word processor; I write in my document the command to compute the value or plot the graph, and it gets updated whenever I render to a PDF. That ensures that results keep track of any changes in dataset or analysis.
-- The R community in general is more frank about its shortcomings and limitations, which might only be possible in a free software project. For both systems, you can say, word for word, that "there are a lot of awful decisions that (R/MATLAB) inherited due to (S/MATLAB)'s 1970s origins in (John Chambers/Cleve Moler)'s attempt to build a useful sort of interactive shell over Fortran numerics libraries, which it turns out should not be what you build a real programming language on top of." The difference is that the R folks will talk openly about the ways in which R sucks, but you won't get any such acknowledgement from the Mathworks.
-- R-help is both more active and contains smarter inhabitants than comp.soft-sys.matlab. Similarly for R/Matlab questions on StackOverflow.
-- R has a much better packaging system. Actually I should change the emphasis: R even has a packaging system. Libraries install from CRAN/Bioconductor/R-Forge with one command, and installing them doesn't tromp all over the global namespace. The code is much higher quality than you find on the Mathworks File Exchange; most of the time I look on the File Exchange anything that looks like it solves my immediate problem hasn't been updated for 5 years and no longer works. CRAN on the other hand has maintainers who will remove packages that stop working. Consequently, people take more ownership of their packages.
-- ggplot2 is the best library in any language for taking your data and making a useful 2d plot out of it. I've written hundreds of lines of Matlab to build graphics that are a couple of phrases in ggplot. On the other hand, Matlab is better at 3d graphing (which I hardly use) and interactive graphics (but there's a lot of people attacking that on the R side.)
-- Similarly I've written hundreds of lines of Matlab to do data manipulation operations that are like breathing air with Hadley's other great library, plyr.
-- Downsides? R is somewhat slower (if you want to compare two laughably slow languages; we're talking roughly CPython vs Ruby). Matlab has a better IDE with better debugging facilities. Depending on your field you might have more colleagues that are familiar with Matlab (true for engineering, definitely false for statistics.) R's online help is harder to navigate, which lends it a somewhat more difficult learning curve. Actually creating a package to distribute your code is pretty hard to figure out.
> R is somewhat slower
Actually it's MUCH slower, to the point of being entirely unusable for very large datasets (even ~100GB). True, much of MATLAB's speed comes from using highly optimized BLAS (Math Kernel Library by Intel). But not just that. R lacks JIT optimization and numerous attempts to add it were unsuccessful. In fact it's so bad that Ross Ihaka, one of R's creators, proposed to
"simply start over and build something better". See http://xianblog.wordpress.com/2010/09/13/simply-start-over-a...
They're both designed around completely in-memory arrays, which are passed around by-value with a copy-on-write scheme.
For R there is the bigmemory package for memory-mapped arrays. And the "compiler" JIT package has been included since R 2.13.
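For what it's worth, a minimal sketch of using that byte-code compiler (it's a byte-code compiler rather than a tracing JIT, so don't expect miracles):
library(compiler)
f  <- function(x) { s <- 0; for (i in x) s <- s + i^2; s }  # deliberately loop-heavy R code
fc <- cmpfun(f)           # byte-compile the function
enableJIT(3)              # or turn on JIT byte-compilation globally
system.time(fc(1:1e6))    # compare against system.time(f(1:1e6))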
I've seen that link before. See above re: one group's willingness to talk about the shortcomings versus another organization's preference to paper over it with marketing.
"They're both designed around completely in-memory arrays, which are passed around by-value with a copy-on-write scheme." True, but that doesn't invalidate my point. The datasets I'm typically working with are quite large 300GB-1TB (I have 2TB ram on my main server). I've tried both R and MATLAB and R has been a disaster. Even to plot say 10 million points on a graph is a pain.
> R is currently the lingua franca for academic statisticians. New methods papers, textbooks, and toolkits are much more likely to ship with R libraries and implementations than Matlab or anything else.
Is this the same in industry? I've heard a professor say that SAS is more widely used.
"I want to partake in all the goodness the modern internet has to offer... but only on my terms!"
There was a time when NoJS plugins and the like were valid (necessary). These days you're crippling your browser experience more and more. Which is entirely your prerogative of course, but then don't make sweeping complaints about things not working.
What did you really expect? Them to run your R code server side?
a) This wouldn't allow the same snappy response, you'd have constant page refreshes.
b) Why should they overload their servers and come up with an inferior solution to serve some self-entitled anti-JavaScript snob?
I want to be able to say that I assuredly KNOW what my computer is doing at all times.
There was a time when JavaScript wasn't minified/obfuscated, and the libraries weren't so huge as to require expert knowledge of exactly what was happening under the hood.
I think your claim, that disabling JavaScript is merely invalid snobbery, is as silly as you think my idea of "crippling my experience (and by corollary everyone else's)" is.
You make the assumption that I want the world to bend to my will, but I don't. All I'm pointing out is that the rich experience doesn't degrade gracefully, to a no-script experience. It doesn't even tell you that you need JS enabled, you should just know that already.
I don't mind page refreshes, and I'd put forward that the type of person browsing without scripts enabled would already notice that every other page behaves that way, when scripts are turned off.
If you're going to take the time to emulate the behavior of a statistics application in JavaScript, what's it take to also ensure that users get a similar experience when not enabling scripts? If a user is STUCK with an inferior browser/experience, it's not always their fault. The point I'd like make is that the answer is generally "tough. you're SOL. go get a better computer, and figure it out. try again, from somewhere else."
You make the assumption that I'm being a snob, but as part of testing out any new website, I'll explore how it reacts with no styles, no images and no scripts. You tell me I want an inferior solution, and then call me the snob? How does that work? We'll probably never see eye-to-eye on these tidbits of taste-making, but maybe I'd accept your criticism a little more willingly if you called me pedantic, a paranoiac, or a browser-Luddite. Either way, I'm not taking your contrary opinion personally, but my side of the coin deserves some representation on this site.
I know my opinions about flashy user experience go against the grain, and clash with the opinions of the JavaScript clique here, but I think this kind of criticism belongs in the conversation, even if it gets shouted down. Especially for bigger, well-established learning websites like oreilly.com.