R for the JVM: 60% complete, help wanted!

epistasis · on Dec 20, 2011

The primary reason to use R is the large number of stats packages, which are written in a mixture of C, Fortran, and R. I don't see mention of JNI anywhere, but hooking into Fortran and C seems like the most difficult but most essential piece of making this useful.

Wilduck · on Dec 20, 2011

I agree. The reason I'll drop into R instead of python is often that I need to do some complicated analysis like adaptive multivariate integration over hypercubes (http://cran.r-project.org/web/packages/cubature/index.html) or general maximum pseudolikelihood estimation for multistage stratified, cluster-sampled, unequally weighted survey samples (http://cran.r-project.org/web/packages/survey/index.html).

There are currently > 3500 packages in CRAN, many of which can be found nowhere else. Being able to access java libraries is great, but I can access java libraries from all sorts of different languages. Until it is seamless to install and use these CRAN packages, I'm sticking with the standard interpreter.

bedatadriven · on Dec 20, 2011

Yeah, there are some killer packages that won't be able to run on Renjin without porting their C sources to java/R. That's a bummer.

But others -- like the survey package -- are pure R and run well on renjin. (Just do library(survey, lib.loc='/path/to/R/library') )

The central goal at this point is to support embedding packages like 'survey' in web apps or larger java apps. If you're looking for a seamless user experience for ad-hoc analysis, we're still quite a ways off!

aardvark179 · on Dec 20, 2011

Looking at that project page, is there a reason you went for an interpreter rather than compiling to JVM byte code directly? Obviously it's a little harder if you're targeting JVMs <v7 as you don't have invokeDynamic to play with, but most of that work will be common to an interpreter or compiler. On the project I'm working on we haven't even implemented an interpreter; the REPL simply compiles to class files that are then loaded to evaluate them.

bedatadriven · on Dec 20, 2011

Mostly inexperience :-)

There are a few features of the R language that make direct translation into byte code daunting for a muggle like myself:

1. Computing-on-the-language: R code expects to be able to access and modify the AST and frame of itself, its caller, and other closures.
2. Impure call-by-need argument-passing semantics.

The compiler that's in the trunk is experimental but evolving fast. I think the next step will probably be to start compiling simple but performance-critical basic blocks to byte code at runtime, and then slowly expand the scope of the language that can be handled from there... (Expert advice welcome!!)

aardvark179 · on Dec 20, 2011

Ah right, those are going to make things fun. :-) I think in your position I'd write a compiler, but keep the AST associated with the byte code and back off to an interpreter if the AST is modified, maybe recompile after enough calls without further change. The call-by-need arguments don't sound too bad, but could take some thought on the memoisation strategy, I think I'd do it at the caller and pass those structures through, but I'd want to think about it.
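To make the memoisation idea concrete, here is a minimal sketch of a call-by-need argument built at the caller and passed through to the callee. The names (PromiseSketch, Promise, force, firstTwice) are made up for illustration and are not taken from Renjin or any other project; it also ignores the "impure" part, since a real R promise would carry its environment along with the expression.

```java
import java.util.function.Supplier;

public class PromiseSketch {
    /** A call-by-need argument: the thunk runs at most once, on first use. */
    public static final class Promise<T> {
        private Supplier<T> thunk;  // dropped after forcing so it can be collected
        private T value;
        private boolean forced;

        public Promise(Supplier<T> thunk) {
            this.thunk = thunk;
        }

        public T force() {
            if (!forced) {
                value = thunk.get();
                forced = true;
                thunk = null;
            }
            return value;
        }
    }

    // Counts how often the argument expression is really evaluated.
    public static int evaluations = 0;

    // The callee forces only the arguments it actually uses.
    public static int firstTwice(Promise<Integer> a, Promise<Integer> unused) {
        return a.force() + a.force();
    }

    public static void main(String[] args) {
        // The caller wraps each argument expression instead of evaluating it.
        Promise<Integer> a = new Promise<>(() -> { evaluations++; return 21; });
        Promise<Integer> b = new Promise<>(() -> {
            throw new IllegalStateException("never forced");
        });

        System.out.println(firstTwice(a, b));  // prints 42
        System.out.println(evaluations);       // prints 1
    }
}
```

The second argument would throw if evaluated, but the callee never forces it, which is the behaviour call-by-need requires.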

I'd hardly count myself as an expert, but I think the best win we've had is in thinking carefully about callsite caching strategies and having a eureka moment about just how insanely powerful MethodHandles.exactInvoker can be.
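For anyone who hasn't met it, this is roughly what MethodHandles.exactInvoker does: it returns a handle whose leading argument is itself the target handle, so the target becomes ordinary data that a call site can swap or cache. The sketch below only demonstrates the standard java.lang.invoke API; the class and helper names (ExactInvokerSketch, applyBinary) are hypothetical and say nothing about how our project actually uses it.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class ExactInvokerSketch {
    static int add(int x, int y) { return x + y; }
    static int mul(int x, int y) { return x * y; }

    // Type shared by every target: (int,int)int
    static final MethodType BINARY_INT =
            MethodType.methodType(int.class, int.class, int.class);

    // Has type (MethodHandle,int,int)int: the target is passed as the first argument.
    static final MethodHandle INVOKER = MethodHandles.exactInvoker(BINARY_INT);

    /** Looks up a static method of this class by name and calls it through the invoker. */
    public static int applyBinary(String name, int x, int y) {
        try {
            MethodHandle target = MethodHandles.lookup()
                    .findStatic(ExactInvokerSketch.class, name, BINARY_INT);
            return (int) INVOKER.invokeExact(target, x, y);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(applyBinary("add", 3, 4));  // prints 7
        System.out.println(applyBinary("mul", 3, 4));  // prints 12
    }
}
```

Because the invoker is just another MethodHandle, it can be combined with other handles (guards, argument adapters) when building up a call site.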

codekiller · on Dec 20, 2011

It is mentioned that the project relies on netlib; I assume this takes care of the JNI part.

jacobquick · on Dec 20, 2011

I'm not looking to move things to the JVM anymore; I can't trust Oracle with the future of stuff I may need to update and maintain.

Nrsolis · on Dec 20, 2011

What about Incanter? It seems like a perfect opportunity to gather two languages trying to do the same thing into a single JVM-focused project.

S4M · on Dec 20, 2011

Incanter is quite nice and I really like Clojure (disclaimer: I am still a beginner), but it has far fewer libraries than R. I also think the creators of R did a really good job of making package installation seamless (install.packages(...)) and of having lots of functions pretty well documented, so a statistician who is not a programmer can easily do his work and quickly come up with results. AFAIK this is still unmatched anywhere else.

_delirium · on Dec 20, 2011

Perhaps as importantly, R has significant buy-in in the statistics community, so a paper on a new technique will often be accompanied by an R package implementing it; in fact several journals explicitly prefer R packages for accompanying code, because the reviewers are likely to be familiar with how to use it. That's partly due to its semi-continuity with Bell Labs S (http://en.wikipedia.org/wiki/S_(programming_language)), I believe, which was a language designed by-statisticians-for-statisticians. It's fairly hard to replicate that; it would require considerable effort to migrate the whole community to a new consensus environment.

bigbird · on Dec 20, 2011

Looks like a very interesting project.

Does the JVM environment have any tools that would facilitate building a better R debugger? That's one area of the R ecosystem that could use a serious upgrade IMO.

bedatadriven · on Dec 21, 2011

Eclipse is a pretty good framework for building IDEs; there's already a set of plugins for R (StatET) that support debugging with the original interpreter.

One of the projects for 2012 is to integrate Renjin into StatET, including a line-by-line debugger. Any takers?

zeratul · on Dec 20, 2011

bedatadriven: R native code is usually slow and always memory hungry. Nonetheless, running R on Google AppEngine is very tempting. Could you give us some idea of what the memory usage looks like with Renjin compared to any other R distribution? Here is an example of how to measure memory:

http://heuristically.wordpress.com/2010/01/04/r-memory-usage...

bedatadriven · on Dec 20, 2011

Re: performance of R language code, Renjin is a bit faster there than R2.X, and should get faster the more we get into byte code. (Though Renjin is currently slower in other areas like giant matrix manipulation.)

As for memory usage, I believe object.size() will double-count your input data when it is referenced by the resulting model objects. Better to check memory.profile()

Renjin already benefits from the JVM's state-of-the-art garbage collection, so you may see some improvements even now, but I expect the big difference will come once we roll out non-memory-backed stores for R Vectors. Then your input data could be stored in a database and only partially loaded into memory as needed.
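As a rough illustration of the idea (not Renjin's actual API; every name here, including DoubleVector, ChunkedVector and indexValues, is hypothetical), a vector can hide where its data lives behind a small interface and pull chunks in only when an element is touched. The loader below just computes values, standing in for a database or file read.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

public class LazyVectorSketch {
    /** Minimal read-only view of a numeric vector. */
    public interface DoubleVector {
        int length();
        double get(int index);
    }

    /** Loads fixed-size chunks on demand and keeps the ones already seen. */
    public static final class ChunkedVector implements DoubleVector {
        private final int length;
        private final int chunkSize;
        private final IntFunction<double[]> loader;  // chunk number -> chunk data
        private final Map<Integer, double[]> cache = new HashMap<>();
        public int loads = 0;                        // how many chunks were fetched

        public ChunkedVector(int length, int chunkSize, IntFunction<double[]> loader) {
            this.length = length;
            this.chunkSize = chunkSize;
            this.loader = loader;
        }

        @Override public int length() { return length; }

        @Override public double get(int index) {
            if (index < 0 || index >= length) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            int chunk = index / chunkSize;
            double[] data = cache.get(chunk);
            if (data == null) {
                data = loader.apply(chunk);  // stand-in for a database or file read
                cache.put(chunk, data);
                loads++;
            }
            return data[index % chunkSize];
        }
    }

    /** Test vector whose element i has the value i; chunks are computed, not stored. */
    public static ChunkedVector indexValues(int length, int chunkSize) {
        return new ChunkedVector(length, chunkSize, chunk -> {
            double[] data = new double[chunkSize];
            for (int j = 0; j < chunkSize; j++) {
                data[j] = chunk * chunkSize + j;
            }
            return data;
        });
    }

    /** Works on any DoubleVector without knowing where the data lives. */
    public static double mean(DoubleVector v) {
        double sum = 0;
        for (int i = 0; i < v.length(); i++) {
            sum += v.get(i);
        }
        return sum / v.length();
    }

    public static void main(String[] args) {
        ChunkedVector v = indexValues(10, 4);
        System.out.println(v.get(5));  // prints 5.0
        System.out.println(v.loads);   // prints 1
        System.out.println(mean(v));   // prints 4.5
        System.out.println(v.loads);   // prints 3
    }
}
```

A real store would also need eviction so the cache does not end up holding the whole vector; the sketch leaves that out.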