Scala on LLVM (scala-lang.org)
89 points by DanielRibeiro on June 6, 2011 | 34 comments


I am very sceptical about an approach that concentrates on code generation and punts GC to later. My experience is that you can always improve the code generation later, but the early decisions you make about GC will be with you forever. Things like data representation, whether objects can move, the interface to native code, the concurrency model, etc. are all very dependent on the GC, and once you have built a big system without a moving, precise GC, you have probably boxed yourself in as far as a good memory management design goes.


LLVM has almost no support for GC and it is about the biggest problem that just hits you in the face when you use it.

What I've been doing for my own experiments is very naive (and quite possibly broken). I've been implementing an interpreter with all GC done on the "compile side" of the JIT. What this means is that, to allocate at runtime, the generated code reaches back around and calls functions which are pre-compiled into the interpreter codebase.

The interpreter itself uses linked lists extensively for its symbol table and to hold references to memory that it needs at compile time, and this is hooked into by JIT-compiled code at runtime. That is, there is automatically an object associated with each allocation done at runtime, because there is an object associated with it at compile time, stored in the symbol table. Once a given object is out of scope, and assuming no other object still in the symbol table holds a reference to it, the GC can see that the object can be cleaned up.

[And just writing this, I know it sounds very much like nonsense. After all, once something is out of scope at compile time, what stops the GC cleaning it up before runtime? Indeed, one has to be careful about this. In my case, the source code itself is connected with the symbol table: the source is represented as a linked list, and the symbols in it are the symbols in the symbol table. Until the source is relinquished, the part of the symbol table that is currently in scope is not relinquished either. And of course the source is not relinquished until after it has been "evaluated", i.e. JIT-compiled and executed.]
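
Very roughly, and only as an illustration of the reachability argument (all names below are hypothetical, and this is a Scala sketch rather than whatever the interpreter is actually written in), the scheme looks something like:

    // Illustrative sketch (hypothetical names): every runtime allocation goes
    // through a function pre-compiled into the interpreter, which records it in
    // the compile-time symbol table. As long as the source (and therefore the
    // symbols it mentions) is still held, the entries stay reachable and the
    // collector cannot reclaim them.
    object SymbolTableGc {
      final case class Sym(name: String, var value: AnyRef)

      // The symbol table is a plain linked list, as in the interpreter.
      private var table: List[Sym] = Nil

      // JIT-compiled code "reaches back around" and calls this pre-compiled
      // hook whenever it needs to allocate at runtime.
      def runtimeAlloc(name: String, obj: AnyRef): Sym = {
        val sym = Sym(name, obj)
        table = sym :: table      // keeps obj reachable via the symbol table
        sym
      }

      // Only when the source that referenced a symbol has been evaluated and
      // relinquished is the entry dropped; after that the GC may collect it.
      def relinquish(outOfScope: Sym => Boolean): Unit =
        table = table.filterNot(outOfScope)
    }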

So far I haven't managed to trick it, and it is very efficient. But then again, I'm using the Boehm-Demers-Weiser collector, which is possibly so conservative that nothing is actually getting cleaned up in the examples I am running. I simply don't know.

I do know that this is a technical kludge and (I think) I can see it isn't going to work for a compiler, only for an interpreter.


LLVM in fact already has a VM available.

See http://vmkit.llvm.org/

I don't know why the scala-over-llvm authors didn't simply build on that; it would have been way easier.


Have you looked at how GHC provides a GC for its LLVM backend?


No. I did read an article about the GHC LLVM backend, and it mentions the Cmm language (a variant of C--), but there was no mention of GC.

In fact, this thesis:

http://www.cse.unsw.edu.au/~pls/thesis/davidt-thesis.pdf

seems to state that Cmm implements almost everything except the runtime aspects of C--. In particular, GC and exception handling are handled directly by the Haskell runtime, without the need for direct support in the back end.

I do know some hooks were added to LLVM to support Haskell, e.g. the calling conventions used by GHC.

If you have any references regarding the GHC garbage collection, I'd be interested in seeing them. Of course I can also dig some up for myself.

I assume they do something clever.

(Pretty much everything clever that is missing from Scala, from a mathematician's point of view, is in Haskell or Lisp, I suppose. A shame, because most mathematicians refuse to learn something as mind-bending as Haskell.)

Edit: OK, now based on what I do know, I'm going to take a guess at this. The GHC GC is written in Haskell? Now that's gonna bake someone's noodle.


Lots of Lisps have GCs written in those Lisps, too. But I guess it's harder for Haskell, since you can't just "not cons" and mutate in Haskell.


I'll totally jump back on to the Scala bandwagon once it breaks free of the JVM.

I'd love for Clojure to make the same jump to LLVM; however, it relies far more on the host platform's libraries (the Java/C# libraries).


I'm actually a little confused by this. Isn't one of the major benefits of Scala, Clojure, JRuby, et al. that they are hosted on the JVM?


Yes, it is. And I suspect the Scala people aren't advocating that it move away from the JVM.

However, I can definitely say that, should Scala on LLVM become available with a trivial FFI to native code and close-to-compiled-C performance even for the Scala interpreter (this is definitely possible), I would heartily switch from C to Scala.

I absolutely love the idea of Scala on the LLVM back end. That would open up a whole new world for me, one which currently doesn't exist.

I've recently been trying to design my own language on the LLVM JIT. So far, many of the design decisions I am making are very similar to, or the same as, Scala's, even though I am familiar with many different languages.

What's missing from Scala for me personally, is precisely what is proposed in this paper.


Functional and hybrid functional languages benefit greatly from a highly tuned GC. I don't see LLVM supporting something on par with the JVM's collector anytime soon, so I think you'll unfortunately lose a lot of the speed you see with Scala on the JVM.


The flip side is that the Scala developers could provide a GC tuned to Scala's use case, rather than relying on the JVM's collector, which is focused on Java's.


Absolutely. And a lot of Scala's warts come from its transparent Java compatibility. As much as I like Scala, if I'm going to throw out the one thing that makes it more practically useful than any other functional language, I might as well just use a cleaner and more elegant language like Haskell in the first place.


Haskell has some great features, such as type classes (the power of these has not been fully realised).

In other cases I have mixed feelings. The monad business is sometimes absolutely brilliant (Parsec is a good example) and at other times tremendously painful (dealing with simple IO).

Other features are just a pain in the neck. Very strict type checking means that using types which depend on integer parameters (e.g. m x n matrices) can be a real bugbear. It's apparently possible, but requires some deep magic and incantations (which I don't know).
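
For comparison, a minimal sketch of the analogous idea in Scala 2.13, using literal singleton types (the Matrix type here is made up for illustration); a multiplication only compiles when the inner dimensions agree:

    // Hypothetical sketch: matrices whose dimensions are part of the type.
    final class Matrix[R <: Int, C <: Int] private (val rows: Vector[Vector[Double]]) {
      def *[K <: Int](that: Matrix[C, K]): Matrix[R, K] =
        new Matrix[R, K](rows.map { row =>
          that.rows.transpose.map(col => row.zip(col).map { case (a, b) => a * b }.sum)
        })
    }

    object Matrix {
      // Dimensions are checked once at construction, then carried in the type.
      def apply[R <: Int, C <: Int](rows: Vector[Vector[Double]])(
          implicit r: ValueOf[R], c: ValueOf[C]): Matrix[R, C] = {
        require(rows.length == r.value && rows.forall(_.length == c.value))
        new Matrix[R, C](rows)
      }
    }

    // e.g. in the REPL:
    val a = Matrix[2, 3](Vector(Vector(1.0, 2.0, 3.0), Vector(4.0, 5.0, 6.0)))
    val b = Matrix[3, 2](Vector(Vector(1.0, 0.0), Vector(0.0, 1.0), Vector(1.0, 1.0)))
    val c: Matrix[2, 2] = a * b   // fine
    // val bad = a * a            // does not compile: found Matrix[2, 3], required Matrix[3, ?]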

As much as I personally love Haskell, with its elegant syntax that makes better use of white space than just about any other language, it can be a real struggle to get your mind around its more intricate features. As a result, people in my field (mathematics, specifically number theory) by and large won't touch it.

Scala, on the other hand, is familiar at its core. Anyone who has used a Computer Algebra System and done some basic programming with it can get somewhere with Scala. It makes a handful of pragmatic choices whilst retaining the theoretical soundness of the language. It also offers many advanced features which you don't know you need until you've been programming for a few years.

What has been lacking, however, is an efficient and simple way to interface with the many high quality and high performance native mathematical libraries that are out there.

Python has filled this void. But it's terribly slow for computationally intensive stuff. Its interface with C is dreadful, leading to solutions like Cython, where you lose the interactivity of the Python interpreter. And everything is left to runtime in Python, so there's no theoretical soundness. And from a theoretical point of view, Python makes easy things trivial and hard things really hard. It's an extremely elegant language which looks a lot like the mythical mathematical pseudo-code language. But beyond a certain level of complexity, you very quickly get into hot water.

Aldor got a lot of things right, but no one will use it because of its not-quite-open license. So there's a massive void there which would be neatly filled by Scala on LLVM. It cannot come soon enough for me.

Having said all that, I did once believe that Haskell was by far the best language. It's the only language I ever used where I thought, "oh, finally I am using a grown-ups' language rather than a toy". That was quite an epiphany. But it was followed by the dismal realisation that absolutely no one cared that it was so great and that this was because some of the things that are so natural in an imperative style become a lot like solving a very difficult problem in Haskell, unless someone already solved it for you.


Unfortunately I have to agree. Haskell is a beautiful research language, but not something I want to use for real production work either. Scala is more pragmatic and, IMO, useful. If a workable Scala emerges on LLVM I'll certainly give it a shot.

I actually like OCaml better than Haskell or Scala but sadly it seems to be losing momentum.


I just couldn't get past having different operators for floating-point and integer arithmetic, so OCaml never passed go for me. Imagine a mathematical system in OCaml with polynomials, matrices and many other mathematical objects that you want to multiply. It'd be a zoo. At least Scala allows operator overloading!
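
To illustrate the point (with a made-up Poly type): OCaml distinguishes integer * from floating-point *., whereas in Scala a user-defined type can simply reuse the ordinary symbol:

    // Hypothetical example: polynomials that reuse the ordinary * operator.
    // coeffs(i) is the coefficient of x^i.
    final case class Poly(coeffs: Vector[Double]) {
      def *(that: Poly): Poly = {
        val out = Array.fill(coeffs.length + that.coeffs.length - 1)(0.0)
        for (i <- coeffs.indices; j <- that.coeffs.indices)
          out(i + j) += coeffs(i) * that.coeffs(j)
        Poly(out.toVector)
      }
    }

    // (1 + x) * (-1 + x) == -1 + x^2
    val p = Poly(Vector(1.0, 1.0)) * Poly(Vector(-1.0, 1.0))   // Poly(Vector(-1.0, 0.0, 1.0))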


That is pretty ugly. Really, as much as I like the current generation of FP languages, I think we still have a way to go before we have a serious contender for a mainstream language.


Were there compromises in Scala's design because it targeted the JVM?


Java's type erasure is bad for matching against List[SomeType], because the information about SomeType is erased. This is my biggest nuisance with Scala on the JVM.
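
A small illustration of the problem:

    // The element type of a List does not exist at runtime, so these two cases
    // cannot actually be told apart; the compiler warns that the checks are
    // "eliminated by erasure".
    def describe(xs: Any): String = xs match {
      case _: List[Int]    => "a list of ints"
      case _: List[String] => "a list of strings"   // unreachable: the first case matches any List
      case _               => "something else"
    }

    describe(List("a", "b"))   // returns "a list of ints"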


Scala had a CLR version from pretty much the beginning, so I don't think it was designed with the intention of exclusively targeting the JVM.


The current CLR version is definitely not at production quality like the JVM version, and afaik it never was.

At a recent talk in the Netherlands, Martin Odersky also basically admitted that there's still a fairly long way to go until the CLR version is really at production quality, and even then it will probably keep lagging behind the JVM version.

So surely you're right about intention, but that's also about all it is.


No lightweight threads; one has to resort to an evented approach (and lose the fault tolerance etc., afaik).



I am aware of one that is related to compatibility with Java more than with the JVM itself: no multiple dispatch (at least, this is what Martin Odersky said to me when I asked him why it is not in Scala).


The compiler can only optimize simple self-recursive calls in tail position, so other recursive algorithms can overflow the stack, because the JVM lacks any support for tail calls (see the sketch at the end of this comment).

No good IO/File API (because it needs support from the JVM for certain operations, which will only be added in Java 7).

No good Unicode support (it uses Java's java.lang.String).

Type erasure (although the consequences are not as bad as Java's).

Java's reflection API doesn't agree with Scala code sometimes.
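
Regarding the tail-call point above, a small sketch (made-up functions) of the difference:

    object TailCallDemo extends App {
      import scala.annotation.tailrec

      // Direct self-recursion in tail position is rewritten into a loop by
      // scalac, so this runs in constant stack space.
      @tailrec def countdown(n: Int): Int =
        if (n == 0) 0 else countdown(n - 1)

      // Mutual recursion is not optimized: the JVM has no tail calls, so every
      // call adds a frame and a large enough n blows the stack.
      def even(n: Int): Boolean = if (n == 0) true else odd(n - 1)
      def odd(n: Int): Boolean  = if (n == 0) false else even(n - 1)

      println(countdown(10000000))   // 0
      println(even(10000000))        // throws StackOverflowError
    }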


java.lang.String is UTF-8 by default unless you call getBytes() with a different encoder... how is this not "Unicode" enough? (Honest question.)

Type erasure is a huge pain though.


java.lang.String uses UTF-16 internally. It's a wrapper around an array of 16-bit chars. Thus, it's possible to create a String which is NOT valid Unicode by abusing surrogate pairs.

So the JVM has no native datatype that represents a "valid Unicode string". This is unfortunate, because if java.lang.String did enforce this, it would let us make some helpful assumptions.


Well, every single way to read/write that string to bytes is Unicode by default, unless you go out of your way to plug in a different encoder.

What are you trying to do, export a pointer and write the raw bytes to some destination while assuming it's correct Unicode? If you're doing something that low-level, it's always possible to corrupt your data and end up with invalid Unicode; just set an invalid byte/rune somewhere in that byte string. Direct memory access always throws guarantees out the window.

It is annoying that getBytes() has to allocate and fill a byte array because of the mismatch between char and byte, but you can work around it when necessary, and that's not really related to "not being Unicode enough"; if anything it's "too Unicode", with the insistence on the char type for the internal structure.


No, one of the constructors for String takes only char[] as a parameter. You can pass in an arbitrary array of chars, even invalid UTF-16.

You're correct that well-written code should never do this. However, there is no guarantee that some library you're using doesn't. You can never assume that 'new String(oldString.getBytes("UTF-8"), "UTF-8").equals(oldString)' holds, which has some unfortunate side effects if you're doing anything involving serialization and equality.

I agree that Java's String API is generally quite well-designed, but the ability to access the raw UTF-16 is a very big leak in the abstraction.
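
A small REPL-style demonstration of the round-trip failure described above:

    // A String built from a lone high surrogate is not valid Unicode text.
    val broken = new String(Array('\uD800'))

    // Encoding to UTF-8 substitutes the unpaired surrogate (with '?'), so the
    // round trip does not give back the original string.
    val roundTripped = new String(broken.getBytes("UTF-8"), "UTF-8")
    println(roundTripped == broken)   // false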


If that ability were lacking, other people would be complaining about it. Abstractions should not prevent you from accessing the bits underneath: they should make it unnecessary. Which they never completely succeed in, because there are always fringe use cases you didn't foresee.


I would have sworn that it was UTF-16 (which is still Unicode enough, btw).


Java chars are 2 bytes, but the default String encoding when serializing to/from byte[] arrays and streams is UTF-8 (on Unix at least; it may vary on Windows).


Uh, not sure why this comment was controversial?


If this works out, it basically also means Scala->JavaScript through emscripten (https://github.com/kripken/emscripten).

Slow, probably, and probably with nearly impossible interfacing with the DOM and JS libraries, but still, pretty cool.


I'm really interested in that idea. It would create the possibility of cleaning up the remaining Java cruft, like "every object is a monitor", clone vs. Cloneable, ...



