I’ve claimed a few times that our futures library provides a zero-cost abstraction, in that it compiles to something very close to the state machine code you’d write by hand. To make that a bit more concrete:
- None of the future combinators impose any allocation. When we do things like chain uses of and_then, we aren't allocating at all; instead we're building up one big enum that represents the state machine. (There is one allocation needed per “task”, which usually works out to one per connection.)
- When an event arrives, only one dynamic dispatch is required.
- There are essentially no imposed synchronization costs; if you want to associate data that lives on your event loop and access it in a single-threaded way from futures, we give you the tools to do so.
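To make the "big enum" point concrete, here is a hand-rolled sketch of the kind of state machine a combinator chain might compile down to. All names here (`ConnState`, `Step`, `poll`) are hypothetical illustrations, not the futures crate's actual API; the point is that the whole machine is a single inline value with no heap allocation per stage.

```rust
// One enum variant per stage of a hypothetical read -> parse -> respond
// chain. The whole machine lives inline as a single value.
enum Step<T> {
    Pending,
    Ready(T),
}

enum ConnState {
    Reading { buf: String },
    Parsing { request: String },
    Responding { reply: String },
    Done,
}

impl ConnState {
    // Drive the machine one step; a real executor would call this
    // each time an I/O event arrives.
    fn poll(&mut self) -> Step<String> {
        match std::mem::replace(self, ConnState::Done) {
            ConnState::Reading { buf } => {
                *self = ConnState::Parsing { request: buf };
                Step::Pending
            }
            ConnState::Parsing { request } => {
                *self = ConnState::Responding {
                    reply: format!("200 OK: {}", request),
                };
                Step::Pending
            }
            ConnState::Responding { reply } => {
                *self = ConnState::Done;
                Step::Ready(reply)
            }
            ConnState::Done => panic!("polled after completion"),
        }
    }
}

fn main() {
    let mut conn = ConnState::Reading { buf: "GET /".to_string() };
    loop {
        if let Step::Ready(reply) = conn.poll() {
            assert_eq!(reply, "200 OK: GET /");
            break;
        }
    }
}
```

The size of the enum is known at compile time, which is why chaining combinators grows a type rather than allocating.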
This sounds quite badass and awesome. I'm not sure what other language implementations take this approach, but this is clearly an extremely beautiful, powerful, and novel (to me at least!) concept. Before reading this, I thought Rust was great. This takes it to the next level, though.
Suffice it to say, similar magic tricks have already been at work to make handling errors not soak up a lot of space, and it'd work well with something like this, I think.
In Rust it's frequently the case that slow compilations are dominated by generating and optimizing LLVM IR. The codegen step often takes a while simply because we're generating so much IR.
Rust takes an approach with generic functions called monomorphization, which means that we generate a new version of each function for each set of generics it's instantiated with. This means that a future of a String will generate entirely different code from a future of an integer. This allows generics to be a zero-cost abstraction, because code is optimized as if you had substituted all the generics by hand.
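A minimal sketch of monomorphization: one generic function, used at two types, compiles to two separate specialized functions. The function name is made up for illustration.

```rust
// A generic function: the compiler emits separate machine code for each
// concrete type it is instantiated with (monomorphization).
fn pair_up<T: Clone>(x: T) -> (T, T) {
    (x.clone(), x)
}

fn main() {
    // These two calls compile to two distinct functions, one specialized
    // for i32 and one for String, each optimized independently.
    let ints = pair_up(7);
    let strings = pair_up(String::from("hi"));
    assert_eq!(ints, (7, 7));
    assert_eq!(strings, (String::from("hi"), String::from("hi")));
}
```

Each extra instantiation is more LLVM IR to generate and optimize, which is exactly where the compile-time cost comes from.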
Putting all that together, highly generic programs will generally trend towards higher compile times. With all the generics in play, there tends to be a lot of monomorphization which causes quite a lot of LLVM IR to get generated.
As with many aspects of Rust, however, you have a choice! Rust supports what we call "trait objects" which is a way to take a future and put it behind an allocation with a vtable (virtual dispatch). This forces the compiler to generate code immediately when a trait object is created, rather than down the line when something is monomorphized.
Put another way, you've got control over compile times if you're using futures. If you're taking a future generically and that takes too long to compile, you can instead take a trait object (or quickly convert it to a trait object). This will help cut down on the amount of code getting monomorphized.
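The trade-off can be sketched with a trait other than Future to keep the example self-contained; the mechanism is the same. Both function names below are hypothetical.

```rust
use std::fmt::Display;

// Generic version: monomorphized for every concrete T it's called with,
// which costs compile time but allows inlining.
fn describe_generic<T: Display>(x: T) -> String {
    format!("value: {}", x)
}

// Trait-object version: compiled exactly once; calls go through a vtable
// behind the Box, trading one dynamic dispatch for less generated code.
fn describe_boxed(x: Box<dyn Display>) -> String {
    format!("value: {}", x)
}

fn main() {
    assert_eq!(describe_generic(42), "value: 42");
    assert_eq!(describe_boxed(Box::new(42)), "value: 42");
}
```

Converting a generic value into a `Box<dyn Trait>` early in a call chain is the "quickly convert it to a trait object" option: everything downstream of the box is compiled only once.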
So in general futures shouldn't make compilation worse. You'll occasionally have a choice between performance (no boxes) and compile times (boxing), but that's basically already how things work in Rust today.
Do note that with MIR the focus is polymorphic optimizations - reducing the LLVM IR for all monomorphizations of a generic function, at once.
Unlike C++ templates, Rust enforces a single definition with uniform semantics, for a generic type/function (specialization going through the existing trait static dispatch mechanism), so we can take advantage of that to reduce compile times.
Hm. I've heard arguments that C# or Java is slow for multiple reasons, but never because of the minuscule overhead of a virtual method dispatch when using objects behind interfaces (kind of similar to trait objects).
It's interesting that this is seen as significant here. Are we dealing with much shorter timescales, or just being eager to optimise everything?
Virtual dispatch per se is not terribly slow, as long as the branch is predictable by the CPU. The problem is that virtual dispatch prevents the sort of aggressive inlining and interprocedural optimizations that C++ compilers are known for. C# and Java JITs get around that via runtime analysis and speculative inlining, but that is done at runtime and eats away some of the precious little time available for optimizations.
The cost of a branch misprediction is tens of CPU cycles, on a clock measured in gigahertz (10^9 cycles per second).
The time to turn around a web request is, if you're very lucky and have done the work, dominated by getting a value from an in-memory cache at multiple milliseconds, i.e. on the order of 10^-3 seconds.
If you're not lucky, it's 10s or 100s of milliseconds to generate the response.
That makes the second duration roughly 10^5 to 10^6 times longer. I would not sweat the first one.
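The back-of-the-envelope arithmetic above can be checked directly. The figures here are assumed ballparks, not measurements: a 3 GHz clock, ~15 cycles per misprediction, and a 10 ms response.

```rust
fn main() {
    // Assumed ballpark figures (not measurements).
    let clock_hz: f64 = 3.0e9;            // 3 GHz clock
    let misprediction_cycles: f64 = 15.0; // ~15 cycles per misprediction
    let response_secs: f64 = 10.0e-3;     // 10 ms response time

    let misprediction_secs = misprediction_cycles / clock_hz; // ~5 ns
    let ratio = response_secs / misprediction_secs;

    // The response is on the order of a million mispredictions long.
    assert!(ratio > 1.0e5 && ratio < 1.0e7);
    println!("one response ~= {:.1e} mispredictions", ratio);
}
```

With these numbers the ratio works out to about 2 x 10^6, which is why a handful of mispredicted virtual calls per request is noise.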
As an example, real-time systems are often a giant ball of messy asynchronous code and state machines. Futures can help with that, although lately I have found that sometimes the best, cleanest way to implement a state machine is to make it explicit.
How much do you attribute that to the benefit of creating a high barrier to entry for modifying that code? Could this be summarized as: code that inexperienced devs can't understand, stays performant because they can't figure out how to change it?
I assume at least part of it is that "zero-cost abstractions" is a fairly objective and boolean metric to calculate. "Is this performance impact significant enough to worry about?" would probably result in a lot more bikeshedding.
A tremendous amount of effort has gone into the CLR towards optimizing interface dispatch, because at one time it was slow. Interface dispatches are cached at the call site to avoid real virtual (vtable) dispatch, just like a Smalltalk or JavaScript VM would.
Yeah, using closures in zero-cost abstractions (like iterator or now future adaptors) does make compilation slow. LLVM isn't used to this kind of code, and takes time optimizing it.
However, now that we have MIR we can optimize this on the Rust side first, before the type info has been lost. This was one of the reasons MIR was added :)
I'm... curious about why you think LLVM isn't used to the sort of code that results from these sort of zero-cost abstractions: C++ code is also very aggressive about templates that inline away to nothing, and modern C++ especially has many similar abstractions to Rust (cf. closures, and the range proposal). Rust isn't special in that particular regard.
Maybe you mean LLVM is very low-level and thus it sees everything in the abstractions, which obviously takes time to sort through (and is especially wasteful to do multiple times for different monomorphisations). Thus, using MIR to optimise at a higher level can simplify code more efficiently before it hits LLVM.
I believe he meant lots of closures. Which is not the norm in C or C++, but is the norm in Rust because they are zero-cost.
Without knowing anything about LLVM's internals, I would assume it doesn't anticipate so much closure chaining, and therefore doesn't leverage the fact that they are so easy to inline.
Lambdas in C++ are almost identical to Rust ones (each closure is a unique unnameable struct containing the captures, with an overloaded operator()); in fact, the current Rust scheme was explicitly inspired by C++11 closures. Historically (C++98), it's true that not much code used closures, because they didn't exist at a language level, but the modern language uses them much more aggressively, even pushing hard on making them more and more flexible (e.g. auto/generic closures in C++14). For instance, sort[1] can take a closure, which is more efficient than passing a function pointer and easier to use than a manually implemented custom type.
Also, closures are just structs of captures paired with functions that take them as an argument, so if the compiler can handle those, it can handle closures.
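That desugaring can be written out by hand. The names below (`AddN`, `call`) are illustrative only; the real compiler-generated struct is unnameable.

```rust
// Hand-desugaring of a closure: a struct holding the captures,
// plus a call method taking it by reference.
struct AddN {
    n: i32, // the captured variable
}

impl AddN {
    fn call(&self, x: i32) -> i32 {
        x + self.n
    }
}

fn main() {
    let n = 10;

    // The real closure...
    let add_n = |x: i32| x + n;

    // ...and its rough struct equivalent behave identically.
    let desugared = AddN { n };
    assert_eq!(add_n(5), desugared.call(5));
    assert_eq!(desugared.call(5), 15);
}
```

Since `AddN::call` is an ordinary method on a concrete type, the optimizer can inline it like any other function, which is why closures cost nothing.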
Lambdas in modern idiomatic C++ are also essentially zero-cost, or at least they're supposed to be - they're not heap-allocated, and are strong candidates for inlining.