Tell HN: We are trying to get tail calls into the WebAssembly standard
625 points by apignotti on July 12, 2022 | 290 comments
WebAssembly is a modern bytecode supported by all browsers and designed to be a compiler target for a wide variety of programming languages.

To effectively support some forms of functional programming, support for tail calls has been proposed as an extension to the WebAssembly standard.

This proposal reached Phase 3 of the standardization process years ago, but has since stalled.

Phase 3 is known as "the implementation phase", and the prerequisite for advancing the proposal to Phase 4 is support in two different browser engines. V8/Chrome support has been available for a long time, so another engine is required.

To unblock this situation we have contributed full support for WebAssembly tail calls to JavaScriptCore/WebKit/Safari. The PR is available here:

https://github.com/WebKit/WebKit/pull/2065

An in-depth article about the challenges of implementing this feature is also available. It is intended both as documentation for our contribution and as a general explainer of how tail calls actually work, with a particular focus on stack space management.

https://leaningtech.com/fantastic-tail-calls-and-how-to-impl...




Sorry if I'm missing something obvious, but how is this not solvable by the compiler?

I'm a huge functional programming evangelist, but high-level stuff like this does not belong in a low level language bytecode like WASM.

Wasm should only care about two things: Security and Performance.

With the standard blowing up like crazy we'll get neither. Worse, we'll cement the current duopoly of browser engines, because we'll make it (again) so complex that no one can create an alternative.

We shouldn't have GC, Exceptions or Tail calls in WASM, as long as the compiler can provide them.

What we should have is multi memory support and memory read permissions, because without them WASM lacks basic security primitives.[1]

Reference types maybe. But I've yet to see a convincing argument as to why they are absolutely necessary.

Everything else (as long as it can be done by the compiler) is just bloat.

1. https://www.usenix.org/conference/usenixsecurity20/presentat...


Tail calls are fundamentally something that the compiler _cannot_ solve. Trust me, if there were a way, we would have spared ourselves all this work.

The issue is that to avoid stack blow-up you need the engine to recycle stack frames. You might argue that this could happen implicitly in the VM, with all the calls in tail position being automatically converted to tail-calls. The problem with this is that in WebAssembly (and JavaScript) the call stack is observable via the .stack property of thrown exceptions.

Since an implicit conversion would be observable, engines cannot simply optimize the problem away, and that is the reason why new opcodes (return_call/return_call_indirect) are actually required at the WASM level.

For the specific case of direct calls (the return_call opcode), the compiler could solve the problem by inlining, with some luck. But for the case of return_call_indirect there is no other possible solution.
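
To make the indirect case concrete, here is a rough C sketch (names made up): the call target is chosen at run time through a function pointer, so no amount of inlining can remove it.

    #include <stdio.h>

    typedef long (*cont)(long);

    long done(long acc) { return acc; }
    long twice(long acc);

    long pick(long acc) {
        cont k = (acc > 1000000) ? done : twice;
        return k(acc);        /* indirect tail call: target unknown at compile time */
    }

    long twice(long acc) {
        return pick(acc * 2); /* direct tail call back into pick */
    }

    int main(void) {
        printf("%ld\n", pick(1));
        return 0;
    }

This is the shape of code that needs return_call_indirect: the engine has to be told that k(acc) must not grow the stack, because the compiler cannot know which function k will be.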


Heya,

(1) Thank you for implementing this in JSC!! I hope they take it, it makes it into Safari, and the tail-call proposal advances.

(2) I don't think you are exactly right about the call stack being observable via thrown exceptions. There's no formal spec for the v3 exceptions proposal yet, but in the documents and tests, there's nothing that would change in WebAssembly core to make the call stack observable. (There's no ".stack property" that would be added to Wasm itself.) It's true that the proposal amends the JS API (but only the JS API) to describe a traceStack=true option; from Wasm's perspective I understand that's just an ordinary exception that happens to include an externref value (just like any other value) to which Wasm attaches no special significance. The Web-based engines can attach an informative stack trace if they want, but there's no requirement preventing frames from having been optimized out. The non-Web engines won't have to think about this.

(3) I think the real reason that a Wasm engine can't implicitly make tail calls proper is that the spec tests forbid it, basically because they didn't want the implementation base to fragment by having some engines perform an optimization that changes the space complexity of a program, which some programs would have started to depend on. (The spec tests say: "Implementations are required to have every call consume some abstract resource towards exhausting some abstract finite limit, such that infinitely recursive test cases reliably trap in finite time. This is because otherwise applications could come to depend on it on those implementations and be incompatible with implementations that don't do it (or don't do it under the same circumstances.)")

But the issue is much weaker than "call stack is observable" -- it's more like "infinite recursion must trap eventually, but it can be nondeterministic when."

More discussion here: https://github.com/WebAssembly/spec/issues/150


> Implementations are required to have every call consume some abstract resource towards exhausting some abstract finite limit

If an implementation really wanted to, it could get around this by incrementing a "function call counter" that traps at, say, 2^64, rendering the requirement effectively moot.

I feel like these kinds of situations are where being practical might make more sense than being mathematically precise. Something like "each function call must consume at least N bits of memory" or something concrete like that. Or heck, "implementations may not perform tail-call optimization" or even "implementations must be able to reconstruct the full logical call stack at any point".


> each function call must consume at least N bits of memory

This is going to be a problem for any long-running recursive program. Why would we mandate this?


You don't want it to be implementation-specific whether tail call optimization is performed or not.

The nightmare scenario is that you spend months writing a program that works great in browsers A and B, but when you roll it out to prod it immediately fails in browser C, which doesn't perform TCO.

You only want to write code that depends on TCO if you can get a guarantee that TCO will be performed. This is the motivation for the clang::musttail attribute in C/C++ (I implemented musttail in Clang): https://clang.llvm.org/docs/AttributeReference.html#musttail
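
For illustration, a minimal sketch of how the attribute is used (hypothetical function; needs a recent clang):

    /* With musttail, the compiler must reuse the caller's frame or
       refuse to compile -- it never silently falls back to a growing
       stack. */
    long countdown(long n, long acc) {
        if (n == 0)
            return acc;
        __attribute__((musttail)) return countdown(n - 1, acc + n);
    }

    int main(void) {
        return countdown(1000000, 0) > 0 ? 0 : 1;
    }

clang also checks that the caller's and callee's signatures match, since the caller's frame is reused in place.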


This might be a silly question, but why is that a nightmare scenario? The exact same argument could be made for any compile-time optimization. If one compiler inlines a function where another doesn't, or unrolls a loop, or performs better constant propagation, any of those could impact the resource usage between the two compilers, leading to exactly that same scenario. But I wouldn't want to forbid improvements altogether out of a desire for consistency.

To me, I'd see a distinction between code that requires a specific optimization and code that could benefit from that optimization. If the standard forbids an optimization to avoid the former, it also removes the latter.


That is exactly why tail call "optimisation" is not, in fact, an optimisation.

Indeed much of this discussion, including the title and text of the original post, carefully avoids using the word "optimisation". Somehow it got introduced at some point in these comments.

To flip it the other way: consider a normal (non-tail-call) program and choose a local variable in a routine that happens to be called quite a lot. Replace it with a list that you append to every time you call the routine, even though only the last entry is ever examined. The program will leak memory, potentially quite quickly, and eventually crash. Is it fair to say it's just less optimised than before? I would say it's worse than that: it has an actual bug.

That's exactly the situation in a program created with the assumption of tail calls that is then used in an environment without them.
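
To make the analogy concrete, a sketch in C (purely illustrative):

    #include <stdlib.h>

    struct node { long value; struct node *prev; };
    static struct node *history = NULL;

    /* history->value plays the role of the old local variable: only
       the newest entry is ever examined, yet every call grows the
       list -- exactly how stack frames pile up without tail call
       elimination. */
    long routine(long x) {
        struct node *n = malloc(sizeof *n);
        n->value = x * 2;
        n->prev  = history;
        history  = n;          /* appended on every call, never freed */
        return history->value;
    }

    int main(void) {
        long last = 0;
        for (long i = 0; i < 1000000; i++)
            last = routine(i); /* steadily leaks memory */
        return (int)(last & 0xff);
    }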


> Replace it with a list that you append to every time you call the routine, even though only the last entry is ever examined. The program will leak memory, potentially quite quickly, and eventually crash.

I had to think on this example a bit, but I believe these programs exist and are quite common. This is an accurate description of any program in which memory is allocated and never explicitly freed. Some languages include runtimes that perform garbage collection, but until the garbage collection runs, the program is leaking memory and would crash without the garbage collection.

Which is to say, I think I agree with you, but had to chew over the example a bit. That it isn't a quantitative difference in speed or memory usage, but a qualitative difference in the types of programs that are acceptable in a language. Thank you for it.


> Is it fair to say it's just less optimised than before? I would say it's worse than that: it has an actual bug.

Yes, a program that depends on tail call optimization, despite having no guarantee of tail call optimization being performed, has a bug.

That doesn't mean that tail call optimization is not an optimization. In languages that allow, but do not guarantee, tail call elimination, it is an optimization performed by the optimizer that makes a program run more optimally than it would have otherwise. There is a long history of referring to it as an optimization, for example in the docs of LLVM: https://llvm.org/docs/LangRef.html#call-instruction

I don't think it makes sense to redefine the word "optimization" such that it excludes examples where a program could erroneously depend on a certain optimization being performed.


OK, if you prefer to be precise, there are two situations:

(1) In languages (or runtimes) where tail calls are not guaranteed to be elided, tail call elision is an optimisation. In that context, you could quite reasonably describe it as tail call optimisation. Programs that are written in these languages and rely on tail call elision are buggy.

(2) In languages (or runtimes) where tail calls are guaranteed to be elided, tail call elision is not an optimisation. In that context, you cannot reasonably describe it as tail call optimisation. Programs that are written in these languages and rely on tail call elision are not buggy (at least, not for that reason!).

This conversation is about making tail call elision a guaranteed feature of WASM. Therefore, in this context, it is not reasonable to describe tail call elision as an optimisation.


Most optimizations only offer constant factor improvements. Tail call optimization is the difference between O(1) and O(N) stack size.


Tail calls can be optimized at the source code level too. When I first read about them decades ago I thought it was a stupid notion, because I did not yet understand how a good compiler can optimize them away. Then I wondered why certain people felt the need to write code that way. I still wonder that.


I wrote an article about my motivation for it: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...


Thanks for that. Having written a threaded interpreter before, that immediately made sense. It does seem like you used some other hints to the compiler as well, so this would fall under "experts only" code, and that's OK. I still wonder how well good old "goto" might do, but once you move away from a huge switch/case to individual functions I can see how it might help a bit.

I still don't understand why some people seem to like the idea of iterating simpler things with recursion using a tail call.


> The exact same argument could be made for any compile-time optimization.

No.

For other optimizations, the difference is the speed of execution. For tail call optimization, the difference can be between normal execution and a stack overflow.


Any optimization that removes an unnecessary stack allocation could also be the difference between normal execution and stack overflow, due to decreasing the size of the stack frame.


But the effect is much more likely in tail recursive cases.


The reason for the existing mandate is to be a problem for recursive programs, so it is a feature, not a bug, of the proposed refinement that it more reliably achieves that.

(Specifically, to prevent one implementation getting an optimization for them that leads people to rely on it, making code that blows up in practical use on other implementations, despite both being nominally correct implementations of the same spec.)


Try writing a loop over all 64 bit values. The computer will break before the loop finishes.


It is implicitly already mandated, depending on how finite your finite limit is.


> The spec tests say: "Implementations are required to have every call consume some abstract resource

It's a nice detail for you to highlight, but surely that exists in the test spec specifically because the opcode spec doesn't mandate tail-call handling. If the latter did (a la Scheme), then the test spec would be updated to no longer have this note...?


The proposed spec is to have an explicit tail call instruction separate from the regular call instruction for which resources must be used.

I think this would be the correct design even if tail calls were desired from the start. It makes for a better debugging experience and allows language implementations more control. (A trend in some functional languages is to require explicit annotation when tail recursion is desired, both to improve debugging for regular functions and to allow the compiler to check that you don’t accidentally stop being tail-recursive)


Fair enough. Sounds more like an issue with .stack being an observable property though.

I've long suspected that exceptions making it into WASM is its million-dollar mistake, and this further confirms it.


I haven't messed with .stack, but on first impression I agree with you that it seems like a mistake.

Even if it weren't observable, though, I think guaranteed tail-call elimination, via either a new opcode or a required property of regular calls in the spec (we already missed that ship, of course), is important. Without it, proper compilation and execution of some languages on wasm would depend on an otherwise invisible property of the engine. That is, tail call elimination isn't just an optimization; it's a critical aspect of language semantics that must be guaranteed before it can be relied on.


Whether it is a mistake depends on the point of view. What seems more important: exceptions or tail-call elimination? Most modern languages use exceptions in one way or another; conversely, very few use the latter feature. From the point of view of existing software it is way more important to optimize VMs for exception handling than for tail call elimination.


Well wasm doesn't have exceptions right now either! The exceptions proposal is also in phase 3, like tail-calls. Right now wasm just supports traps, which can't be caught or inspected from within wasm code, and don't actually need to record stack frames (but the JavaScript host does in web browsers). I'm not sure what benefit exposing the wasm part of the stack for traps gives the JavaScript host side except perhaps easing debugging? (though that is important, I think there are other valid solutions for that)

To be clear, I do want wasm to support the major language use cases, and I think implementing exception support is a good (though tricky, see the exceptions proposal) idea.


After reading the discussion, I think I'm much more in favour of adding tail calls, than adding exceptions.

Actually, I wonder if one couldn't adapt the tail-call mechanism slightly to also serve as an exception mechanism. I think you could do both by allowing `br` to jump to an arbitrary previously established label. An exception would then simply be a "tail call" that unwinds more than one stack frame and calls into the "catch".

Not sure how well JITable this would be, but I think an arbitrary backjump and call should be mostly fine in terms of control flow?



Calling into an external JavaScript function, which will then throw a JavaScript exception (as the linked documentation states).

Wasm cannot throw the exception itself, nor can it catch the exception. How exceptions raised in code called via FFI into another language are handled is a property of the embedding and the wasm engine, not of wasm itself, which (as of yet, barring that exceptions proposal) has no concept of them besides unrecoverable traps.


Great thank you for the info


Would be interested in what other solutions you envision. In production, error reporting to services like Sentry relies on getting access to a stack, and unlike in a native runtime there is nothing one can do to create a stack without the engine's support.


If you control the compiler, you don't need the engine's support for exceptions (though you might want it for performance reasons) - emitting function prologues and epilogues that maintain debugging information like a call stack (or call trace in a ring buffer like some Schemes) would be one way.

I actually did this in my purely functional language, but had it off by default for performance reasons - if you hit an error at runtime, the debugging code would re-execute from the last checkpoint with debugging information turned on to construct the missing debug data, something I could only do because the language is side-effect free.
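
As a sketch of what such emitted bookkeeping could look like (all names hypothetical, C for concreteness):

    /* The compiler inserts shadow_push/shadow_pop into every prologue
       and epilogue; on an error, shadow_stack[0..shadow_top) is the
       logical call trace. */
    #define SHADOW_MAX 4096

    static const char *shadow_stack[SHADOW_MAX];
    static int shadow_top = 0;

    static void shadow_push(const char *fn) {
        if (shadow_top < SHADOW_MAX)
            shadow_stack[shadow_top] = fn;
        shadow_top++;
    }

    static void shadow_pop(void) { shadow_top--; }

    long fib(long n) {
        shadow_push("fib");                /* emitted prologue */
        long r = (n < 2) ? n : fib(n - 1) + fib(n - 2);
        shadow_pop();                      /* emitted epilogue */
        return r;
    }

    int main(void) { return (int)fib(10); }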


It's more important for exception handling to be implementable than for TCE to be implementable, yes. But it's more important for TCE to be provided by the virtual machine (or, alternatively, for call and return to be manipulations of user-level state, as on traditional CPUs, rather than magic privileged operations as in Wasm) because you can implement exception handling with TCE but you can't implement TCE with exception handling.


Maybe I'm naive, and I surely don't know anything about the WASM spec, but aren't CPUs implementing both exceptions and tail call elimination as gotos (some jump machine code instruction)? Exceptions might have to pop some frames; TCE doesn't.


To avoid some classes of security issues, and also because it started out much closer to asm.js in features, WASM uses structured programming control flow like blocks, if..then, and break statements. Functions can only be called by the call instruction right now, so this proposal is needed to jump to a function body.


Are you saying there is no goto in wasm? Genuine question, I don't have any knowledge of WASM.


Yes, that is correct.


Plenty of C/C++ code de facto assumes it’s present even though it’s not mandated. Anybody trying a list algo in recursive style is relying on it implicitly.


The point is to run compiled langs like C++, though. Technically x86, ARM, etc. don't directly support exceptions either, but I assume WASM also has to define an ABI between WASM modules, and ABIs usually are exception-aware. Or did they get added for some other reason?


> Tail calls are fundamentally something that the compiler _cannot_ solve

Possibly a stupid question as I haven't given this much thought, but I thought tail call elimination could be used to convert recursive calls in tail position into loops. Could a compiler not do this (like Scala does, for example)?


Tail-call elimination applies to any function which calls another function as the last action.

Turning this into a loop is only possible when the function calls itself, not when it calls another function, and not (naively) when two functions call each other in the tail position.
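
The textbook mutual case, sketched in C: neither function can be turned into a self-loop on its own, because each one's "loop body" lives in the other function.

    int is_odd(unsigned n);

    int is_even(unsigned n) {
        return n == 0 ? 1 : is_odd(n - 1);  /* tail call to a different function */
    }

    int is_odd(unsigned n) {
        return n == 0 ? 0 : is_even(n - 1); /* and back again */
    }

    int main(void) { return is_even(10) ? 0 : 1; }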


Understood, but in practice how often does that really happen? (I'm aware of the contrived 'odd'/'even' example always given, but in the real world, I never had anything like that). Even when I wrote more functional code I rarely (if ever?) did that. 98-99% of the time it was the same function calling itself. Is your experience different?


Iteratees are immensely useful in practice, and they involve doing this all the time. Basically you do stream processing with a source and a sink that are mutually recursive: the source emits an element by calling the sink to process it; the sink processes it and then recurses into the source for the next element. It feels a bit forced if you describe it like that, but it gives you a model with lots of natural structure - it's the only approach to stream processing I've found where you can have a value that represents a stage in the middle of processing a stream (not just a 1:1 function but something that does grouping or suspension or the like) in a natural way, that you can then test standalone, etc.
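
A stripped-down sketch of the shape of it, in C for concreteness (names made up; a real iteratee library is far richer):

    #include <stdio.h>

    void sink(long value, long remaining, long total);

    /* Source and sink hand control back and forth; every call is in
       tail position, so with guaranteed tail calls this runs in
       constant stack space. */
    void source(long remaining, long total) {
        if (remaining == 0) {
            printf("total: %ld\n", total);
            return;
        }
        sink(remaining, remaining - 1, total);  /* emit one element */
    }

    void sink(long value, long remaining, long total) {
        source(remaining, total + value);       /* pull the next element */
    }

    int main(void) { source(1000000, 0); return 0; }

Without tail calls that is two million stack frames for a million elements.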


> the source emits an element by calling the sink to process it, the sink processes it and then recurses into the source for the next element

I think I understood that concept but it still seems a bit strange to me. Doesn't that imply a potentially infinite stack of function calls?

I've seen this idea in other functional program examples as well. Instead of having a sequence of instructions, they have a sequence of function calls. Instead of a function returning to the caller, it calls the next function in the program which is analogous to the virtual machine moving on to the next line of code in the sequence. It implies the program's structure and state is actually expressed within the function call stack and its frames. I admit I'm not sure what the purpose of this is.


> I think I understood that concept but it still seems a bit strange to me. Doesn't that imply a potentially infinite stack of function calls?

Yes, which is exactly why tail calls are so important. Otherwise you'd stack overflow after processing the first few thousand elements of your stream.

> I've seen this idea in other functional program examples as well. Instead of having a sequence of instructions, they have a sequence of function calls. Instead of a function returning to the caller, it calls the next function in the program which is analogous to the virtual machine moving on to the next line of code in the sequence. It implies the program's structure and state is actually expressed within the function call stack and its frames. I admit I'm not sure what the purpose of this is.

Well, if you want to express your function as an actual function, in the mathematical sense, recursion is often the natural way to do it. Things like the call stack are an implementation detail; often a natural way to describe a function is "do this, do that, then do the same thing for all the rest of it", and usually the easiest way to translate that into a program is recursion.

It also tends to be more compositional, because you don't have to worry about control flow (or at least, control flow isn't a special case: all your control flow is just function calls). For example if you are calling several nested functions in an imperative-style loop, there's no way to break out of that loop from the inner function. But if you write the same thing as several mutually recursive functions, you can "break out of the loop" by just not recursing under particular conditions.


Thanks, I think I have a better grasp now.


> Understood, but in practice how often does that really happen?

In functional code written where general tail call elimination (or special tail call operations) are available, non- or mutually-recursive tail calls are not uncommon.

Heck, non-recursive (potentially fairly nested) tail calls are common in languages without any optimization for it. Anytime you see something like “return f(x)” (or, because operators are implemented via special methods, even “return x+y”) in Python, for instance, you've got a tail call.

If it isn't direct or mutual recursion, it has to be a pretty degenerate case to blow the stack even without elimination, so elimination isn't strictly necessary for those cases. But those cases still benefit from not having the extra memory usage and from reducing the pushing of stuff onto and popping it off of the stack.


Happens a fair bit of time when parsing deeply nested structures (e.g. parsing deeply nested JSON structures) using mutually recursive parsers. I've generally resorted to either trampolining or explicit stacks.


This is usually a case where TCE doesn't help, in my experience, because most of the calls to sub-parsers aren't in tail position. This is especially unpleasant when you get bug reports of mysterious segfaults that turn out to be stack overflows...


Yep, a basic security check for any endpoint that accepts JSON: what happens when you hand it 4KiB of '[' characters


> parsing deeply nested JSON structures

How likely is it that the depth of the JSON structure would exceed the maximum depth of the stack? This doesn't appear to be a compelling example to me.

It probably is a shame that it's called tail call optimisation since nobody wants to mandate low level details for an optimisation.


> How likely is it that the depth of the JSON structure would exceed the maximum depth of the stack?

Not that unlikely, I'd say, at least if the stack size is unreasonably small.

For example, on nodejs, JSON.stringify() chokes with a nested array of 10K depth:

  > N=10_000; JSON.stringify(JSON.parse("[".repeat(N) + "]".repeat(N)))
  Uncaught RangeError: Maximum call stack size exceeded
      at JSON.stringify (<anonymous>)
Curiously, JSON.parse() is OK with arrays of 10M depth (and probably more). A difference of over 3 orders of magnitude between functions that seem pretty symmetrical at first glance.

Imagine that you build some software that lets users nest things (e.g., a circuit designer, a visual programming language, a UI builder), and you want to save these nested structures as JSON, or process them recursively. Pretty natural fit, I'd say. Now, as a user, I'd be pretty upset if my magnum opus of a very complex circuit, or giant visual program, or very busy beast of UI art, started crashing the program when exceeding a mere 10K elements. My computer has gigabytes of memory! And, presumably, there's a lot still available. But nope, a meager stack can ruin all the fun.

I personally think this is a pretty silly limitation to have on software, especially given today's memory sizes.


TCO likely wouldn’t help here, as the parser is probably using non-tail recursion to build the nested arrays. (I haven’t looked at the code, but that’s the natural way to do it with a recursive descent algorithm.)


It’s a security/DOS vulnerability if the object depth can be used to trigger stack overflow.


For example, pretty much any higher-order function that wants to do some decision-making and then hand over execution to one of several provided functions will want to do that handover by means of a tail call, so that it doesn't unnecessarily change the stack complexity of whatever algorithm is being run around that piece of code.


One example would be in a parser, as various parsing functions would be invoked recursively until end of input, or until some non-matching input is encountered. In pretty much any scenario where you would use recursive functions, it is quite common to have mutual recursion between multiple functions. Not always of course, which is where the tail call elimination optimization is typically applied, but not being able to freely use mutual recursion to arbitrary depth turns out to be quite an annoying limitation.


State machines can often be implemented with a set of functions that each handle a single state and then tail-call into the function for the next state.
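
For instance (a made-up machine, sketched in C): with guaranteed tail calls every transition below is a jump and the machine runs in constant stack space; without them, each transition burns a frame.

    #include <stdio.h>

    void state_running(long ticks);

    void state_stopped(long ticks) {
        printf("stopped after %ld ticks\n", ticks);
    }

    void state_paused(long ticks) {
        state_running(ticks);            /* tail call: resume */
    }

    void state_running(long ticks) {
        if (ticks >= 3) { state_stopped(ticks); return; }
        if (ticks % 2)  { state_paused(ticks + 1); return; }
        state_running(ticks + 1);        /* tail call: next tick */
    }

    int main(void) { state_running(0); return 0; }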


This is just a fancy form of spaghetti code. Don't do it unless you actually need the performance.


State machines are known for a predilection to use `goto` extensively, but that does not make them 'spaghetti code'.

A state machine is, in fact, a well-understood method of applying structure to spaghetti.


The point is that tail calls obscure the transitions between states by spreading it everywhere.


I'm not too clear on how one would implement a state machine without "spreading the transitions everywhere". Can you share a link to some example code showing the format you prefer?

Typically, we'd see the code that implements the logic for each state gathered together. Since this is where the outgoing transitions get decided, the transitions end up being distributed across their originating states.


erlang/otp gen_fsm maybe?

you define a function with a clause per state (and some other args including the incoming message), and each clause returns a tuple with the new state and some other stuff. the loop is part of the framework, and handles generic otp stuff like updating code in a running system.


That doesn't compute for me. You'd need to give an example for me to understand you.


> 98-99% of the time it was the same function calling itself

Well, if the same function is in the "tail", tail call elimination would still apply.

But a lot of the time tail call elimination is only needed in some extreme cases; it can eliminate a large class of stack overflow exceptions. C#'s new System.Text.Json would greatly benefit from tail call ops, because it often has a call chain that is recursive but does not call itself (it's mostly like ReadObject, ReadObjectAsync, ReadObject, ReadBla, ReadObject, etc.), and if the object has a ref to itself System.Text.Json often blows up.

This can also happen in Java and Scala, by the way (and I've already seen it); however, at least you can configure the stack size there and thus most often mitigate the problem somewhat.

It's an edge case, but one that would be cool to fix, since it often makes code more reasonable, especially parsers.


In languages like Scheme, Haskell, OCaml, and Erlang it happens all the time, often when the user isn't aware of it; even when these languages have special looping constructs they are often implemented in terms of tail recursion, sometimes mutual tail recursion between several anonymous subroutines.

In languages like Scala, Java, C#, and Swift, it basically never happens.


> In languages like Scala, Java, C#, and Swift, it basically never happens.

Scalaz and cats use tail calls extensively and wouldn't be possible without them.


How does that work on the JVM?


Scala only does single method tail recursion and rewrites it into a while loop. Cats and other libraries use a technique called trampolining which basically moves the frames to the heap (or a mixed technique where they do recurse a certain depth on the stack before switching to trampolining).


That sounds like it means that Scala doesn't support general tail-call elimination, so people don't write mutually tail-recursive code, instead transforming things that would naturally be expressed that way into a different structure that Scala does support.


Yes. It can be a major PITA in certain problem spaces, and I think most people don't pretend it's a great solution but ultimately accept it as a limitation of being hosted on the JVM.


In functional languages, recursive data structures are often most naturally walked with recursive functions. Languages like Haskell and ML take this to its natural conclusion and use it systematically. For example, a linked list. Here's how the Haskell "last" function, to get the last item in a linked list, is implemented in the standard library:

    last [x]                =  x
    last (_:xs)             =  last xs
    last []                 =  errorEmptyList "last"
If given a list with a single item, return that item. If given a list with two or more items, call itself with the list minus its head. If given an empty list, throw an error.


This is the scenario I am used to, where "last" is called in tail position calling itself, but I see from the other answers that there are definitely valid use cases for non-same-function recursion as well.


It's useful anytime you can benefit from the performance advantages of replacing a full fledged function call with little more than a jump. One classic example is implementing an efficient VM.


If you’re implementing a compiler for a language like Haskell, then everything is compiled into mutually-recursive tail calls.


If f1 calls f2, which calls f1, why can't the compiler inline f2 into the body of f1? It's not so complicated if the compiler has annotations or other hints to guide it.


I once wrote an interpreter in continuation-passing style, meaning every final line of a function called the next function in the interpreter, passing a continuation and error continuation on.

Such a program never returns until it finishes interpreting, and can make an unlimited number of such calls.

This is a bad idea without TCE.


Why couldn't the compiler inline the whole interpreter?


Because in the continuation-passing style the functions-calling-functions are circular in many places, inlining would explode rather than halt.


You can turn it into a loop with a case statement inside if you collect the connected component of tail called functions and make those the cases. But it’s gonna be a big function.

You can split it up a bit if you use a trampoline, but then you lose some efficiency.


Though at that point you're basically just one step removed from making your own stack frame.

Which would definitely work, no doubts about that, but it might be nicer to just get it handled at a more fundamental level.


Scala has a weak version that handles self-tail-calls only. It causes significant trouble for e.g. iteratee/streaming libraries in Scala (like FS2) where you always have to use a trampoline.


I don't quite get your point here. Suppose I write a compiler to compile Scheme to WASM. I would have to write it so that it does proper tail calls, otherwise it wouldn't be Scheme. What's stopping me? Would the browsers not run the code? Like, yeah, it would change .stack, but who cares?

Same thing applies, I think, to any other language compiled to WASM. C/C++ compilers regularly inline huge amounts of the code when optimizations are turned on, as well as do tail-call optimization. I haven't tried, but I would assume that they do that for WASM just as they do for x86 or ARM or whatever other build target I choose. As long as my users are ok with this (and they presumably are, otherwise they wouldn't turn optimizations on), what's the problem, exactly?


> I don't quite get your point here. Suppose I write a compiler to compile Scheme to WASM. I would have write it so that it does proper tail calls, otherwise it wouldn't be Scheme. What's stopping me?

Stack overflows are stopping you, if you implement Scheme calls as Wasm calls.

In Scheme, tail call elimination is not merely an optional "optimization", so the compiler can't opt to not do it when it's inconvenient.


The control flow in the WebAssembly virtual machine is quite restricted (it's called "structured control flow" in the spec); you can't jump to code freely. Possibly you could compile Scheme by having the whole program in one big wasm-level function and branching between blocks there?


> yeah, it would change .stack, but who cares?

You might enjoy participating in an interoperable standardization process sometime.


I don't mean to minimize anybody's work, I'm genuinely curious about why it can't be done. Like, what's stopping me from making a compiler that optimizes tail calls?

For instance: the C and C++ standards have very strict rules for how to handle floating point math, and compilers aren't allowed to deviate from that according to the standard. Which turns off all sorts of cool optimizations you can do. But of course, all modern compilers implement some version of "-ffast-math" which turns off those rules and allows for the optimizations. It's no longer standard C/C++, but that doesn't mean that switch can't exist. The compiler is a computer program, it can output anything it wants, regardless of what the standard says. Nobody is going to go to jail because you turned on -ffast-math. The code will still run just fine.

So, my question is, why can't you write a compiler with an option that's like "I don't particularly care that .stack changes, do the tail call optimization". A -ffast-math, but for tail calls. Is there a technical reason why you can't do this?


> So, my question is, why can't you write a compiler with an option that's like "I don't particularly care that .stack changes, do the tail call optimization". A -ffast-math, but for tail calls. Is there a technical reason why you can't do this?

Because the stack in WebAssembly is implicit: you can only observe it, never modify it yourself. There are simply no instructions for this. The WASM stack is not present on the WASM heap, like it is on your physical machine.

You simply cannot express tail calls with the instruction set given to you by WASM right now. No hacks are possible.


Is this the reason clojure added the recur form? IIRC, the JVM doesn't support tail calls because Java doesn't need it.

https://clojuredocs.org/clojure.core/recur


> Tail calls are fundamentally something that the compiler _cannot_ solve.

I don't see why. Compilers are how tail calls are literally always implemented. It's not like there's hardware support.

What makes this impossible?

.

> The issue is that to avoid stack blow-up you need the engine to recycle stack frames.

I mean, what's stopping you from just implementing a trampoline?

.

> The problem with this is that in WebAssembly (and JavaScript) the call stack is observable via the .stack property of thrown exceptions.

Okay? This seems fine to me. What makes this a problem?


> Compilers are how tail calls are literally always implemented.

WASM doesn't have the same set of operations as a typical CPU. It's not something a compiler to WASM can do.


Compilers implement tail calls on architectures where you can JMP with impunity. Wasm is not such an architecture - it deals with calls and returns and call stack as concepts.


Trampolines are a hack that penalizes the performance of any language that uses tail calls a lot, like functional languages (Scheme, Lisp, Haskell, etc.).

You could also handle exceptions manually, without VM support, but that too would incur considerable overhead.


.stack definitely seems like a mistake in the context of WebAssembly. I'd be interested in seeing the justification for it.


Because people want a call stack when an exception is thrown, so they can print it out, just the same way they do in JavaScript and every other language.

EDIT: Also as another commenter mentioned, this is a property exposed by the VM engine to the host environment, not something directly observable from within WebAssembly itself.


> Tail calls are fundamentally something that the compiler _cannot_ solve. Trust me, if there were a way, we would have spared ourselves all this work.

I have been out of the loop regarding compiler development for a long time, but what prevents you from converting your program to CPS and using a trampoline like some Lisps still do?


A trampoline is a workaround that wastes memory and time. The idea is to avoid that so that you can use function calls efficiently.
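
For reference, a minimal trampoline sketch in C (illustrative only): every "tail call" becomes a return to a driver loop plus a re-dispatch, which is exactly the overhead being described.

    #include <stdio.h>

    typedef struct step step;
    struct step { step (*fn)(long); long arg; };

    static step countdown(long n) {
        /* instead of tail-calling ourselves, describe the next step */
        step next = { n == 0 ? NULL : countdown, n - 1 };
        return next;
    }

    int main(void) {
        step s = { countdown, 100000000 };
        while (s.fn)              /* the trampoline: bounce until done */
            s = s.fn(s.arg);
        puts("done, and the stack never grew");
        return 0;
    }

Every hop costs a return, a struct copy, and an indirect call where a proper tail call would be a single jump.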


What? Compilers handle tail calls all the time, even in languages like C and higher-level functional languages.

It's just a jmp? Does the wasm VM not have a jmp instruction?!?


https://www.w3.org/TR/wasm-core-1/#control-instructions%E2%9...

That is the list of control instructions. WASM seems to be a bit of a misnomer, as it is not an assembly language in the more conventional sense. It is a structured language and provides a limited goto in the form of branches (in the section I linked to), which are constrained to a particular scope.

If you compile your whole program so it fits within a single function, then yes you can use this to do universal tail call elimination. Otherwise you are restricted to doing TCE only on auto-recursive functions and maybe mutually recursive functions if you have a single entry point (of the set of functions) and decide to optimize by moving all of them into one function.

Otherwise, function calls are performed using one of the two call instructions which (presently) implement a behavior more like conventional call stack/stack frames. This proposal would add a second pair of call instructions that a compiler can emit which the WASM runtime would then optimize (by not generating new stack frames).


So no self modifying code either, since it seems like I can't just get the address of a function and modify the byte code?


That would be used most often for malware, unfortunately.


> Tail calls are fundamentally something that the compiler _cannot_ solve.

How does the famous 1977 Guy Steele paper on compilers optimizing tail calls not apply?

https://dl.acm.org/doi/10.1145/800179.810196


My guess is that this is assuming that the compiler can rewrite a return into a jump. That is, these compilers have control over the stack on the machine.


compilers always have control over the stack on the machine: they generate the code which realizes the notion of a stack (and happily usually have dedicated registers to use for that generated code)


Not something compiling to WASM. Nor something compiling to JVM Bytecode.


If WASM and JVM bytecode have a goto, then the compiler can use it. Anyway, if you've never read that paper I linked to above, I do believe you'll find it mind-blowingly cool stuff. Not kidding.


WASM's version of a goto is a branch, it is constrained to a limited scope. This creates a problem for general tail call elimination but not for auto-recursive functions.

Auto-recursive functions can switch to using a branch or using another loop construct, and the compiler can generate that WASM code. The control structures available are described in the link below. Mutually recursive functions where only one is meant as the entry point could be similarly converted into WASM instructions but where the additional functions are, essentially, eliminated and inlined into the original caller (this may not be a good general solution, though). But general tail calls cannot be converted into gotos as you might with a more conventional assembly language since WASM branches cannot go to arbitrary instruction addresses.

The problem is that in a conventional assembly language, if you do TCE then the goto would go to (har har) the same (or nearly the same) point as a regular function call (maybe it skips the first few instructions, depending on the calling conventions involved). With TCE and WASM you'd have to, essentially, inline the called function, or a variant of it, in order to be able to branch to the start of it. You cannot branch into a different function or to the start of a different function; you have to call the other function, and then you have a single entry point. Which, in WASM as currently specified, means you have to use the current calling conventions, which do not permit TCE.

https://www.w3.org/TR/wasm-core-1/#control-instructions%E2%9...


They don't. That is literally the thing.

Edit: I should say, they don't have anything equivalent that can jump out of the method you are in.

Edit2: This is a bit more clear if you consider what it means for JVM bytecode to have a "return" set of instructions. Why does the bytecode need a return, if that is all managed by code that the compiler should handle anyway? You can look at the instructions here: https://en.wikipedia.org/wiki/List_of_Java_bytecode_instruct.... Note that it is the "jsr" instructions that let you do the equivalent of a long jump, and those specifically manipulate the stack. There is a "goto", but it is not valid to have that jump outside of the subroutine you are in.


This paper espouses the use of jump+arguments instead of call. Wasm has no control-transfer instruction that doesn't use the stack, nor does it have writable access to the stack pointer.


There are probably WebAssembly-specific things I am unaware of. So, the following is a general "TCO" discussion.

I normally consider TCO a compiler feature. Replace a call to the head of yourself with a jump (possibly to just after the "pull the arguments from the call stack to where the code wants them" prologue), making sure that the correct locals are present where they need to be.

Well, that's for self-TCO. General TCO is definitely trickier (probably requires juggling stack allocations so as to ensure that the tail-called function has enough space for all that it needs).
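
Sketched at the source level in C (hypothetical function; a real compiler does this on the emitted code, not the source):

    /* Before the rewrite the last line was a self tail call:
       return sum_to(n - 1, acc + n);
       After it, the call is a jump back to the top with updated
       "arguments". */
    long sum_to(long n, long acc) {
    top:
        if (n == 0)
            return acc;
        acc = acc + n;
        n   = n - 1;
        goto top;
    }

    int main(void) { return sum_to(1000000, 0) > 0 ? 0 : 1; }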


Wouldn't it be much more productive to do a campaign to get `.stack` out of WASM?


Throwing in my 2 cents to agree with apignotti - as someone who has implemented a compiler for a functional language that emits wasm, it is not possible to solve this performantly in the compiler.

Because wasm only has structured control flow & no way for user code to modify the wasm stack, there isn't any good way to tail call between multiple independent functions, particularly dynamic function calls. Simple tail recursion of a function calling itself is simple enough and can be implemented in the compiler (and I did), but that's it, and not enough to correctly implement Scheme, for instance.

I also want the wasm standard to remain as simple as possible, but this isn't a very complex addition and it is required for reasonable performance for both functional languages as well as C++20 features like coroutines, as mentioned by the article.


Can you please clarify? apignotti wrote

>Tail calls are fundamentally something that the compiler _cannot_ solve.

You write

>it is not possible to solve this performantly in the compiler

So is it fundamental or not? Apologies if the question seems direct or rude.


No worries - it depends on what your constraints and context are. It's fundamentally something a wasm-emitting compiler cannot solve performantly in general. You can use trampolines, as the sibling comments mention, but you'll take a heavy performance hit.

If we stretch things too far we'll end up in "wasm is Turing-complete and thus can run anything" territory. If you're doing something performance-intensive, say x86 virtualization, which I believe apignotti is doing at Leaning Tech, I think it's fair to describe the problem as unsolvable by the compiler.


Thanks, that's pretty much what I suspected. I guess the issue is that the term (fundamental) is being used in a specific context assumed by experts in that area.


> So is it fundamental or not?

Obviously a Turing-complete system can simulate any other Turing-complete system. For example you could write an emulation of any system you like in WASM, and in the emulator you could have tail call optimization. Thus from context it should be apparent that we're talking about efficient implementations.


'call' leaves return state on the stack, so if you use it to implement 'jump', that has to be cleaned up. A trampoline is kinda the only choice if you are forced to run on a stack: go ahead and use 'call' as much as you like, but as the useless return addresses (and often locals) pile up on the stack, at some point you just return all the way back (or do a longjmp equivalent if your environment supports it) and start over again.

This is measurably worse than just using jump, and as another poster pointed out, can introduce an O(n) term that doesn't need to exist. (edit: never mind - if you are going through the forward direction n times then it doesn't change complexity to do a little more work on the way out)


It might be possible for a compiler to generate code that effectively does trampolining. But that would definitely have a performance cost.


Wasm is too high level to implement your own tail calls. The WASM virtual machine handles the call stack & function calling convention, instead of being something that the code itself is responsible for. This means that the compiler can't implement tail calls; WASM doesn't allow a jump instruction in one function to jump into a different function.

There are a bunch of reasons why WASM was designed to be higher level than native assembly. One is that they want it to be more compact, to save bandwidth when downloading via the internet. Another is that they want to make it faster to compile down to native code; one way they do this is by making the control flow more structured, for example forbidding irreducible control flow graphs. There is also the matter of safety. WASM defines several validation constraints that are checked before the bytecode is run, e.g. the target of a jump instruction must be listed in the labels list. If WASM were too low level, it wouldn't be possible to do the safety checks they want.


The compiler absolutely can implement tail calls, I don't know why this keeps getting thrown around. Adding a high-level directive in the spec doesn't enable the compiler to do anything, it just enforces it. The only thing preventing it is browser vendors wanting the .stack property to stay well behaved, but that isn't required by the spec and certainly isn't relevant for non-browser targets.


On most hardware architectures (and software ones based on them), the compiler controls the calling convention: how the stack is managed, what gets pushed there vs. passed in registers, and so forth. The architecture may or may not have helpers, like specific "return from subroutine" instructions, that help manage the stack... but a lot of things are under the compiler's control.

WASM is not like that. It doesn't support jumps or that kind of manipulation. It does not expose enough control over the stack and calling conventions.

In theory could the compiler emit a virtual machine that simulates another machine where it _can_ control these parameters? Sure. It can rewrite it all into a huge loop-and-switch statement. But that is not going to result in anything close to efficient native code. By expressing tail calls directly in WASM the JIT can generate much more efficient code that takes advantage of the platform's native calling convention and tail calls can be _actual_ tail calls on the hardware.


This is simply false.

The compiler cannot implement tail calls correctly as it stands. You do not have access to modify the WASM stack and it's not present on the heap like it is for normal programs.

No compiler tricks can enable tail calls in WASM at the moment (with the exception of trampolines which always work and are absurdly slow).


I think they're referring to a compiler that is targeting WASM, not a WASM-to-machine-code compiler.


Your concern with burgeoning standards reinforcing the browser duopoly is well-founded, but unfortunately you're sticking your stake in on the wrong side of the dragon here.

It sounds like you're confused about how Wasm works, and you imagine that it's like a normal assembly language, with calls and returns implemented by the compiler using instructions that push and pop a stack in memory. You're probably right that that's how it should have been designed.

But that is not how Wasm works. The stack is not in Wasm's "linear memory". (This makes it much easier to compile access to local variables efficiently.) Calls, returns, and in a sense exceptions in Wasm are provided by the virtual machine at a fundamental level in a way that makes it impossible for a compiler targeting Wasm to implement tail calls, call/cc, or cooperative multithreading, except by compiling your entire program into a single function, which ruins the performance of existing Wasm implementations.

But, as Scheme has demonstrated, once the virtual machine supports tail calls, a compiler can implement exceptions and cooperative multithreading by way of compiling to continuation-passing style. So this is not the beginning of a long series of extensions; it's Lambda the Ultimate Goto.
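
A tiny sketch of that last point in C (made-up names): with tail calls available, "throwing" is just tail-calling an error continuation that was passed in, with no unwinding machinery required.

    #include <stdio.h>

    typedef void (*ok_k)(long);
    typedef void (*err_k)(const char *);

    /* Continuation-passing division: instead of returning, tail-call
       one of the two continuations. */
    void div_k(long a, long b, ok_k ok, err_k err) {
        if (b == 0)
            err("divide by zero");    /* the "throw" */
        else
            ok(a / b);                /* the normal return */
    }

    void print_ok(long v)         { printf("result: %ld\n", v); }
    void print_err(const char *m) { printf("error: %s\n", m); }

    int main(void) {
        div_k(10, 2, print_ok, print_err);
        div_k(10, 0, print_ok, print_err);
        return 0;
    }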


The purpose of the wasm rules (e.g. declaring variables/function args instead of having push/pop operations) is to allow the compiler to efficiently determine that functions follow the calling convention and won't be able to mess with non-wasm functions they call or are called by. It means that you can put return pointers on the stack without the wasm ever being able to touch them.


> But, as Scheme has demonstrated, once the virtual machine supports tail calls, a compiler can implement exceptions and cooperative multithreading by way of compiling to continuation-passing style. So this is not the beginning of a long series of extensions; it's Lambda the Ultimate Goto.

Well to an extent. CPS can be extremely slow.

Scheme shows that continuations will also be needed down the road. But yes, that's about it.


IIRC, one shot continuations can be made pretty efficient, effectively folding back into a single stack. If there's a simple way to safely express this in wasm that might be the right balance. If you want multishot continuations you'd need to explicitly clone the one-shot continuation.


I think Koka uses a different approach for its multishot continuations than Scheme does, as well.


In theory CPS is very fast indeed, but of course our hero can easily collapse attempting to cross the chasm of the Sufficiently Smart Compiler. I suspect Chicken using CPS for everything is the main reason http://canonical.org/~kragen/urscheme, which doesn't implement call/cc, is so much faster than Chicken.


> Reference types maybe. But I've yet to see a convincing argument as to why they are absolutely necessary.

Are you referring to WASM GC here (or managed memory in general)?

If so, I think the main compelling reason is interop. Without GC in underlying architecture, any managed language targeting WASM must include its own runtime and GC. If you then want to write a polyglot program using multiple languages, you now have a single WASM executable containing multiple garbage collectors each managing their own objects. If you end up with cycles between those different managed heaps, you can end up in a situation where the GCs are unable to reclaim memory because they don't understand cross GC cycles.

This is why Chrome did Oilpan [1]: so that they could more easily handle cycles between the DOM, JS heap, and (aspirationally at the time) Dart VM heap.

A more general user-value proposition argument is that putting the GC in the WASM implementation means any improvements made to it are amortized across all managed languages that target WASM. Also, it makes applications written in managed languages and compiled to WASM smaller because they don't have to ship a runtime and GC with the app.

[1]: https://chromium.googlesource.com/v8/v8/+/main/include/cppgc...


I'm likely in the minority in my thinking, but I don't think targeting wasm with managed languages is such a great use case for wasm anyway, and so it wasn't worth building in GC to support. For the times you want to, something light like reference counting should be easy enough.


I'm biased since I work on a managed language that targets the web, but I strongly believe that it is a great use case for WASM.

Most programs targeting WASM are client-side apps with rich user experiences, and I believe all but a small fraction of users would be better served by implementing rich UIs in managed languages. Doing UI work in C++ (which I have done plenty of) is a special exercise in pain for almost no upside. GC is great.


I went back to that pain a few years ago after many years away from it. After a few months I got right back in the swing of things, spending my days arranging and solving little memory ownership problems. I got to thinking: "This is programming". And I didn't really mind it.

But every so often I had to be honest and remind myself of what I had noticed at the beginning: thinking about and managing memory is a colossal waste of time, even with smart pointers and the like. The machine can do that, and has been able to do it efficiently for most things for a long time.


JavaScript is already a GC'd language effective at UI, so it doesn't seem to me like there's a strong need for wasm to fill the same role. Rather, it seems to me like wasm is better suited to accelerating processing-heavy tasks those rich web apps depend on, like custom [de]compression, ML models, data processing, image processing or simulation of some kind.


If your language uses a tracing GC, you can’t just add reference counting and expect existing programs to work without memory leaks


> I'm a huge functional programming evangelist, but high-level stuff like this does not belong in a low level language bytecode like WASM.

A tail call is a jump. Jumps are not "high-level stuff". They're very simple instructions. If WASM can't do a jump, it is actually WASM that you can't call "a low level bytecode".


WASM doesn't have jumps [1]. And it's not as low-level as you might expect. But even still you're assuming that low level means a specific model of computation a la PDP-11 that's as fictional as any other.

https://webassembly.github.io/spec/core/syntax/instructions....


WASM isn't particularly low-level, in that it's a compiler's IR, with a few concomitant restrictions on how control flow is allowed to work. Here's a piece on those restrictions, and how they make implementation more difficult:

http://troubles.md/why-do-we-need-the-relooper-algorithm-aga...

> WebAssembly isn’t designed as a front-end language for general programming though, so what’s the problem? Well, WebAssembly does have some constraints, specifically it must produce code that is valid. WebAssembly is a stack machine, and you can’t jump to just any label since that label might point to code that pops too many values off the stack, or pops the wrong types off the stack, or pushes too many values onto the stack.

Also, to summarize another part: when a compiler targeting WASM sees control flow that WASM can't directly express, it has to re-encode it using structured constructs, falling back to a loop-and-switch form when nothing better fits (this is the job of the Relooper algorithm). Compilers taking WASM down to native code then have to understand that encoding and undo it back into a performant form. This adds up to a lot of work most VMs don't require, hence the proposal to move WASM towards a control-flow model friendlier to tail-call optimization, which would bring it more in line with physical hardware.
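
For a rough idea of that shape, here is a hand-written TypeScript sketch of the loop-and-switch form (the block numbering and helper functions are invented for illustration):

  // Control flow the source expressed with jumps gets re-encoded as a
  // dispatcher: an emulated program counter plus a switch over blocks.
  const condition = () => Math.random() < 0.5;
  const doA = () => console.log("A");
  const doB = () => console.log("B");

  function run(): void {
    let block = 0; // emulated program counter
    while (true) {
      switch (block) {
        case 0: block = condition() ? 1 : 2; break; // entry block
        case 1: doA(); block = 2; break;
        case 2: doB(); return; // exit block
      }
    }
  }

A native backend then has to recognize this dispatcher and turn it back into plain branches to produce fast code.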


I never expected WASM to be low-level to begin with, so I don't believe it could ever possibly be "as low level as I might expect". Something like NaCl (the original one) perhaps could have been called "low level".

> But even still you're assuming that low level means a specific model of computation a la PDP-11 that's as fictional as any other.

I don't see how jumps are "fictional". Plenty of CPUs have them. Even actual stack CPUs, for that matter.


I feel like you're ascribing some judgement to the word fictional, it just means that machines, real or virtual, present programs a model of computation that is an abstraction over the implementation. One such abstraction is the notion of a program that is a block of instructions executed in sequence with a program counter, registers, memory, a stack. But that's not the only abstraction you could imagine. The JVM for example eschews manual memory management. And WASM eschews the concept of a program counter, and arbitrary jumps, and instead only allows structured control flow.


I feel like you're using the word "fictional" in a sense that renders all models of computation "fictional" to the same extent and hence renders the word "fictional" itself meaningless/uninformative. Personally I feel like there should be some distinction between features depending on how much supporting machinery you need to physically implement them. To me at least, "low-level features" are those that require the smallest amount of physical machinery to implement them. Things like bitwise operations, indirect addressing, arbitrary IP adjustments, etc. Definitely not things like enforcing stack discipline, enforcing block structure and such. (Just so that you know what I mean when I say "low-level". Opinions on that may differ very wildly, of course, and it's perfectly possible that other people use that term differently.)

I'm not saying that "fictional" things are bad. Tail calls themselves are "fictional" to me in the sense that all tail calls are jumps but not all jumps are tail calls, and the distinction between a tail call and a general jump is too difficult to make for a reasonably sized physical machine and best left to the compiler. But that is not me being judgmental against tail calls, of course -- I love tail calls and consider their lack a huge design failure wherever they're absent.


Jump as a concrete implementation is easy to understand (it's just math on a program counter).

But code is both a mechanical implementation and an abstract, often mathematical, concept. In the concept space, jump is as abstract as call, variable assignment, operator evaluation, etc. A Von Neumann machine is but one way to implement an abstract mathematical / conceptual machine.

In the mathematical space, there's no "high level" vs. "low level," there's just "features a language has" vs. "features it does not." There are other abstractions that are actually harder / impossible to do correctly with jump in the language (stack unwinding comes to mind, which is why C++ solves the exceptions vs. setjmp / longjmp dichotomy by... Not solving it, and a program that uses both will just do something undefined. C++ without setjmp / longjmp would be a language with fewer undefined behaviors).


> Jump as a concrete implementation is easy to understand (it's just math on a program counter).

This doesn't take into account what happens to memory on the stack if you jump into or out of the scope of a local variable. You can say that memory and initialization is not the implementation's problem and rely on the compiler to emit instructions that correctly adjust or initialize the stack before and after jumps. Then, yes, you get a very simple jump implementation.

But you also get an architecture that must either do stack analysis before it can safely run code, or you get an unsafe architecture. Those are both valid choices (the JVM does the former and native architectures do the latter), but there are real trade-offs with them.

A third way, that WASM does, is to say that control flow isn't simple, but that you get safe stack management for free with it.


Possibly, but when I talk about "low-level stuff", I usually imagine very concrete things that are suitable for physical hardware implementation. Things like what the simplest RISC-V chip is designed to do. Unlimited jumps definitely are one of those things. Notions of stack frames or block structured languages (limitations on jumps) etc. definitely aren't among them. So I couldn't possibly ever consider something like WASM to be "low-level", because it places constraints on your code that no reasonable physical CPU would ever enforce.

[EDIT, from a deleted comment of yours:

> The RISC-V chip is in the same family as the PDP-11 architecture. The Turing machine itself requires no JMP instruction, so it is possible to build a CPU without one. Has anyone done so? In general no, because the PDP-11 had profound impact on the world of physical computing. But one does see it from time to time; GLSL, for example, has no `goto` instruction because the graphics card shader hardware it was built to operate on does its magic by running exactly the same instructions on multiple data; jump would imply the ability for one of the instruction flows to be different from the others in its processing cell, which the hardware cannot support while maintaining the throughput it must to accomplish its task.

I get what you're trying to say here, but sort of a guiding principle for me here for many years has been what I call "the principle of minimal total complexity". While you can keep making the CPU simpler in many cases, if that causes your program to become excessively long in the sense that the total size of the description of your program AND the device it's running on starts actually increasing, that's a place you don't want to design yourself into in practice. In case of sequential machines, I don't consider removal of features that makes programs unnecessarily long "making the machine even more low-level", that's just "making the machine dumber" to me. Feel free to look at the Oberon system (both HW and SW) to see what I have in mind by that -- that seems to be about as minimal a complete system as I can imagine in practice.]


To be clear, I deleted that comment because it was inaccurate. GLSL has break, continue, and return, which are flow-control statements. It excludes goto, and my explanation for why it excludes goto is probably incorrect (but I don't have time today to rabbit-hole on why goto was excluded or why the SIMD architecture can support break and continue just fine while excluding goto).

> In case of sequential machines, I don't consider removal of features that makes programs unnecessarily long "making the machine even more low-level", that's just "making the machine dumber"

You may be interested to consider how incredibly complex the modern x86 architecture is to implement because it supports sequential program execution as a core invariant principle. As a result, modern computers (which strive to be faster than a PDP-11) have to do a massive amount of work to parallelize that sequential code because parallel execution is the only frontier of fast computation that remains. They literally rewrite the opcodes on the fly into something that can be SIMD'd and do branch prediction, where code is run speculatively just to discard the result. All to support the idea that deterministic flow control should be possible in 2022. It's a brilliant fantasy the chipset manufacturers have constructed for us so we don't have to reframe our thinking.

I think I get what you're saying, but in modern times calling jump "low level" (or, for that matter, calling branching in general low level) is a very "Do you think that's air you're breathing?" kind of position. I'm not aware of anyone seriously considering approaching the challenge of all this implementation complexity by throwing out fundamental assumptions of our PDP-descendant instruction sets and saying "Here's a new machine code, it's designed for parallel execution, the first assumption you must throw away is the order in which any of these instructions are executed."

But I suspect we're getting very close to that day.


Maybe it was inaccurate but I got what you were trying to say -- that "incoherent execution" is difficult on SIMD machines.

> You may be interested to consider how incredibly complex the modern x86 architecture is to implement because it supports sequential program execution as a core invariant principle. As a result, modern computers (which strive to be faster than a PDP-11) have to do a massive amount of work to parallelize that sequential code because parallel execution is the only frontier of fast computation that remains.

I'm aware what recent CPUs do with ISA instructions. That flies very badly in the face of the minimum complexity principle as well. The amount of physical resources dedicated these days to making your legacy code run just slightly faster is exceptionally wasteful.

> but in modern times calling jump "low level" (or, for that matter, calling branching in general low level) is a very "Do you think that's air your breathing?" kind of position

Well, it's low-level for the kind of minimum complexity system that I'd consider ideal. Not necessarily for the abominations forced on us as an accident of history.


You might be interested in a fascinating game called TIS-100 by Zachtronics. The game is a lot of things, but a core premise is that it imagines a computer from a time approximately parallel to the PDP-11 that took the form of small compute components that were connected to each other instead of a monolithic central processor. It's a fun game, and it sort of raises the question of which is the "abomination[s] forced on us as an accident of history." Because when you look around at most of the biological world, you see heavily distributed systems with some centralization, but computers (man-made things that they are) are heavily centralized, clock-locked, deterministic... And energy-intensive. And slow.

History is arbitrary but not random, and it's fun to think about how things might have been different if the first machines started embarrassingly parallel with follow-up work to consolidate the data instead of embarrassingly centralized with us now in the era of how to make the monoliths fast. It's interesting to think about what's "ideal" about a machine that supports arbitrary jumps (and the global address space that demands, and the sequential execution necessary to prevent decoherence, and the memory protection demanded because some addresses are not executable code and should never be executed, etc., etc.).


> and it sort of raises the question of which is the "abomination[s] forced on us as an accident of history."

What I meant by that is the legacy of AMD64 having thousands of instructions, many of them with arbitrary opcode encoding, half-assed SIMD ISA instead of a proper vector ISA, and the ability to emulate an 8086, all purely for reasons of backwards compatibility. If you started designing a computing ecosystem completely from scratch, surely you wouldn't end up with an AMD64-based IBM PC descendant as your best idea you could come up with?


This would be kind of a sad direction to go in if we didn't have any constraints, since we already have new architectures going in this direction anyway. When you take the crust of the ISA off, though, all processors look similar, and that's where the interesting bits lie.


I don't think web assembly is low level at all. Actually the name is confusing because it's really not much like an assembly language.


Compilers can easily lower tail recursion into loops, but not general tail calls between arbitrary (possibly unknown, indirect) functions.

The best they can do is use trampolines, which come with a high performance cost.
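
For contrast, a minimal TypeScript sketch of the easy case, self tail recursion lowered to a loop (gcd is just an arbitrary example):

  // The recursive call is the last thing that happens...
  function gcdRec(a: number, b: number): number {
    return b === 0 ? a : gcdRec(b, a % b); // tail call to itself
  }

  // ...so a compiler can mechanically rewrite it: reassign the
  // parameters and jump back to the top instead of growing the stack.
  function gcdLoop(a: number, b: number): number {
    while (b !== 0) {
      [a, b] = [b, a % b];
    }
    return a;
  }

Once the callee is a different function, or an indirect call through a function pointer, no such local rewrite exists.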


> Wasm should only care about two things: Security and Performance

For languages that rely heavily on tail calls (functional languages in particular), good performance can only be achieved by having tail calls (or general-purpose jump instructions) as part of the instruction set.

Excluding tail calls hurts performance for these languages as they have to resort to techniques like trampolines. As apignotti suggested, it's impossible for a compiler to generate efficient code for tail calls if the underlying instruction set does not support it.


Not a compiler expert, but don't a lot of VMs for languages with TCO have a special tail call instruction? I know Lua's VM does.

https://www.lua.org/source/5.4/lopcodes.h.html


Webassembly doesn't have a goto.

There are some special cased scoped branches but if you read the spec you'll see they are so specific that you can't branch back to the head.

(ex compiler dev)


WebAssembly is actually very high level. You need tail call support if you want to preserve the original structure of the program (which seems to be the intention). I think you can get around it with e.g. CPS, but then you'd get something more like traditional assembly.


There are two compilers involved, Source to Wasm and Wasm to native (usually JIT).

The second could easily optimize tail calls, but it is required not to do so to increase reliability (all implementations need to agree which calls are tail calls that should be optimized, and the simplest way to agree is to never do it); otherwise a Wasm module could work fine on one implementation and be unusable on another.

The first compiler cannot; the only possibility is to use whole-program transformations that significantly degrade performance.


An example of a tail-call that can easily be compiled to a loop is:

  let sum list =
    let rec loop a = function
      | [] -> a
      | x::xs -> loop (a+x) xs
    in
    loop 0 list
An example of a more tricky case which may be compiled to a loop is the following mutual recursion:

  let rec even_then b n =
    if n = 0 then b else odd_then (not b) (n - 1)
  and odd_then b n =
    if n = 1 then b else even_then (not b) (n - 1)
And this may be compiled either to a loop which uses a trampoline (or switch on the next function to run, which is a bit like a static trampoline) or to two separate functions that duplicate some code and are implemented as loops. A simple way this might get compiled is:

  fn even_then_or_odd_then(which_function, b1, n1, b2, n2) {
    loop {
      match which_function {
        Even => {
          if n1 == 0 { return b1 }
          else {
            b2 = !b1;
            n2 = n1 - 1;
            which_function = Odd;
          }
        },
        Odd => { /* similar */ }
      }
    }
  }
  
  fn even_then(b,n) {
    even_then_or_odd_then(Even, b, n, null, null)
  }
  
  ...
An example of a tail call that is hard to compile arises in a parser combinator library:

  let parse_both p1 p2 =
    fun buf pos success ->
      p1 buf pos (fun pos r ->
        p2 buf pos (fun pos r2 ->
          success pos (r, r2)))
  ;;
(As I’ve written it this design has some issues, but they aren’t so relevant). Here it is important to be able to tail-call a function that you take as an argument (e.g. you want the parser p2 to tail-call the success callback you give it, and if you parse an arbitrary-length list you want to do it tail-recursively). Speaking more generally, general tail calls are like a safe long-distance goto. Wasm doesn’t have goto, and so a whole-program transformation (or a trampoline[1], which may have performance issues) is required to support tail calls.

[1] A trampoline means that functions effectively return continuations, except typically they call their continuations directly (up to some max number of times) and only occasionally return them to the trampoline (which pops everything off the stack), which immediately calls the continuation again. But implementing continuation-passing style this way is pretty slow.
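
To make that concrete, a minimal TypeScript trampoline sketch (all names invented), simplified to the variant that returns a thunk on every call rather than only occasionally:

  type Step<T> = { done: true; value: T } | { done: false; next: () => Step<T> };

  // The driver loop: every would-be tail call bounces back here, so the
  // native stack never grows, at the cost of allocating a closure per call.
  function trampoline<T>(step: Step<T>): T {
    while (!step.done) step = step.next();
    return step.value;
  }

  function isEven(n: number): Step<boolean> {
    return n === 0 ? { done: true, value: true }
                   : { done: false, next: () => isOdd(n - 1) };
  }
  function isOdd(n: number): Step<boolean> {
    return n === 0 ? { done: true, value: false }
                   : { done: false, next: () => isEven(n - 1) };
  }

  // trampoline(isEven(1000000)) === true, with no stack overflow.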


The problem here is that for security, your interpreted code cannot do whatever it wishes to your memory. And for performance, you can't validate every small memory operation your code does.

The result is a fairly high-level VM that doesn't export enough power for your program to implement GC or exceptions without sacrificing a lot of performance. Tail calls is a different matter, and there is a cost-benefit analysis for it.


> We shouldn't have GC, Exceptions or Tail calls in WASM, as long as the compiler can provide them.

I’m not a compiler guy, but I’ve spent a lot of time working closely with runtimes, and can say that befriending GCs/RCs, exceptions, continuations, etc. across them was my least favorite part.

That said, wasm is already a potential target for different runtimes built with no common VM in mind, so interop headaches are inevitable either way.


I think it's Rich Hickey who said something to the effect that tail calls are so fundamental that the underlying platform should be providing them.


tail calls are so fundamental that it's trivial to build a call-push, return-pop stack calling protocol on top of them, but not the converse


Could you point me to some further info on this?


sorry, it's kind of ... basic?

the kind of generalized tail call I'm talking about is generally referred to as a continuation. it's easiest to think of it as a jump. so you can imagine that if we have jump and a stack register, we can push+jump and that's the same as 'call'. we can pop+jump and that's the same as 'return'. since 'call' and 'return' have side effects, we can't really use them to replace jump.

so that's really straightforward, but things do get a bit screwy. if we support arbitrary jumps, the frame storage (arguments + locals) can't necessarily go on a stack, because our frames might have arbitrary lifetimes, so we can't reclaim them in stack order.

so in Scheme, for example, we just pull in a GC, track references to these closures, and dust our hands off with a smirk. if that doesn't work for you then you have to adopt some kind of framework that lets you know explicitly when these can be released. reference counting is a poor choice here because these references between frames can easily be cyclic.
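
a toy illustration of that construction in TypeScript (the instruction set is invented): given only jump plus a stack, call and return fall out as two-instruction idioms.

  type Op =
    | { op: "jump"; to: number }
    | { op: "call"; to: number }   // sugar for: push pc+1, then jump
    | { op: "ret" }                // sugar for: pop, then jump
    | { op: "print"; msg: string }
    | { op: "halt" };

  function run(program: Op[]): void {
    const stack: number[] = []; // return addresses
    let pc = 0;
    while (true) {
      const ins = program[pc];
      switch (ins.op) {
        case "jump":  pc = ins.to; break;
        case "call":  stack.push(pc + 1); pc = ins.to; break;
        case "ret":   pc = stack.pop()!; break;
        case "print": console.log(ins.msg); pc++; break;
        case "halt":  return;
      }
    }
  }

  // run([{ op: "call", to: 3 }, { op: "print", msg: "back" }, { op: "halt" },
  //      { op: "print", msg: "in sub" }, { op: "ret" }]);

the converse direction fails because call and return each carry a stack side effect you can't opt out of.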



And yet Clojure doesn’t have general tail calls (it has recur which allows something like tail calls so long as they are non-mutually recursive) and seems to do ok.


Yes, it's in the context of Clojure; his point is that tail calls should be provided by the JVM itself, and they can then be added to Clojure.


Edit: this is wrong.

For posterity my original comment was:

“From what I understand, tail calls can always (?) be lowered to while loops[1], which are expressible in WASM.

1. https://en.wikipedia.org/wiki/Tail_call#Relation_to_the_whil...


This is possible, and trivial, when self-recursing: A -> A

If you have an A -> B, or A -> [indirect] call, that is not the case.


When you have a set of functions that may recurse into each other, you can convert them into an iterative one by doing a `while cond { switch function_selector { .. } }` block.

It's an ugly piece of code, but because of parsers there is a huge amount of know-how on it.


You really can't.

For example, what if those functions are in different modules or libraries? Tail calls are still supposed to work.


If they are statically linked, it's not a problem, and since AFAIK wasm doesn't support dynamic linking, the only problem is crossing security boundaries, which will probably be kept unsolved anyway. (IMO, this one is better left unsolved; it's too complicated a problem to require for interoperability.)


I am genuinely asking, is your position that a compiler cannot convert:

g(): a = 1+1; b = 2+a; print(b); return f()

into code that does not allocate stack space and just reuses the frame allocated for g()?


It depends on the calling convention used -- whether it is caller clean-up or callee clean-up. With caller clean-up, the stack pointer is left below the argument list when the function returns. With callee clean-up, it gets popped above the argument list. You can perform tail calls naturally if it's callee clean-up. (With caller clean-up, a compiler might hypothetically pull off a tail call if the callee's argument list takes less memory, but it's not something you could add as a language feature.)


Thanks, that explains the trouble I was having. I was thinking about it without any consideration to calling-convention.


Only tail recursion, not tail calls in general.


One need only look at the Clojure language to see how tail calls are not solved by a compiler.


Neat! This proposal caused me a lot of headaches, mechanizing its specification was the primary contribution of my Master's thesis a couple years ago[1]. I forgot until rereading it just now, but doing so caught a typo in the proposal specification[2], my extremely minor contribution to advancing WebAssembly.

Glad to see it finally moving forward after stalling for so long! Excellent work!

[1]: https://github.com/jacobmischka/uwm-masters-thesis/releases/... [2]: https://github.com/WebAssembly/tail-call/issues/10


That might be the shortest (in word count) Master's thesis I have ever seen!


Claude Shannon's masters thesis was quite short as well [0], around 25 pages!

[0] https://www.cs.virginia.edu/~evans/greatworks/shannon38.pdf


Looks typical for a master's thesis to me. Maybe you're thinking of a PhD thesis?


No, in my industry (mechanical engineering) Master (and also Bachelor) theses were always much, much longer. Longer lines, less vertical line spacing, many more pages. Lots of faffing about ('Introduction', 'State of the Art', 'Theoretical foundation', ...). Faculties urge supervisors and students to keep it below 100 pages (of relatively dense type).


Agreed on the formatting of mine being ridiculous with huge margins and line spacing, it wasn't my choice.


Ha well the mechanization was a nontrivial amount of work (for me at least) and was considered part of it too. If it's still short despite that, then welp I guess I got lucky somehow.


> tail-calls has been proposed as an extension to the WebAssembly standard.

Do you know why it wasn't in the standard to begin with?

Even ECMAScript 6 mandates PTC (proper tail call) - article from 2016 on Webkit.org no less - https://webkit.org/blog/6240/ecmascript-6-proper-tail-calls-...


The standard mandates it, and the V8 team implemented it, shipped it behind a flag, then unshipped it for reasons that initially seemed, and ultimately were proven to be, FUD: WebKit shipped PTC and the world didn’t fall down.

The actual reason it was withdrawn is that it would have required expensive changes to Microsoft’s Chakra (the calling conventions were incompatible).

Then Edge died... and Google didn’t add it back. Go figure.

Firefox also has problems implementing cross-realm tail calls. That could be spec’ed around, but there’s no will to do so.


These absolute bozos. A Web Assembler that can't do jumps. What a show.


My comment is about ES6 proper tail calls. WASM is another story


Graaagh, sorry.


It's not that much better. Apparently at least some of the idiosyncrasies in the design of wasm, such as the lack of regular unstructured branches, are due to the internal implementation details of V8.


Every optimizing JIT for JavaScript has the same limitations.


Then maybe we shouldn't be repurposing VMs written for JavaScript for that brand new thing? Ditch the baggage and do it right. Especially given that, if successful, wasm is basically the next JS, and will be around for decades.


I'm all for rewriting things from scratch. That's why I'm doing a new Wasm engine from scratch. But reusing TurboFan is how we took Wasm from concept to near-native performance, shipped in Chrome, in 2.5 years.


That's all well and good - but, again, the spec will be around for decades. Compromises made today to ship things faster means a lot of pain in those years ahead. I pity all the compiler writers who will have to implement relooper again and again.

But instead we'll probably do things like asm.js - that is, bless certain wasm patterns such that advanced VMs would be guaranteed to optimize them. And so everything will be way more complicated than it needs to be. Just like the JS stack today.


That's why multiloop is coming.


Has there been any movement on that (or funclets) recently? All the repos I can find haven't been updated in a while.


If I understand what cross-realm means (code security), then I'm not sure that can be done at all.


Firefox can’t eliminate tail calls across realms. The spec could make an exception for that scenario.


That seems reasonable. Tail call elimination, limited to within a realm, still seems pretty useful.


Thanks for clarifying. What is this PTC you mention?


“Proper tail calls”, i.e. tail calls as specified for ES6


I am unsure, but I can make a guess. The first release of Wasm has always been considered a Minimum Viable Product, with a strong emphasis on Minimum. Pretty much only features already part of the legacy asm.js "standard" made the cut.

The fact that WebKit had some level of support for tail calls on the JS side is exactly the reason we chose it as the right platform to invest in.


How is ECMAScript 6 (high level language) related to WebAssembly (low level target)?


Wasm is meant to run efficiently on the same VM as JavaScript (an ECMAScript implementation), so Wasm features are restricted by current browser VM architectures.

(It is probably the main reason why Wasm only has structured control flow)


Finally :) My coworker back in 2017 implemented the WebAssembly backend for the Go compiler[0], and noted at the time that WASM doesn't have any equivalent to setjmp/longjmp. As a result, Go's WebAssembly implementation actually emulates a register machine on top of the WASM stack machine. Quoting him (some parts omitted):

> For example its architecture is a stack machine instead of a register machine. This means that it isn't immediately suitable as simply yet another target at the last stage of the Go compiler next to x86 and friends.

> There might be an alternative: Emulate what we need. We may use the stack machine to emulate a register machine and hopefully do so in a reasonably performant way.

> WebAssembly has linear memory with load and store instructions, which is good. We would not use WebAssembly's call instruction at all and instead roll our own stack management and call mechanism. Stacks would live on that linear memory and be managed by the Go runtime. The stackpointer would be a global variable. All code would live in a single WebAssembly function. The toplevel would be a giant switch statement (or WebAssembly's br_table based equivalent) with one branch for each function. Each function would have another switch statement with one branch per SSA basic block.

> There are some details that I'm omitting here, but in the big picture this looks like a decent register machine to me. Of course the performance of this heavily depends on how well WebAssembly can transform these constructs into actual machine code.

[0] https://github.com/golang/go/issues/18892#issuecomment-30931...
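
For a feel of the scheme described above, here's a very loose TypeScript sketch (layout, constants and names all invented): one toplevel dispatcher stands in for the br_table, and the call stack lives in "linear memory" managed by the runtime rather than on the host stack.

  const mem = new Int32Array(1024); // emulated linear memory
  let sp = mem.length;              // emulated stack pointer (a global)

  const FN_MAIN = 0, FN_SQUARE = 1, DONE = -1;

  function dispatch(entry: number): void {
    let fn = entry;
    while (fn !== DONE) {
      switch (fn) {                 // the "br_table" over functions
        case FN_MAIN:
          sp--; mem[sp] = 7;        // push an argument on the emulated stack
          fn = FN_SQUARE;           // "call"
          break;
        case FN_SQUARE:
          mem[sp] = mem[sp] * mem[sp];
          console.log(mem[sp]);     // 49
          sp++;                     // pop the frame
          fn = DONE;                // "return"
          break;
      }
    }
  }

A real implementation would also keep return addresses on the emulated stack and a per-function switch over basic blocks; this only sketches the dispatch structure.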


LLVM and other compilers that use SSA but target a stack machine can run a stackification phase. Even without reordering instructions, it seems to work well in practice.

In Virgil I implemented this for both the JVM and Wasm. Here's the algorithm used for Wasm:

https://github.com/titzer/virgil/blob/master/aeneas/src/mach...


Wow sweet! How do you find cool coworkers like these?


This is wonderful news! Thank you so much for doing this, and doing it right - this is much better than my idea of making a bad MVP web browser to provide second implementations to push through proposals when other vendors drag their feet.

As an author of a functional compiler that targets wasm, this is a dream come true - y'all do really cool work.


I'm using Blazor (C#) WebAssembly and I'm really wishing it could do DOM manipulation. My favorite tool for that is Dart, so I'm working on marrying C# and Dart for my client solutions.


I was struck by how reactionary some of the responses to this comment are. Many people seemed to feel obliged to firmly question your preferences. Kind of odd, I don't normally notice that as much here on HN.


I expected it because over the years any Dart- or Microsoft-positive statements I've made have received pushback. Some people can't let go of what Microsoft did 20 years ago and assume they are still evil (with GitHub Copilot, for example). Others who are JavaScript programmers can't let go of the fact that a Google exec said at the outset that Dart was meant to replace JavaScript. They have fought it over the years.

I almost felt like I was trolling by combining the two, but I'm actually using Dart and Blazor to see how they can work together, because Blazor Wasm doesn't do DOM manipulation and Dart does it in such an easy-to-use fashion that the marriage feels natural to me.

I try not to attack back and have a bit of a sense of humor about it, but it does feel like suppression of thought on HN


Well personally I wasn't questioning the preference itself, I was curious about the purpose of the preference since I've seen people avoid JS/TS both because they don't like it and because it struggles to accomplish certain things. I was simply curious which it was (if either).


Genuine question: is the goal to get something that is not achievable with JS/TS, or is the goal to simply avoid JS/TS?


I really dislike JavaScript for a number of reasons. Dart gives me more distance from it than TypeScript. Of course, I can't avoid it, but I don't have to deal with many annoyances such as which 'this' is this? Should I put 'this' into a var called 'self' to be safe? That's just one example of how JavaScript and I don't get along. I understand some people have brains that think this way, but mine doesn't


The comment reads like you used JavaScript 10 years ago. For instance, just use arrow functions and this remains untouched.


I know, but that's just bizarre. One function has its own this and another doesn't, so if my arrow function gets big and I want to make it a regular function I may have to rewrite it. Ugh.

Edit: I'm thinking of cognitive load too. I like to eliminate load I don't need. Keeping track of this is not something I want to do with my life

https://drpicox.medium.com/reducing-programmers-cognitive-ov...


> so if my arrow function gets big and I want to make it a regular function I may have to rewrite it.

There's absolutely no reason to make an arrow function a regular function just because it goes past a certain number of lines.

It really seems like you should become more familiar before forming an opinion. There's valid criticism to be made, but those aren't it.


I sometimes refactor code for purely aesthetic reasons. To each his own.

http://www.synergeticapplications.com/ergonomics.htm

By the way, I feel like I'm at a bar and you're negging me


or, better yet, stop using this


What do you like about Dart over TypeScript? The type system in TypeScript is far superior to Dart's. Other than that, the languages are more-or-less the same.


Actually, I'm taking another look at TypeScript today. I liked it well enough when I last investigated it in 2016, but the code I wrote then was just a veneer over jQuery. I'm sure things have improved dramatically, which I am about to find out. Frankly, a Turing-complete type system [1] seems like overkill to me and a complication I don't need.

Still, Dart had everything I wanted in a front-end language in 2013. I think JS devs were short-sighted to reject it out of hand. It's a shame it never caught on by itself, rather than just as a Flutter language. I read that Google is using it in Fuchsia, so there is still a chance Dart will be big in the future!

[1] TypeScript and Turing Completeness: https://itnext.io/typescript-and-turing-completeness-ba8ded8...


I highly recommend TypeScript. I don't really like the JS runtime, but, purely from a language point of view, TS is my favorite.

Some cool features:

1) Type unions

  interface A { a: string; }
  interface B { b: string; }
  type C = A | B;
  const c: C = { a: 'a' };

2) Type assertions

  if ('a' in c) { /* compiler knows c is of type A here */ }
  function isA(x: C): x is A { return 'a' in x } // compiler knows the argument is A if this returns true

3) Nice helper types

  interface A { ... }

  Readonly<A> // Like A, but all properties are read-only
  Partial<A> // Like A, but all properties are optional
  Omit<A, 'someProp'> // Like A, but without 'someProp'


Interesting. I'll learn more today as I rewrite one of my Dart components into TypeScript. But I also dislike Node and Npm so I'll try Deno first.


TBH, TypeScript's type system really impressed me. It's the strongest type system of any language I regularly use (I haven't had time to unpack Rust yet, and I learned enough Haskell to decide Haskell didn't help me solve problems I had).


It’s an expressive type system, but IME it allows developers to go crazy on type interdependencies and general entanglement, so you can’t just go to the “header” and quickly figure out what your method or a return value really is, despite TS having structural typing.

E.g. look at this: https://github.com/telegraf/telegraf/blob/v4/src/telegram-ty...


That's definitely overwhelming. After reading it for 5 minutes, though: there is a type called Telegram, and it has a field called 'Opts' that has a lot of different fields. MakeExtra lets you take a type from 'Opts' and exclude certain parameters, probably to do something like "you can set the chat message contents, but you can't set the chat id".



Surely it's easier to stick to JavaScript The Good Parts than to introduce a whole new layer in the form of Blazor?


Blazor is not equal to JavaScript. Blazor is Microsoft's answer to React and Flutter. So far of the three, I like Blazor, but not the server version that uses SignalR to communicate with the client because it locks you in. Blazor Webassembly is awesome, but just needs a good DOM manipulation library. With Blazor Wasm on the client, I can use anything on the back end.


> I really dislike JavaScript for a number of reasons.

Might I suggest being less sensitive/that your reasons are overblown?


I don't like liver and horse radish either. Should I eat that because a lot of people like it?


Yes, let's keep using JavaScript for all eternity now. It's the COBOL of this century.


What's C# giving you that Dart isn't, OOI? My impression is they're fairly similar languages. (Does Dart not have a WebAssembly backend?)


Blazor is using C# to build web pages, which isn't new, but using C# on the client with a WebAssembly version of .NET is. It works well enough, but is limited to simple things like onClick="SomeCSharpFunction" and cannot do things like getting elements by id or class. That's where Dart shines. Most will use either JavaScript or TypeScript, but I like Dart, so I want to see how it is to combine Blazor WebAssembly with Dart.

It would be even more interesting if there were a Dart.NET


My point is: why C#, why Blazor? What's that doing that you couldn't just do in Dart itself?


For someone about to go knee-deep into Blazor very soon:

Do you have any gotchas that aren't really mentioned on the MS documentation site? Primarily related to performance, since it will be one of my main concerns.


If one of your main concerns is performance, then all I can say is don't use Blazor. WASM will never be able to compete with regular JS so long as it has to go through an interop layer, plus it has to send down and parse an entire interpreter before it can even start looking at your actual UI and business logic. Server Side has to wait for a server response for every single action you take, and any small drops can make the entire thing sluggish, or even buggy.

That's not to say it isn't cool and can't work for some projects where, say, development speed is more important, but performant it just ain't.


If you're talking site load time you'd have to be careful, because .NET is compiled to wasm, which doesn't leave a lot of room below the standard site size. It's hard to tell because even ordinary sites now are huge. I need to do some testing.


What's wrong with JS bindings? Surely DOM manipulation is slow enough for the WASM/JS overhead not to be noticeable.


You still need to go through a port with Blazor, but you can minimize those calls for the most part. DOM manipulation was slow as recently as a few years ago, but improvements to Blink have made it blazingly fast. I agree with Svelte that we no longer need a virtual DOM

https://svelte.dev/blog/virtual-dom-is-pure-overhead

Edit: I didn't answer your question. There's nothing wrong with JS bindings. Both Blazor and Dart use them, which could get awkward going from Blazor -> Dart (as JS) -> JavaScript lib. I'm considering TypeScript again because no port is needed to call JS libs.


How can I help this?!

I’ve been wanting tailcalls to be in wasm for years! I consider it a blocker for all sorts of uses I’d be interested in


You can use F# with WASM. I know F# converts recursive tail calls to loops, but I understand this is another case. But shouldn't F# have the same problem?


I have not been following WASM generally, but my understanding after reading this:

Unlike regular machine language, WASM does not allow you to control the call semantics. That is, if you emit x86 machine instructions you can handle TCE entirely in your own compiler; the machine itself does not care.

WASM has special call instructions; these perform basically a conventional stack-frame/call-stack approach, and you can't short-circuit it. The closest you can get, which works fine for self-recursive functions and maybe some detectable mutually recursive functions, is to convert tail calls into loops within the compiled WASM output. So recursive tail calls work fine in WASM if your compiler doesn't turn them into calls, but leaves them as a self-implemented (by the compiler) loop structure using whatever appropriate WASM instructions are available. This is fine for self-recursive functions, but puts the burden of implementing TCE for them on the compiler writer. Which means you'll get it for some languages that compile to WASM, but not all, and only for the limited cases they support (probably restricted to self-recursive and some mutually recursive circumstances).

What this proposal introduces is a new version of the call instruction that would be used in tail call positions. Then the WASM interpreters would do the work of actually combining the stack frames internally. Detecting that a call is a tail call is actually pretty straightforward for a compiler. Given that it is a tail call, it can emit (or not) the new call instruction and will get TCE "for free". It doesn't have to do a code transformation from recursive to looping, and it can be applied more broadly than just recursive circumstances.


F# compiles to CIL bytecode, which has a tail call instruction. I'm not sure what happens to that when CIL is compiled to wasm, but since there's no way to do a general translation, it probably just drops the tail part and makes it a regular call.


Self recursion is a special case that WASM backend targets can more or less efficiently resolve the same way a human can convert a recursive function into a non-recursive one: explicitly put arguments onto your own simulation of a stack and perform the computation in a loop.

Unfortunately self recursion is only a subset of tail calling.



I'm working on a relational stream processor, and this is pretty critical to making it work well. A runtime standard with support for network connections would be nice too, but let's not get greedy.


> support for network connections

I wish daily for that as well, but stay tuned: we _might_ have found a half-satisfactory solution for that ;-)


pointer to a draft?


I want threads more than tail calls. WASM has processes with limited shared memory, but not real threads. This is a headache for porting high-performance games to the web.


Tangential but what's the status of garbage collection and DOM manipulation in WASM? Are we ever getting those? I understand it's a high-value technology without them, but I'm interested in writing full apps in say, OCaml (so I'm glad to hear that WASM is getting TCE!).


GC is making a lot of progress. There are VM and toolchain prototypes. You can compile Java and Dart to wasm on those today and it generally works and is pretty fast. (There is also a Kotlin prototype but I have less information about it.) Most of the big spec questions have also been resolved.

DOM manipulation hasn't changed - you still need to call into JS to do those. Ideas like WebIDL bindings have been proposed over the years but haven't shown enough benefit. JS is better for DOM-heavy code, while wasm excels at computation-heavy code. But you can write bindings from wasm (maybe someone already has for OCaml?), which is what toolchains do today - not as fast as JS, but often good enough.


I will add to this that we are in a healthy design loop that is tightening in on what I feel is a reasonable final design that is implemented in at least one high-performance engine, V8. AFAIK there are Igalia folks trying to get a parallel implementation in JSC to meet the 2-engine bar.

GC will not directly make DOM manipulation easier, but it will make it easier to integrate DOM references into a managed language that compiles to WASM, since it obviates the need for indirections through tables. This feature alone majorly opens up capabilities for WASM on the Web!


> JS is better for DOM-heavy code

If you only look at performance, yeah. Performance isn't the only thing we get from WASM, and it would be nice to get the other benefits without having to sacrifice it.

(That said, the marshaling needed for DOM access isn't that relevant, I wouldn't say this is a high-priority problem and would prefer people to focus on GC instead, like they are doing.)


Kotlin 1.7.0 shipped with a WASM compiler recently that currently requires some experimental flags in Chrome to enable features related to garbage collection, threads, and memory management. So, it looks like this is progressing nicely. Once this lands, I imagine there will be a whole range of languages that will be able to make good use of this.

As for DOM manipulation, that seems to work well enough but I'm sure there are further improvements coming.


While I'm also looking forward to TCE and GC for WASM, you can build full apps in OCaml now via js_of_ocaml.


I'm aware of js_of_ocaml and Reason, but I want it to run closer to native speed and use multithreading eventually.


Have you looked into ReasonML by any chance?


I lost track of it since the Reason/ReScript split.


That sounds super nice! Maybe this will turn out to be a nice way of compiling languages which make heavy use of tail calls to the browser (I'm thinking Haskell, PureScript, Erlang, OCaml). For PureScript specifically this means no MonadRec needed, which is quite exciting! (if and when a wasm backend is made, ofc)


I have to resist the urge to implement Scheme in this, right now, but I hope someone else has a chance to try this out for Scheme soon, and help inform the standard before things are finalized.


Why do we need an explicit construct for this as opposed to e.g. having the compiler replace the tail call with a goto back to the beginning of the function?


This was discussed elsewhere in the thread as well - the main problem is TCO support between multiple functions - e.g. when visiting a tree with nodes of various types, such that you have visitSumNode(&total) -> visitChildren(&total) -> visitMinusNode(&total) -> visitChildren(&total) -> ...

Since WebAssembly has a concept of functions, and doesn't allow jumps outside of the current function for security reasons, a compiler can't eliminate tail calls across functions (unless it can inline all of those function calls, which isn't always possible, nor always best for performance).
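
A hypothetical TypeScript sketch of that visitor shape (node kinds invented): every visit ends in a tail call into a different function, so no single-function loop rewrite applies.

  type AstNode = { kind: "sum" | "minus"; child?: AstNode };

  function visitSumNode(n: AstNode, total: { v: number }): void {
    total.v += 1;
    if (n.child) return visitNode(n.child, total);   // tail call
  }
  function visitMinusNode(n: AstNode, total: { v: number }): void {
    total.v -= 1;
    if (n.child) return visitNode(n.child, total);   // tail call
  }
  function visitNode(n: AstNode, total: { v: number }): void {
    return n.kind === "sum" ? visitSumNode(n, total)
                            : visitMinusNode(n, total);
  }

Without tail-call elimination, a deep chain of nodes costs one stack frame per node; with return_call-style instructions the whole walk runs in constant stack space.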


Would this mean that relooper etc. are no longer needed since you can just have all your basic blocks as functions and use tail calls to jump between them?


The industry doesn’t care about functional programming; are tail calls really necessary?


Thanks for doing this!


Thank you for this!


Please don't let Apple block this proposal like they did for browsers. We don't have tail calls because Apple decided they couldn't be bothered to implement it.


Safari has tail-call optimisation.

We don’t have it elsewhere mostly because Google:

1. Agreed to implement it [it’s in ES6]

2. Implemented and shipped it behind a flag

3. Unshipped it.

4. Proposed something else [ https://github.com/tc39/proposal-ptc-syntax ]

Prior to that, Firefox had proposed a carve out for cross-realm calls, but then they didn’t bother implementing anything.

While Apple is against Syntactic Tail Calls, they’re mainly just opposed to versions of it that would remove/unrequire the tail-call optimisation they already do: https://github.com/tc39/ecma262/issues/535

For the version of it that is backwards compatible, they wouldn’t need to do anything other than ignore the syntax. Their main concern is that it "could add confusion with very little benefit."


Safari is the only mainstream browser with proper tail calls implemented. Was there some historical point where Apple blocked it and so others followed along, but then Apple turned around and did it anyway?


JavaScriptCore did not block tail calls. In fact, they were one of the biggest proponents of it because of their Shadow Chicken approach.


ELI5: Tail-calls?


A tail call is a function call occurring in the final position of a function:

  void foo(...) {
    ...
    bar(x,y,z); // <= a tail call
  }
If the function has a return value (rather than void as above):

  int foo(...) {
    ...
    return bar(x,y,z); // <= still a tail call
  }
In the way most languages are compiled, function calls generate a new entry on the call stack (a stack frame). This is necessary for all non-tail calls in order to handle the bookkeeping around what to do when the call finishes and how the caller resumes.

With tail calls, that additional stack frame has no real value (outside, maybe, debugging information to give you a stack trace but traces can be collected other ways). Tail call elimination (or tail call optimization) will reuse the current stack frame rather than construct a new one. This reduces the memory overhead (you aren't constructing unnecessary stack frames) and gives some performance improvement (less bookkeeping overhead, and the function call becomes a simple jump). These two functions can, in principle, get compiled to the same thing if you have TCE:

  // uint is shorthand for unsigned int; the two versions get
  // different names here only so they can coexist in one file
  uint factorial_loop(uint n) {
    uint acc = 1;
    for (; n > 0; n--) {
      acc *= n;
    }
    return acc;
  }
  uint factorial_rec(uint n, uint acc = 1) {
    if (n == 0) return acc;
    return factorial_rec(n - 1, acc * n);
  }
But while that's a recursive example, tail calls and tail call elimination (TCE) aren't just for recursive cases. My first two examples, though they are just sketches, show a non-recursive example of tail calls. With full TCE (it isn't uncommon to have TCE only apply to self-recursive cases) those examples would also have tail call elimination performed.


A contrived example: https://godbolt.org/z/17T5MzvGY

The compiler is able to optimize the function calls in the return statement. Note how the calls are compiled into a jmp instruction, instead of a call instruction. This means that it doesn't need a new stack frame for each call. Even if the "x" is very big, it won't blow the stack.

The most basic application of tail call optimization is when you optimize a recursive function. It essentially turns the recursion into a loop; in fact, that's how you write loops in certain functional languages. But TCO is not just for that. It can also be used when one function calls a different function, like in the first example I linked. In this case the gotos can model any state machine, not just a self-loop.
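
A small made-up TypeScript example of the state-machine case: two scanner states that tail-call each other as they advance through a string.

  // Counts words: inWord / betweenWords are the two states of a DFA,
  // and every transition is a (possibly mutual) tail call.
  function inWord(s: string, i: number, words: number): number {
    if (i === s.length) return words;
    return s[i] === " " ? betweenWords(s, i + 1, words)
                        : inWord(s, i + 1, words);
  }
  function betweenWords(s: string, i: number, words: number): number {
    if (i === s.length) return words;
    return s[i] === " " ? betweenWords(s, i + 1, words)
                        : inWord(s, i + 1, words + 1);
  }

  // betweenWords("tail calls are jumps", 0, 0) === 4

With TCE every transition is a jump; without it, the input length is bounded by the stack depth (and note most JS engines other than JavaScriptCore won't eliminate these calls, so treat this purely as an illustration).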



Tail calls are super important for non-C like control flow, and it's great that it might be added.

But what I really think wasm should focus on is to get near native performance (say, <50% overhead). Until that happens, the whole endeavor seems pointless to me.


Estimates vary (because benchmarks and use cases vary), but the overhead is generally below 50%. That number was seen here,

https://www.usenix.org/system/files/atc19-jangda.pdf

(I assume that's what you refer to?) It's a good measurement, but it's from 2019, and it's just on 2 wasm engines. There are other estimates, like here:

https://kripken.github.io/blog/wasm/2020/07/27/wasmboxc.html

That tries to measure the fundamental overhead of wasm's sandboxing as opposed to a specific wasm VM, and it finds just 14%.


That seems very promising. If it really is that good, wasm has a lot more potential in my eyes.

But e.g. sharp/vips ended up with a much worse result: https://www.libvips.org/2020/09/01/libvips-for-webassembly.h...

It may just be a matter of waiting for SIMD and threads, though.


“proposal to move WASM towards a control-flow model friendlier to tail-call optimization, which would bring it more in line with physical hardware.“

Perhaps it is high time to bring the CPU hardware closer to the current WASM design.

Might simplify a lot of issues related to pipelining, cache-busting, and Spectre-like mitigations, not to mention the crazy varied breakout of legacy microcode into subfunctions (Intel, I am looking at you as well).

But what do I know, I just play with Unicorn engine.


Meanwhile, the new Spectre-BTI variant just now embroils both AMD and Intel in the media.

Naysayers (downvoters) seem sure that a JavaScript engine having this new tail call would not be impacted. Of course, I am talking about the generated microcode, not the JavaScript LIR bytecode.

https://arstechnica.com/information-technology/2022/07/intel...


For those who specialize in emitting platform-dependent native machine code translated from macro-bytecode to MIR to LIR, the following paper outlines the pitfalls of tail calls:

https://www3.cs.stonybrook.edu/~mikepo/papers/devil.ndss15.p...

Speaking in Firefox-ese, within its JavaScript engine, IonMonkey taking JavaScript bytecode down to Mid-level Intermediate Representation (MIR), then OdinMonkey (TraceMonkey) translates to Low-level Intermediate Representation (LIR), then for WarpMonkey (NanoJIT) to translate into native machine code.

OdinMonkey and WarpMonkey should not be handling tailcalls.


Why not focus on features that would unlock better integration of actual mainstream languages - Java, C#, Python etc.?

Let functional programmers discuss monoids on their endofunctor forums or something.


Actually even C++ can benefit from this.

In LLVM, C++ coroutines compile down to functions with the "musttail" attribute, which right now is not possible to express in Wasm [1].

It is also useful for efficiently implementing stuff like interpreters, where you use a tail call as a way to jump directly to the code of the next instruction (Wasm has only structured control flow, but you can see a tail call as a sort of jump/goto instruction)

[1]: https://discourse.llvm.org/t/supporting-coroutines-without-t...
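
To illustrate the interpreter pattern, a TypeScript sketch (opcodes and names invented) where each opcode handler ends by tail-calling the handler of the next instruction; in the LLVM/Wasm setting each of these returns would become a musttail call or a return_call_indirect:

  type State = { code: number[]; pc: number; acc: number };

  const handlers: ((st: State) => number)[] = [
    st => { st.acc += st.code[st.pc + 1]; st.pc += 2; return next(st); }, // 0: ADD imm
    st => { st.acc *= st.code[st.pc + 1]; st.pc += 2; return next(st); }, // 1: MUL imm
    st => st.acc,                                                         // 2: HALT
  ];

  function next(st: State): number {
    return handlers[st.code[st.pc]](st); // tail call into the next handler
  }

  // next({ code: [0, 4, 1, 10, 2], pc: 0, acc: 0 }) === 40

Dispatch becomes a jump straight to the next handler instead of a loop around a switch.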


Recursion has been a part of programming and programming languages for a very long time, and not just the functional ones. See ALGOL for recursion from very early on.


It isn't one or the other. Work on Wasm GC is ongoing and separate from this.


Recursion can indeed be useful when working with trees or graphs. However, I don't like using recursion instead of a loop, because you make the reader unroll your code in their head to understand what it really does. If you need to add two arrays of numbers, use a loop or array addition, but don't use recursion as a replacement for a loop.

Sadly the article uses a poor example: a useless factorial function. I have never needed such a function in production code, and if I had, I would write it using a loop. Using bad examples like this might create the impression that recursion is not useful in real code (which is wrong). It would have been better to use an example that looks like real code and that would become less readable without recursion.


> If you need to add two arrays of numbers, use loop or array addition

There are plenty of algorithms that make more sense when expressed using recursion. Iterating over a list of numbers generally isn't one of them. But walking a tree is a good example.


More specifically: if you don't want to recursively iterate a tree, you have to hold a list of prior nodes to return to anyway... which is no different from weaving that information into the stack. Non-recursive iteration in this case literally requires you to emulate the recursion.
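
In TypeScript terms (tree shape invented), the explicit-stack version makes that plain: the array of pending nodes plays exactly the role the call stack plays in the recursive version.

  type Tree = { value: number; children: Tree[] };

  function sumTree(root: Tree): number {
    const pending: Tree[] = [root]; // the emulated call stack
    let total = 0;
    while (pending.length > 0) {
      const node = pending.pop()!;
      total += node.value;
      pending.push(...node.children); // "frames" to come back to later
    }
    return total;
  }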


Those algorithms usually use non-tail recursion. Most uses of tail recursion are trivial to rewrite as loops.


> Most tail recursion uses usually is trivial to rewrite in a loops.

Except for tail calls that aren't self-calls. Which for code I write is actually fairly frequent.


Readability is really not a relevant factor for WebAssembly. Having tail-call support is critical for the languages that compile to Wasm and use tail calls themselves.



