Tell HN: We are trying to get tail calls into the WebAssembly standard
625 points by apignotti on July 12, 2022 | 290 comments
WebAssembly is a modern bytecode supported by all browsers and designed to be a compiler target for a wide variety of programming languages.

To effectively support some forms of functional programming, support for tail calls has been proposed as an extension to the WebAssembly standard.

This proposal reached Phase 3 of the standardization process years ago, but has since stalled.

Phase 3 is known as "the implementation phase", and the prerequisite for advancing the proposal to Phase 4 is support in two different browser engines. V8/Chrome support has been available for a long time, so another engine is required.

To unblock this situation we have contributed full support for WebAssembly tail calls to JavaScriptCore/WebKit/Safari. The PR is available here:

https://github.com/WebKit/WebKit/pull/2065

An in-depth article about the challenges of implementing this feature is also available. It is intended both as documentation for our contribution and as a general explainer of how tail calls actually work, with a particular focus on stack space management.

https://leaningtech.com/fantastic-tail-calls-and-how-to-impl...




Sorry if I'm missing something obvious, but how is this not solvable by the compiler?

I'm a huge functional programming evangelist, but high-level stuff like this does not belong in a low level language bytecode like WASM.

Wasm should only care about two things: Security and Performance.

With the standard blowing up like crazy we'll get neither. Worse, we'll cement the current duopoly of browser engines, because we'll make it (again) so complex that no one can create an alternative.

We shouldn't have GC, Exceptions or Tail calls in WASM, as long as the compiler can provide them.

What we should have is multi memory support and memory read permissions, because without them WASM lacks basic security primitives.[1]

Reference types maybe. But I've yet to see a convincing argument as to why they are absolutely necessary.

Everything else (as long as it can be done by the compiler) is just bloat.

1. https://www.usenix.org/conference/usenixsecurity20/presentat...


Tail calls are fundamentally something that the compiler _cannot_ solve. Trust me, if there were a way, we would have spared ourselves all this work.

The issue is that to avoid stack blow-up you need the engine to recycle stack frames. You might argue that this could happen implicitly in the VM, with all the calls in tail position being automatically converted to tail-calls. The problem with this is that in WebAssembly (and JavaScript) the call stack is observable via the .stack property of thrown exceptions.

Since an implicit conversion would be observable, engines cannot simply optimize the problem away, and that is the reason why new opcodes (return_call/return_call_indirect) are actually required at the WASM level.

For the specific case of direct calls (the return_call opcode), the compiler could solve the problem by inlining, with some luck. But for the case of return_call_indirect there is no other possible solution.
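
To make the indirect case concrete, here is a rough C sketch (names made up): the call target is chosen at run time through a function pointer, so no amount of inlining can remove it.

    #include <stdio.h>

    typedef long (*cont)(long);

    long done(long acc) { return acc; }
    long twice(long acc);

    long pick(long acc) {
        cont k = (acc > 1000000) ? done : twice;
        return k(acc);        /* indirect tail call: target unknown at compile time */
    }

    long twice(long acc) {
        return pick(acc * 2); /* direct tail call back into pick */
    }

    int main(void) {
        printf("%ld\n", pick(1));
        return 0;
    }

This is the shape of code that needs return_call_indirect: the engine has to be told that k(acc) must not grow the stack, because the compiler cannot know which function k will be.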


Heya,

(1) Thank you for implementing this in JSC!! I hope they take it, it makes it into Safari, and the tail-call proposal advances.

(2) I don't think you are exactly right about the call stack being observable via thrown exceptions. There's no formal spec for the v3 exceptions proposal yet, but in the documents and tests, there's nothing that would change in WebAssembly core to make the call stack observable. (There's no ".stack property" that would be added to Wasm itself.) It's true that the proposal amends the JS API (but only the JS API) to describe a traceStack=true option; from Wasm's perspective I understand that's just an ordinary exception that happens to include an externref value (just like any other value) to which Wasm attaches no special significance. The Web-based engines can attach an informative stack trace if they want, but there's no requirement preventing frames from having been optimized out. The non-Web engines won't have to think about this.

(3) I think the real reason that a Wasm engine can't implicitly make tail calls proper is that the spec tests forbid it, basically because they didn't want the implementation base to fragment by having some engines perform an optimization that changes the space complexity of a program, which some programs would have started to depend on. (The spec tests say: "Implementations are required to have every call consume some abstract resource towards exhausting some abstract finite limit, such that infinitely recursive test cases reliably trap in finite time. This is because otherwise applications could come to depend on it on those implementations and be incompatible with implementations that don't do it (or don't do it under the same circumstances.)")

But the issue is much weaker than "call stack is observable" -- it's more like "infinite recursion must trap eventually, but it can be nondeterministic when."

More discussion here: https://github.com/WebAssembly/spec/issues/150


> Implementations are required to have every call consume some abstract resource towards exhausting some abstract finite limit

If an implementation really wanted to, it could get around this by incrementing a "function call counter" that traps at, say, 2^64, rendering the requirement effectively moot.

I feel like these kinds of situations are where being practical might make more sense than being mathematically precise. Something like "each function call must consume at least N bits of memory" or something concrete like that. Or heck, "implementations may not perform tail-call optimization" or even "implementations must be able to reconstruct the full logical call stack at any point".


> each function call must consume at least N bits of memory

This is going to be a problem for any long-running recursive program. Why would we mandate this?


You don't want it to be implementation-specific whether tail call optimization is performed or not.

The nightmare scenario is that you spend months writing a program that works great in browsers A and B, but when you roll it out to prod it immediately fails in browser C, which doesn't perform TCO.

You only want to write code that depends on TCO if you can get a guarantee that TCO will be performed. This is the motivation for the clang::musttail attribute in C/C++ (I implemented musttail in Clang): https://clang.llvm.org/docs/AttributeReference.html#musttail
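
For illustration, a minimal sketch of how the attribute is used (hypothetical function; needs a recent clang):

    /* With musttail, the compiler must reuse the caller's frame or
       refuse to compile -- it never silently falls back to a growing
       stack. */
    long countdown(long n, long acc) {
        if (n == 0)
            return acc;
        __attribute__((musttail)) return countdown(n - 1, acc + n);
    }

    int main(void) {
        return countdown(1000000, 0) > 0 ? 0 : 1;
    }

clang also checks that the caller's and callee's signatures match, since the caller's frame is reused in place.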


This might be a silly question, but why is that a nightmare scenario? The exact same argument could be made for any compile-time optimization. If one compiler inlines a function where another doesn't, or unrolls a loop, or performs better constant propagation, any of those could impact the resource usage between the two compilers, leading to exactly that same scenario. But I wouldn't want to forbid improvements altogether out of a desire for consistency.

To me, I'd see a distinction between code that requires a specific optimization and code that could benefit from that optimization. If the standard forbids an optimization to avoid the former, it also removes the latter.


That is exactly why tail call "optimisation" is not, in fact, an optimisation.

Indeed much of this discussion, including the title and text of the original post, carefully avoids using the word "optimisation". Somehow it got introduced at some point in these comments.

To flip it the other way: consider a normal (non-tail-call) program and choose a local variable in a routine that happens to be called quite a lot. Replace it with a list that you append to every time you call the routine, even though only the last entry is ever examined. The program will leak memory, potentially quite quickly, and eventually crash. Is it fair to say it's just less optimised than before? I would say it's worse than that: it has an actual bug.

That's exactly the situation in a program created with the assumption of tail calls that is then used in an environment without them.
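
To make the analogy concrete, a sketch in C (purely illustrative):

    #include <stdlib.h>

    struct node { long value; struct node *prev; };
    static struct node *history = NULL;

    /* history->value plays the role of the old local variable: only
       the newest entry is ever examined, yet every call grows the
       list -- exactly how stack frames pile up without tail call
       elimination. */
    long routine(long x) {
        struct node *n = malloc(sizeof *n);
        n->value = x * 2;
        n->prev  = history;
        history  = n;          /* appended on every call, never freed */
        return history->value;
    }

    int main(void) {
        long last = 0;
        for (long i = 0; i < 1000000; i++)
            last = routine(i); /* steadily leaks memory */
        return (int)(last & 0xff);
    }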


> Replace it with a list that you append to every time you call the routine, even though only the last entry is ever examined. The program will leak memory, potentially quite quickly, and eventually crash.

I had to think on this example a bit, but I believe these programs exist and are quite common. This is an accurate description of any program in which memory is allocated and never explicitly freed. Some languages include runtimes that perform garbage collection, but until the garbage collection runs, the program is leaking memory and would crash without the garbage collection.

Which is to say, I think I agree with you, but had to chew over the example a bit. That it isn't a quantitative difference in speed or memory usage, but a qualitative difference in the types of programs that are acceptable in a language. Thank you for it.


> Is it fair to say it's just less optimised than before? I would say it's worse than that: it has an actual bug.

Yes, a program that depends on tail call optimization, despite having no guarantee of tail call optimization being performed, has a bug.

That doesn't mean that tail call optimization is not an optimization. In languages that allow, but do not guarantee, tail call elimination, it is an optimization performed by the optimizer that makes a program run more optimally than it would have otherwise. There is a long history of referring to it as an optimization, for example in the docs of LLVM: https://llvm.org/docs/LangRef.html#call-instruction

I don't think it makes sense to redefine the word "optimization" such that it excludes examples where a program could erroneously depend on a certain optimization being performed.


OK, if you prefer to be precise, there are two situations:

(1) In languages (or runtimes) where tail calls are not guaranteed to be elided, tail call elision is an optimisation. In that context, you could quite reasonably describe it as tail call optimisation. Programs that are written in these languages and rely on tail call elision are buggy.

(2) In languages (or runtimes) where tail calls are guaranteed to be elided, tail call elision is not an optimisation. In that context, you cannot reasonably describe it as tail call optimisation. Programs that are written in these languages and rely on tail call elision are not buggy (at least, not for that reason!).

This conversation is about making tail call elision a guaranteed feature of WASM. Therefore, in this context, it is not reasonable to describe tail call elision as an optimisation.


Most optimizations only offer constant factor improvements. Tail call optimization is the difference between O(1) and O(N) stack size.


Tail calls can be optimized at the source code level too. When I first read about them decades ago I thought it was a stupid notion, because I did not yet understand how a good compiler can optimize them away. Then I wondered why certain people felt the need to write code that way. I still wonder that.


I wrote an article about my motivation for it: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...


Thanks for that. Having written a threaded interpreter before, that immediately made sense. It does seem like you used some other hints to the compiler as well, so this would fall under "experts only" code, and that's OK. I still wonder how well good old "goto" might do, but once you move away from a huge switch/case to individual functions I can see how it might help a bit.

I still don't understand why some people seem to like the idea of iterating simpler things with recursion using a tail call.


> The exact same argument could be made for any compile-time optimization.

No.

For other optimizations, the difference is the speed of execution. For tail call optimization, the difference can be between normal execution and a stack overflow.


Any optimization that removes an unnecessary stack allocation could also be the difference between normal execution and stack overflow, due to decreasing the size of the stack frame.


But the effect is much more likely in tail recursive cases.


The reason for the existing mandate is to be a problem for recursive programs, so it is a feature, not a bug, of the proposed refinement that it more reliably achieves that.

(Specifically, to prevent one implementation getting an optimization for them that leads people to rely on it, making code that blows up in practical use on other implementations, despite both being nominally correct implementations of the same spec.)


Try writing a loop over all 64 bit values. The computer will break before the loop finishes.


It is implicitly already mandated, depending on how finite your finite limit is.


> The spec tests say: "Implementations are required to have every call consume some abstract resource

It's a nice detail for you to highlight, but surely that exists in the test spec specifically because the opcode spec doesn't mandate tail-call handling. If the latter did (a la Scheme), then the test spec would be updated to no longer have this note...?


The proposed spec is to have an explicit tail call instruction separate from the regular call instruction for which resources must be used.

I think this would be the correct design even if tail calls were desired from the start. It makes for a better debugging experience and allows language implementations more control. (A trend in some functional languages is to require explicit annotation when tail recursion is desired, both to improve debugging for regular functions and to allow the compiler to check that you don’t accidentally stop being tail-recursive)


Fair enough. Sounds more like an issue with .stack being an observable property though.

I've long suspected that exceptions making it into WASM is its million-dollar mistake, and this further confirms it.


I haven't messed with .stack, but on first impression I agree with you that it seems like a mistake.

Even if it weren't observable, though, I think guaranteed tail-call elimination, via either a new opcode or a required property of regular calls in the spec (we already missed that ship, of course), is important. Without it, proper compilation and execution of some languages on wasm would depend on an otherwise invisible property of the engine. That is, tail call elimination isn't just an optimization; it's a critical aspect of language semantics that must be guaranteed before it can be relied on.


Whether it is a mistake depends on the point of view. What seems more important: exceptions or tail-call elimination? Most modern languages use exceptions in one way or another; conversely, very few use the latter feature. From the point of view of existing software it is way more important to optimize VMs for exception handling than for tail call elimination.


Well wasm doesn't have exceptions right now either! The exceptions proposal is also in phase 3, like tail-calls. Right now wasm just supports traps, which can't be caught or inspected from within wasm code, and don't actually need to record stack frames (but the JavaScript host does in web browsers). I'm not sure what benefit exposing the wasm part of the stack for traps gives the JavaScript host side except perhaps easing debugging? (though that is important, I think there are other valid solutions for that)

To be clear, I do want wasm to support the major language use cases, and I think implementing exception support is a good (though tricky, see the exceptions proposal) idea.


After reading the discussion, I think I'm much more in favour of adding tail calls, than adding exceptions.

Actually, I wonder if one couldn't adapt the tail-call mechanism slightly to also serve as an exception mechanism. I think you could do both by allowing `br` to jump to an arbitrary previously established label. An exception would then simply be a "tail call" that unwinds more than one stack frame and calls into the "catch".

Not sure how well JITable this would be, but I think an arbitrary backjump and call should be mostly fine in terms of control flow?



Calling into an external JavaScript function, which will then throw a JavaScript exception (as the linked documentation states).

Wasm cannot throw the exception itself, nor can it catch the exception. How exceptions raised in code called via FFI into another language are handled is a property of the embedding and the wasm engine, not of wasm itself, which (as of yet, barring that exceptions proposal) has no concept of them besides unrecoverable traps.


Great thank you for the info


Would be interested in what other solutions you envision. In production, error reporting to services like Sentry relies on getting access to a stack, and unlike in a native runtime there is nothing one can do to create a stack without the engine's support.


If you control the compiler, you don't need the engine's support for exceptions (though you might want it for performance reasons) - emitting function prologues and epilogues that maintain debugging information like a call stack (or call trace in a ring buffer like some Schemes) would be one way.

I actually did this in my purely functional language, but had it off by default for performance reasons - if you hit an error at runtime, the debugging code would re-execute from the last checkpoint with debugging information turned on to construct the missing debug data, something I could only do because the language is side-effect free.
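
As a sketch of what such emitted bookkeeping could look like (all names hypothetical, C for concreteness):

    /* The compiler inserts shadow_push/shadow_pop into every prologue
       and epilogue; on an error, shadow_stack[0..shadow_top) is the
       logical call trace. */
    #define SHADOW_MAX 4096

    static const char *shadow_stack[SHADOW_MAX];
    static int shadow_top = 0;

    static void shadow_push(const char *fn) {
        if (shadow_top < SHADOW_MAX)
            shadow_stack[shadow_top] = fn;
        shadow_top++;
    }

    static void shadow_pop(void) { shadow_top--; }

    long fib(long n) {
        shadow_push("fib");                /* emitted prologue */
        long r = (n < 2) ? n : fib(n - 1) + fib(n - 2);
        shadow_pop();                      /* emitted epilogue */
        return r;
    }

    int main(void) { return (int)fib(10); }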


It's more important for exception handling to be implementable than for TCE to be implementable, yes. But it's more important for TCE to be provided by the virtual machine (or, alternatively, for call and return to be manipulations of user-level state, as on traditional CPUs, rather than magic privileged operations as in Wasm) because you can implement exception handling with TCE but you can't implement TCE with exception handling.


Maybe I'm naive, and I surely don't know anything about the WASM spec, but aren't CPUs implementing both exceptions and tail call elimination as gotos (some jump machine code instruction)? Exceptions might have to pop some frames; TCE doesn't.


To avoid some classes of security issues, and also because it started out much closer to asm.js in features, WASM uses structured programming control flow like blocks, if..then, and break statements. Functions can only be called by the call instruction right now, so this proposal is needed to jump to a function body.


Are you saying there is no goto in wasm? Genuine question, I don't have any knowledge of WASM.


Yes, that is correct.


Plenty of C/C++ code de facto assumes it’s present even though it’s not mandated. Anybody trying a list algo in recursive style is relying on it implicitly.


The point is to run compiled langs like C++, though. Technically x86, ARM, etc. don't directly support exceptions either, but I assume WASM also has to define an ABI between WASM modules, and ABIs usually are exception-aware. Or did they get added for some other reason?


> Tail calls are fundamentally something that the compiler _cannot_ solve

Possibly a stupid question as I haven't given this much thought, but I thought tail call elimination could be used to convert recursive calls in tail position into loops. Could a compiler not do this (like Scala does, for example)?


Tail-call elimination applies to any function which calls another function as the last action.

Turning this into a loop is only possible when the function calls itself, not when it calls another function, and not (naively) when two functions call each other in the tail position.
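
The textbook mutual case, sketched in C: neither function can be turned into a self-loop on its own, because each one's "loop body" lives in the other function.

    int is_odd(unsigned n);

    int is_even(unsigned n) {
        return n == 0 ? 1 : is_odd(n - 1);  /* tail call to a different function */
    }

    int is_odd(unsigned n) {
        return n == 0 ? 0 : is_even(n - 1); /* and back again */
    }

    int main(void) { return is_even(10) ? 0 : 1; }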


Understood, but in practice how often does that really happen? (I'm aware of the contrived 'odd'/'even' example always given, but in the real world, I never had anything like that). Even when I wrote more functional code I rarely (if ever?) did that. 98-99% of the time it was the same function calling itself. Is your experience different?


Iteratees are immensely useful in practice, and they involve doing this all the time. Basically you do stream processing with a source and a sink that are mutually recursive: the source emits an element by calling the sink to process it; the sink processes it and then recurses into the source for the next element. It feels a bit forced if you describe it like that, but it gives you a model with lots of natural structure - it's the only approach to stream processing I've found where you can have a value that represents a stage in the middle of processing a stream (not just a 1:1 function but something that does grouping or suspension or the like) in a natural way, that you can then test standalone, etc.
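
A stripped-down sketch of the shape of it, in C for concreteness (names made up; a real iteratee library is far richer):

    #include <stdio.h>

    void sink(long value, long remaining, long total);

    /* Source and sink hand control back and forth; every call is in
       tail position, so with guaranteed tail calls this runs in
       constant stack space. */
    void source(long remaining, long total) {
        if (remaining == 0) {
            printf("total: %ld\n", total);
            return;
        }
        sink(remaining, remaining - 1, total);  /* emit one element */
    }

    void sink(long value, long remaining, long total) {
        source(remaining, total + value);       /* pull the next element */
    }

    int main(void) { source(1000000, 0); return 0; }

Without tail calls that is two million stack frames for a million elements.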


> the source emits an element by calling the sink to process it, the sink processes it and then recurses into the source for the next element

I think I understood that concept but it still seems a bit strange to me. Doesn't that imply a potentially infinite stack of function calls?

I've seen this idea in other functional program examples as well. Instead of having a sequence of instructions, they have a sequence of function calls. Instead of a function returning to the caller, it calls the next function in the program which is analogous to the virtual machine moving on to the next line of code in the sequence. It implies the program's structure and state is actually expressed within the function call stack and its frames. I admit I'm not sure what the purpose of this is.


> I think I understood that concept but it still seems a bit strange to me. Doesn't that imply a potentially infinite stack of function calls?

Yes, which is exactly why tail calls are so important. Otherwise you'd stack overflow after processing the first few thousand elements of your stream.

> I've seen this idea in other functional program examples as well. Instead of having a sequence of instructions, they have a sequence of function calls. Instead of a function returning to the caller, it calls the next function in the program which is analogous to the virtual machine moving on to the next line of code in the sequence. It implies the program's structure and state is actually expressed within the function call stack and its frames. I admit I'm not sure what the purpose of this is.

Well, if you want to express your function as an actual function, in the mathematical sense, recursion is often the natural way to do it. Things like the call stack are an implementation detail; often a natural way to describe a function is "do this, do that, then do the same thing for all the rest of it", and usually the easiest way to translate that into a program is recursion.

It also tends to be more compositional, because you don't have to worry about control flow (or at least, control flow isn't a special case: all your control flow is just function calls). For example if you are calling several nested functions in an imperative-style loop, there's no way to break out of that loop from the inner function. But if you write the same thing as several mutually recursive functions, you can "break out of the loop" by just not recursing under particular conditions.


Thanks, I think I have a better grasp now.


> Understood, but in practice how often does that really happen?

In functional code written where general tail call elimination (or special tail call operations) are available, non- or mutually-recursive tail calls are not uncommon.

Heck, non-recursive (potentially fairly nested) tail calls are common in languages without any optimization for it. Anytime you see something like “return f(x)” (or, because operators are implemented via special methods, even “return x+y”) in Python, for instance, you've got a tail call.

If it isn't direct or mutual recursion, it has to be a pretty degenerate case to blow the stack even without elimination, so elimination isn't strictly necessary for those cases. But those cases still benefit from not having the extra memory usage and from reducing the pushing of stuff onto and popping it off of the stack.


Happens a fair bit of time when parsing deeply nested structures (e.g. parsing deeply nested JSON structures) using mutually recursive parsers. I've generally resorted to either trampolining or explicit stacks.


This is usually a case where TCE doesn't help, in my experience, because most of the calls to sub-parsers aren't in tail position. This is especially unpleasant when you get bug reports of mysterious segfaults that turn out to be stack overflows...


Yep, a basic security check for any endpoint that accepts JSON: what happens when you hand it 4KiB of '[' characters


> parsing deeply nested JSON structures

How likely is it that the depth of the JSON structure would exceed the maximum depth of the stack? This doesn't appear to be a compelling example to me.

It probably is a shame that it's called tail call optimisation since nobody wants to mandate low level details for an optimisation.


> How likely is it that the depth of the JSON structure would exceed the maximum depth of the stack?

Not that unlikely, I'd say, at least if the stack size is unreasonably small.

For example, on nodejs, JSON.stringify() chokes with a nested array of 10K depth:

  > N=10_000; JSON.stringify(JSON.parse("[".repeat(N) + "]".repeat(N)))
  Uncaught RangeError: Maximum call stack size exceeded
      at JSON.stringify (<anonymous>)
Curiously, JSON.parse() is OK with arrays of 10M depth (and probably more). A difference of over 3 orders of magnitude between functions that seem pretty symmetrical at first glance.

Imagine that you build some software that lets users nest things (e.g., a circuit designer, a visual programming language, a UI builder), and you want to save these nested structures as JSON, or process them recursively. Pretty natural fit, I'd say. Now, as a user, I'd be pretty upset if my magnum opus of a very complex circuit, or giant visual program, or very busy beast of UI art, started crashing the program when exceeding a mere 10K elements. My computer has gigabytes of memory! And, presumably, there's a lot still available. But nope, a meager stack can ruin all the fun.

I personally think this is a pretty silly limitation to have on software, especially given today's memory sizes.


TCO likely wouldn’t help here, as the parser is probably using non-tail recursion to build the nested arrays. (I haven’t looked at the code, but that’s the natural way to do it with a recursive descent algorithm.)


It’s a security/DOS vulnerability if the object depth can be used to trigger stack overflow.


For example, pretty much any higher-order function that wants to do some decision-making and then hand over execution to one of several provided functions will want to do that handover by means of a tail call, so that it doesn't unnecessarily change the stack complexity of whatever algorithm is being run around that piece of code.


One example would be in a parser, as various parsing functions would be invoked recursively until end of input, or until some non-matching input is encountered. In pretty much any scenario where you would use recursive functions, it is quite common to have mutual recursion between multiple functions. Not always of course, which is where the tail call elimination optimization is typically applied, but not being able to freely use mutual recursion to arbitrary depth turns out to be quite an annoying limitation.


State machines can often be implemented with a set of functions that each handle a single state and then tail-call into the function for the next state.
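
For instance (a made-up machine, sketched in C): with guaranteed tail calls every transition below is a jump and the machine runs in constant stack space; without them, each transition burns a frame.

    #include <stdio.h>

    void state_running(long ticks);

    void state_stopped(long ticks) {
        printf("stopped after %ld ticks\n", ticks);
    }

    void state_paused(long ticks) {
        state_running(ticks);            /* tail call: resume */
    }

    void state_running(long ticks) {
        if (ticks >= 3) { state_stopped(ticks); return; }
        if (ticks % 2)  { state_paused(ticks + 1); return; }
        state_running(ticks + 1);        /* tail call: next tick */
    }

    int main(void) { state_running(0); return 0; }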


This is just a fancy form of spaghetti code. Don't do it unless you actually need the performance.


State machines are known for a predilection to use `goto` extensively, but that does not make them 'spaghetti code'.

A state machine is, in fact, a well-understood method of applying structure to spaghetti.


The point is that tail calls obscure the transitions between states by spreading it everywhere.


I'm not too clear on how one would implement a state machine without "spreading the transitions everywhere". Can you share a link to some example code showing the format you prefer?

Typically, we'd see the code that implements the logic for each state gathered together. Since this is where the outgoing transitions get decided, the transitions end up being distributed across their originating states.


erlang/otp gen_fsm maybe?

you define a function with a clause per state (and some other args including the incoming message), and each clause returns a tuple with the new state and some other stuff. the loop is part of the framework, and handles generic otp stuff like updating code in a running system.


That doesn't compute for me. You'd need to give an example for me to understand you.


> 98-99% of the time it was the same function calling itself

Well, if the same function is in the "tail", tail call elimination would still apply.

But a lot of the time tail call elimination is only needed in some extreme cases; it can eliminate a large class of stack overflow exceptions. C#'s new System.Text.Json would greatly benefit from tail call ops, because it often has a call chain that is recursive but does not call itself (it's mostly like ReadObject, ReadObjectAsync, ReadObject, ReadBla, ReadObject, etc.), and if the object has a ref to itself System.Text.Json often blows up.

This can also happen in Java and Scala, by the way (and I've already seen it); however, at least you can configure the stack size there and thus most often mitigate the problem somewhat.

It's an edge case, but one that would be cool to fix, since it often makes code more reasonable, especially parsers.


In languages like Scheme, Haskell, OCaml, and Erlang it happens all the time, often when the user isn't aware of it; even when these languages have special looping constructs they are often implemented in terms of tail recursion, sometimes mutual tail recursion between several anonymous subroutines.

In languages like Scala, Java, C#, and Swift, it basically never happens.


> In languages like Scala, Java, C#, and Swift, it basically never happens.

Scalaz and cats use tail calls extensively and wouldn't be possible without them.


How does that work on the JVM?


Scala only does single method tail recursion and rewrites it into a while loop. Cats and other libraries use a technique called trampolining which basically moves the frames to the heap (or a mixed technique where they do recurse a certain depth on the stack before switching to trampolining).


That sounds like it means that Scala doesn't support general tail-call elimination, so people don't write mutually tail-recursive code, instead transforming things that would naturally be expressed that way into a different structure that Scala does support.


Yes. It can be a major PITA in certain problem spaces, and I think most people don't pretend it's a great solution but ultimately accept it as a limitation of being hosted on the JVM.


In functional languages, recursive data structures are often most naturally walked with recursive functions. Languages like Haskell and ML take this to its natural conclusion and use it systematically. For example, a linked list. Here's how the Haskell "last" function, to get the last item in a linked list, is implemented in the standard library:

    last [x]                =  x
    last (_:xs)             =  last xs
    last []                 =  errorEmptyList "last"
If given a list with a single item, return that item. If given a list with two or more items, call itself with the list minus its head. If given an empty list, throw an error.


This is the scenario I am used to, where "last" is called in tail position calling itself, but I see from the other answers that there are definitely valid use cases for non-same-function recursion as well.


It's useful anytime you can benefit from the performance advantages of replacing a full fledged function call with little more than a jump. One classic example is implementing an efficient VM.


If you’re implementing a compiler for a language like Haskell, then everything is compiled into mutually-recursive tail calls.


If f1 calls f2, which calls f1, why can't the compiler inline f2 into the body of f1? It's not so complicated if the compiler has annotations or other hints to guide it.


I once wrote an interpreter in continuation-passing style, meaning every final line of a function called the next function in the interpreter, passing a continuation and error continuation on.

Such a program never returns until it finishes interpreting, and can make an unlimited number of such calls.

This is a bad idea without TCE.


Why couldn't the compiler inline the whole interpreter?


Because in the continuation-passing style the functions-calling-functions are circular in many places, inlining would explode rather than halt.


You can turn it into a loop with a case statement inside if you collect the connected component of tail called functions and make those the cases. But it’s gonna be a big function.

You can split it up a bit if you use a trampoline, but then you lose some efficiency.


Though at that point you're basically just one step removed from making your own stack frame.

Which would definitely work, no doubts about that, but it might be nicer to just get it handled at a more fundamental level.


Scala has a weak version that handles self-tail-calls only. It causes significant trouble for e.g. iteratee/streaming libraries in Scala (like FS2) where you always have to use a trampoline.


I don't quite get your point here. Suppose I write a compiler to compile Scheme to WASM. I would have to write it so that it does proper tail calls, otherwise it wouldn't be Scheme. What's stopping me? Would the browsers not run the code? Like, yeah, it would change .stack, but who cares?

Same thing applies, I think, to any other language compiled to WASM. C/C++ compilers regularly inline huge amounts of the code when optimizations are turned on, as well as do tail-call optimization. I haven't tried, but I would assume that they do that for WASM just as they do for x86 or ARM or whatever other build target I choose. As long as my users are ok with this (and they presumably are, otherwise they wouldn't turn optimizations on), what's the problem, exactly?


> I don't quite get your point here. Suppose I write a compiler to compile Scheme to WASM. I would have write it so that it does proper tail calls, otherwise it wouldn't be Scheme. What's stopping me?

Stack overflows are stopping you, if you implement Scheme calls as Wasm calls.

In Scheme, tail call elimination is not merely an optional "optimization", so the compiler can't opt to not do it when it's inconvenient.


The control flow in the WebAssembly virtual machine is quite restricted (it's called "structured control flow" in the spec); you can't jump to code freely. Possibly you could compile Scheme by having the whole program in one big wasm-level function and branching between blocks there?


> yeah, it would change .stack, but who cares?

You might enjoy participating in an interoperable standardization process sometime.


I don't mean to minimize anybody's work, I'm genuinely curious about why it can't be done. Like, what's stopping me from making a compiler that optimizes tail calls?

For instance: the C and C++ standards have very strict rules for how to handle floating point math, and compilers aren't allowed to deviate from that according to the standard. Which turns off all sorts of cool optimizations you can do. But of course, all modern compilers implement some version of "-ffast-math" which turns off those rules and allows for the optimizations. It's no longer standard C/C++, but that doesn't mean that switch can't exist. The compiler is a computer program, it can output anything it wants, regardless of what the standard says. Nobody is going to go to jail because you turned on -ffast-math. The code will still run just fine.

So, my question is, why can't you write a compiler with an option that's like "I don't particularly care that .stack changes, do the tail call optimization". A -ffast-math, but for tail calls. Is there a technical reason why you can't do this?


> So, my question is, why can't you write a compiler with an option that's like "I don't particularly care that .stack changes, do the tail call optimization". A -ffast-math, but for tail calls. Is there a technical reason why you can't do this?

Because the stack in WebAssembly is implicit: you can only observe it, never modify it yourself. There are simply no instructions for this. The WASM stack is not present on the WASM heap, like it is on your physical machine.

You simply cannot express tail calls with the instruction set given to you by WASM right now. No hacks are possible.


Is this the reason clojure added the recur form? IIRC, the JVM doesn't support tail calls because Java doesn't need it.

https://clojuredocs.org/clojure.core/recur


> Tail calls are fundamentally something that the compiler _cannot_ solve.

I don't see why. Compilers are how tail calls are literally always implemented. It's not like there's hardware support.

What makes this impossible?

.

> The issue is that to avoid stack blow-up you need the engine to recycle stack frames.

I mean, what's stopping you from just implementing a trampoline?

.

> The problem with this is that in WebAssembly (and JavaScript) the call stack is observable via the .stack property of thrown exceptions.

Okay? This seems fine to me. What makes this a problem?


> Compilers are how tail calls are literally always implemented.

WASM doesn't have the same set of operations as a typical CPU. It's not something a compiler to WASM can do.


Compilers implement tail calls on architectures where you can JMP with impunity. Wasm is not such an architecture - it deals with calls and returns and call stack as concepts.


Trampolines are a hack that penalizes the performance of any language that uses tail calls a lot, like functional languages (Scheme, Lisp, Haskell, etc.).

You could also handle exceptions manually, without VM support, but that too would incur considerable overhead.


.stack definitely seems like a mistake in the context of WebAssembly. I'd be interested in seeing the justification for it.


Because people want a call stack when an exception is thrown, so they can print it out, just the same way they do in JavaScript and every other language.

EDIT: Also as another commenter mentioned, this is a property exposed by the VM engine to the host environment, not something directly observable from within WebAssembly itself.


> Tail calls are fundamentally something that the compiler _cannot_ solve. Trust me, if there were a way, we would have spared ourselves all this work.

I have been out of the loop regarding compiler development for a long time, but what prevents you from converting your program to CPS and using a trampoline like some Lisps still do?


A trampoline is a workaround that wastes memory and time. The idea is to avoid that so that you can use function calls efficiently.
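
For reference, a minimal trampoline sketch in C (illustrative only): every "tail call" becomes a return to a driver loop plus a re-dispatch, which is exactly the overhead being described.

    #include <stdio.h>

    typedef struct step step;
    struct step { step (*fn)(long); long arg; };

    static step countdown(long n) {
        /* instead of tail-calling ourselves, describe the next step */
        step next = { n == 0 ? NULL : countdown, n - 1 };
        return next;
    }

    int main(void) {
        step s = { countdown, 100000000 };
        while (s.fn)              /* the trampoline: bounce until done */
            s = s.fn(s.arg);
        puts("done, and the stack never grew");
        return 0;
    }

Every hop costs a return, a struct copy, and an indirect call where a proper tail call would be a single jump.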


What? Compilers handle tail calls all the time, even in languages like C and higher-level functional languages.

It's just a jmp? Does the wasm VM not have a jmp instruction?!?


https://www.w3.org/TR/wasm-core-1/#control-instructions%E2%9...

That is the list of control instructions. WASM seems to be a bit of a misnomer, as it is not an assembly language in the more conventional sense. It is a structured language and provides a limited goto in the form of branches (in the section I linked to), which are constrained to a particular scope.

If you compile your whole program so it fits within a single function, then yes you can use this to do universal tail call elimination. Otherwise you are restricted to doing TCE only on auto-recursive functions and maybe mutually recursive functions if you have a single entry point (of the set of functions) and decide to optimize by moving all of them into one function.

Otherwise, function calls are performed using one of the two call instructions which (presently) implement a behavior more like conventional call stack/stack frames. This proposal would add a second pair of call instructions that a compiler can emit which the WASM runtime would then optimize (by not generating new stack frames).


So no self modifying code either, since it seems like I can't just get the address of a function and modify the byte code?


That would be used most often for malware, unfortunately.


> Tail calls are fundamentally something that the compiler _cannot_ solve.

How does the famous 1977 Guy Steele paper on compilers optimizing tail calls not apply?

https://dl.acm.org/doi/10.1145/800179.810196


My guess is that this is assuming that the compiler can rewrite a return into a jump. That is, these compilers have control over the stack on the machine.


compilers always have control over the stack on the machine: they generate the code which realizes the notion of a stack (and happily usually have dedicated registers to use for that generated code)


Not something compiling to WASM. Nor something compiling to JVM Bytecode.


If WASM and JVM bytecode have a goto, then the compiler can use it. Anyway, if you've never read that paper I linked to above, I do believe you'll find it mind-blowingly cool stuff. Not kidding.


WASM's version of a goto is a branch, it is constrained to a limited scope. This creates a problem for general tail call elimination but not for auto-recursive functions.

Auto-recursive functions can switch to using a branch or using another loop construct, and the compiler can generate that WASM code. The control structures available are described in the link below. Mutually recursive functions where only one is meant as the entry point could be similarly converted into WASM instructions but where the additional functions are, essentially, eliminated and inlined into the original caller (this may not be a good general solution, though). But general tail calls cannot be converted into gotos as you might with a more conventional assembly language since WASM branches cannot go to arbitrary instruction addresses.

The problem is that in a conventional assembly language, if you do TCE then the goto would go to (har har) the same (or nearly the same) point as a regular function call (maybe it skips the first few instructions, depending on the calling conventions involved). With TCE and WASM you'd have to, essentially, inline the called function, or a variant of it, in order to be able to branch to the start of it. You cannot branch into a different function or to the start of a different function; you have to call the other function, and then you have a single entry point. Which, in WASM as currently specified, means you have to use the current calling conventions, which do not permit TCE.

https://www.w3.org/TR/wasm-core-1/#control-instructions%E2%9...


They don't. That is literally the thing.

Edit: I should say, they don't have anything equivalent that can jump out of the method you are in.

Edit2: This is a bit more clear if you consider what it means for JVM bytecode to have a "return" set of instructions. Why does the bytecode need a return, if that is all managed by code that the compiler should handle anyway? You can look at the instructions here: https://en.wikipedia.org/wiki/List_of_Java_bytecode_instruct.... Note that it is the "jsr" instructions that let you do the equivalent of a long jump, and those specifically manipulate the stack. There is a "goto", but it is not valid to have that jump outside of the subroutine you are in.


This paper espouses the use of jump+arguments instead of call. Wasm has no control-transfer instruction that doesn't use the stack, nor does it have writable access to the stack pointer.


There are probably WebAssembly-specific things I am unaware of. So, the following is a general "TCO" discussion.

I normally consider TCO a compiler feature. Replace a call to the head of yourself with a jump (possibly to just after the "pull the arguments from the call stack to where the code wants them" prologue), making sure that the correct locals are present where they need to be.

Well, that's for self-TCO. General TCO is definitely trickier (probably requires juggling stack allocations so as to ensure that the tail-called function has enough space for all that it needs).
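
Sketched at the source level in C (hypothetical function; a real compiler does this on the emitted code, not the source):

    /* Before the rewrite the last line was a self tail call:
       return sum_to(n - 1, acc + n);
       After it, the call is a jump back to the top with updated
       "arguments". */
    long sum_to(long n, long acc) {
    top:
        if (n == 0)
            return acc;
        acc = acc + n;
        n   = n - 1;
        goto top;
    }

    int main(void) { return sum_to(1000000, 0) > 0 ? 0 : 1; }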


Wouldn't it be much more productive to do a campaign to get `.stack` out of WASM?


Throwing in my 2 cents to agree with apignotti - as someone who has implemented a compiler for a functional language that emits wasm, it is not possible to solve this performantly in the compiler.

Because wasm only has structured control flow & no way for user code to modify the wasm stack, there isn't any good way to tail call between multiple independent functions, particularly dynamic function calls. Simple tail recursion of a function calling itself is simple enough and can be implemented in the compiler (and I did), but that's it, and not enough to correctly implement Scheme, for instance.

I also want the wasm standard to remain as simple as possible, but this isn't a very complex addition and it is required for reasonable performance for both functional languages as well as C++20 features like coroutines, as mentioned by the article.


Can you please clarify? apignotti wrote

>Tail calls are fundamentally something that the compiler _cannot_ solve.

You write

>it is not possible to solve this performantly in the compiler

So is it fundamental or not? Apologies if the question seems direct or rude.


No worries - it depends on what your constraints and context are. It's fundamentally something a wasm-emitting compiler cannot solve performantly in general. You can use trampolines, as the sibling comments mention, but you'll take a heavy performance hit.

If we stretch things too far we'll end up in "wasm is Turing-complete and thus can run anything" territory. If you're doing something performance-intensive, say x86 virtualization, which I believe apignotti is doing at Leaning Tech, I think it's fair to describe the problem as unsolvable by the compiler.


Thanks, that's pretty much what I suspected. I guess the issue is that the term (fundamental) is being used in a specific context assumed by experts in that area.


> So is it fundamental or not?

Obviously a Turing-complete system can simulate any other Turing-complete system. For example you could write an emulation of any system you like in WASM, and in the emulator you could have tail call optimization. Thus from context it should be apparent that we're talking about efficient implementations.


'call' leaves return state on the stack, so if you use it to implement 'jump', that has to be cleaned up. A trampoline is kinda the only choice if you are forced to run on a stack: go ahead and use 'call' as much as you like, but as the useless return addresses (and often locals) pile up on the stack, at some point you just return all the way back (or do a longjmp equivalent if your environment supports it) and start over again.

This is measurably worse than just using jump, and as another poster pointed out, can introduce an O(n) term that doesn't need to exist. (edit: never mind - if you are going through the forward direction n times then it doesn't change complexity to do a little more work on the way out)


It might be possible for a compiler to generate code that effectively does trampolining. But that would definitely have a performance cost.


Wasm is too high level to implement your own tail calls. The WASM virtual machine handles the call stack & function calling convention, instead of being something that the code itself is responsible for. This means that the compiler can't implement tail calls; WASM doesn't allow a jump instruction in one function to jump into a different function.

There are a bunch of reasons why WASM was designed to be higher level than native assembly. One is that they want it to be more compact, to save bandwidth when downloading via the internet. Another is that they want to make it faster to compile down to native code; one way they do this is by making the control flow more structured, for example forbidding irreducible control flow graphs. There is also the matter of safety. WASM defines several validation constraints that are checked before the bytecode is run, e.g. the target of a jump instruction must be listed in the labels list. If WASM were too low level, it wouldn't be possible to do the safety checks they want.


The compiler absolutely can implement tail calls, I don't know why this keeps getting thrown around. Adding a high-level directive in the spec doesn't enable the compiler to do anything, it just enforces it. The only thing preventing it is browser vendors wanting the .stack property to stay well behaved, but that isn't required by the spec and certainly isn't relevant for non-browser targets.


On most hardware architectures (and software ones based on them), the compiler controls the calling convention: how the stack is managed, what gets pushed there vs. passed in registers, and so forth. The architecture may or may not have helpers, like specific "return from subroutine" instructions, that help manage the stack... but a lot of things are under the compiler's control.

WASM is not like that. It doesn't support jumps or that kind of manipulation. It does not expose enough control over the stack and calling conventions.

In theory could the compiler emit a virtual machine that simulates another machine where it _can_ control these parameters? Sure. It can rewrite it all into a huge loop-and-switch statement. But that is not going to result in anything close to efficient native code. By expressing tail calls directly in WASM the JIT can generate much more efficient code that takes advantage of the platform's native calling convention and tail calls can be _actual_ tail calls on the hardware.


This is simply false.

The compiler cannot implement tail calls correctly as it stands. You do not have access to modify the WASM stack and it's not present on the heap like it is for normal programs.

No compiler tricks can enable tail calls in WASM at the moment (with the exception of trampolines which always work and are absurdly slow).


I think they're referring to a compiler that is targeting WASM, not a WASM-to-machine-code compiler.


Your concern with burgeoning standards reinforcing the browser duopoly is well-founded, but unfortunately you're sticking your stake in on the wrong side of the dragon here.

It sounds like you're confused about how Wasm works, and you imagine that it's like a normal assembly language, with calls and returns implemented by the compiler using instructions that push and pop a stack in memory. You're probably right that that's how it should have been designed.

But that is not how Wasm works. The stack is not in Wasm's "linear memory". (This makes it much easier to compile access to local variables efficiently.) Calls, returns, and in a sense exceptions in Wasm are provided by the virtual machine at a fundamental level in a way that makes it impossible for a compiler targeting Wasm to implement tail calls, call/cc, or cooperative multithreading, except by compiling your entire program into a single function, which ruins the performance of existing Wasm implementations.

But, as Scheme has demonstrated, once the virtual machine supports tail calls, a compiler can implement exceptions and cooperative multithreading by way of compiling to continuation-passing style. So this is not the beginning of a long series of extensions; it's Lambda the Ultimate Goto.
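
A tiny sketch of that last point in C (made-up names): with tail calls available, "throwing" is just tail-calling an error continuation that was passed in, with no unwinding machinery required.

    #include <stdio.h>

    typedef void (*ok_k)(long);
    typedef void (*err_k)(const char *);

    /* Continuation-passing division: instead of returning, tail-call
       one of the two continuations. */
    void div_k(long a, long b, ok_k ok, err_k err) {
        if (b == 0)
            err("divide by zero");    /* the "throw" */
        else
            ok(a / b);                /* the normal return */
    }

    void print_ok(long v)         { printf("result: %ld\n", v); }
    void print_err(const char *m) { printf("error: %s\n", m); }

    int main(void) {
        div_k(10, 2, print_ok, print_err);
        div_k(10, 0, print_ok, print_err);
        return 0;
    }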


The purpose of the wasm rules (e.g. declaring variables/function args instead of having push/pop operations) is to allow the compiler to efficiently determine that functions follow the calling convention and won't be able to mess with non-wasm functions they call or are called by. It means that you can put return pointers on the stack without the wasm ever being able to touch them.


> But, as Scheme has demonstrated, once the virtual machine supports tail calls, a compiler can implement exceptions and cooperative multithreading by way of compiling to continuation-passing style. So this is not the beginning of a long series of extensions; it's Lambda the Ultimate Goto.

Well to an extent. CPS can be extremely slow.

Scheme shows that continuations will also be needed down the road. But yes, that's about it.


IIRC, one shot continuations can be made pretty efficient, effectively folding back into a single stack. If there's a simple way to safely express this in wasm that might be the right balance. If you want multishot continuations you'd need to explicitly clone the one-shot continuation.


I think Koka uses a different approach for its multishot continuations than Scheme does, as well.


In theory CPS is very fast indeed, but of course our hero can easily collapse attempting to cross the chasm of the Sufficiently Smart Compiler. I suspect Chicken using CPS for everything is the main reason http://canonical.org/~kragen/urscheme, which doesn't implement call/cc, is so much faster than Chicken.


> Reference types maybe. But I've yet to see a convincing argument as to why they are absolutely necessary.

Are you referring to WASM GC here (or managed memory in general)?

If so, I think the main compelling reason is interop. Without GC in underlying architecture, any managed language targeting WASM must include its own runtime and GC. If you then want to write a polyglot program using multiple languages, you now have a single WASM executable containing multiple garbage collectors each managing their own objects. If you end up with cycles between those different managed heaps, you can end up in a situation where the GCs are unable to reclaim memory because they don't understand cross GC cycles.

This is why Chrome did Oilpan [1]: so that they could more easily handle cycles between the DOM, JS heap, and (aspirationally at the time) Dart VM heap.

A more general user-value proposition argument is that putting the GC in the WASM implementation means any improvements made to it are amortized across all managed languages that target WASM. Also, it makes applications written in managed languages and compiled to WASM smaller because they don't have to ship a runtime and GC with the app.

[1]: https://chromium.googlesource.com/v8/v8/+/main/include/cppgc...


I'm likely in the minority in my thinking, but I don't think targeting wasm with managed languages is such a great use case for wasm anyway, and so it wasn't worth building in GC to support. For the times you want to, something light like reference counting should be easy enough.


I'm biased since I work on a managed language that targets the web, but I strongly believe that it is a great use case for WASM.

Most programs targeting WASM are client-side apps with rich user experiences, and I believe all but a small fraction of users would be better served by implementing rich UIs in managed languages. Doing UI work in C++ (which I have done plenty of) is a special exercise in pain for almost no upside. GC is great.


I went back to that pain a few years ago after many years away from it. After a few months I got right back in the swing of things, spending my days arranging and solving little memory ownership problems. I got to thinking: "This is programming". And I didn't really mind it.

But every so often I had to be honest and remind myself of what I had noticed at the beginning: thinking about and managing memory is a colossal waste of time, even with smart pointers and the like. The machine can do that, and has been able to do it efficiently for most things for a long time.


JavaScript is already a GC'd language effective at UI, so it doesn't seem to me like there's a strong need for wasm to fill the same role. Rather, it seems to me like wasm is better suited to accelerating processing-heavy tasks those rich web apps depend on, like custom [de]compression, ML models, data processing, image processing or simulation of some kind.


If your language uses a tracing GC, you can’t just add reference counting and expect existing programs to work without memory leaks


> I'm a huge functional programming evangelist, but high-level stuff like this does not belong in a low level language bytecode like WASM.

A tail call is a jump. Jumps are not "high-level stuff". They're very simple instructions. If WASM can't do a jump, it is actually WASM that you can't call "a low level bytecode".


WASM doesn't have jumps [1]. And it's not as low-level as you might expect. But even still you're assuming that low level means a specific model of computation a la PDP-11 that's as fictional as any other.

https://webassembly.github.io/spec/core/syntax/instructions....


WASM isn't particularly low-level, in that it's a compiler's IR, with a few concomitant restrictions on how control flow is allowed to work. Here's a piece on those restrictions, and how they make implementation more difficult:

http://troubles.md/why-do-we-need-the-relooper-algorithm-aga...

> WebAssembly isn’t designed as a front-end language for general programming though, so what’s the problem? Well, WebAssembly does have some constraints, specifically it must produce code that is valid. WebAssembly is a stack machine, and you can’t jump to just any label since that label might point to code that pops too many values off the stack, or pops the wrong types off the stack, or pushes too many values onto the stack.

Also, to summarize another part: when a compiler targeting WASM sees control flow that WASM can't directly express, it has to re-encode it using structured constructs, falling back to a loop-and-switch form when nothing better fits (this is the job of the Relooper algorithm). Compilers taking WASM down to native code then have to understand that encoding and undo it back into a performant form. This adds up to a lot of work most VMs don't require, hence the proposal to move WASM towards a control-flow model friendlier to tail-call optimization, which would bring it more in line with physical hardware.
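
For a rough idea of that shape, here is a hand-written TypeScript sketch of the loop-and-switch form (the block numbering and helper functions are invented for illustration):

  // Control flow the source expressed with jumps gets re-encoded as a
  // dispatcher: an emulated program counter plus a switch over blocks.
  const condition = () => Math.random() < 0.5;
  const doA = () => console.log("A");
  const doB = () => console.log("B");

  function run(): void {
    let block = 0; // emulated program counter
    while (true) {
      switch (block) {
        case 0: block = condition() ? 1 : 2; break; // entry block
        case 1: doA(); block = 2; break;
        case 2: doB(); return; // exit block
      }
    }
  }

A native backend then has to recognize this dispatcher and turn it back into plain branches to produce fast code.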


I never expected WASM to be low-level to begin with, so I don't believe it could ever possibly be "as low level as I might expect". Something like NaCl (the original one) perhaps could have been called "low level".

> But even still you're assuming that low level means a specific model of computation a la PDP-11 that's as fictional as any other.

I don't see how jumps are "fictional". Plenty of CPUs have them. Even actual stack CPUs, for that matter.


I feel like you're ascribing some judgement to the word fictional, it just means that machines, real or virtual, present programs a model of computation that is an abstraction over the implementation. One such abstraction is the notion of a program that is a block of instructions executed in sequence with a program counter, registers, memory, a stack. But that's not the only abstraction you could imagine. The JVM for example eschews manual memory management. And WASM eschews the concept of a program counter, and arbitrary jumps, and instead only allows structured control flow.


I feel like you're using the word "fictional" in a sense that renders all models of computation "fictional" to the same extent and hence renders the word "fictional" itself meaningless/uninformative. Personally I feel like there should be some distinction between features depending on how much supporting machinery you need to physically implement them. To me at least, "low-level features" are those that require the smallest amount of physical machinery to implement them. Things like bitwise operations, indirect addressing, arbitrary IP adjustments, etc. Definitely not things like enforcing stack discipline, enforcing block structure and such. (Just so that you know what I mean when I say "low-level". Opinions on that may differ very wildly, of course, and it's perfectly possible that other people use that term differently.)

I'm not saying that "fictional" things are bad. Tail calls themselves are "fictional" to me in the sense that all tail calls are jumps but not all jumps are tail calls, and the distinction between a tail call and a general jump is too difficult to make for a reasonably sized physical machine and best left to the compiler. But that is not me being judgmental against tail calls, of course -- I love tail calls and consider their lack a huge design failure wherever they're absent.


Jump as a concrete implementation is easy to understand (it's just math on a program counter).

But code is both a mechanical implementation and an abstract, often mathematical, concept. In the concept space, jump is as abstract as call, variable assignment, operator evaluation, etc. A Von Neumann machine is but one way to implement an abstract mathematical / conceptual machine.

In the mathematical space, there's no "high level" vs. "low level," there's just "features a language has" vs. "features it does not." There are other abstractions that are actually harder / impossible to do correctly with jump in the language (stack unwinding comes to mind, which is why C++ solves the exceptions vs. setjmp / longjmp dichotomy by... Not solving it, and a program that uses both will just do something undefined. C++ without setjmp / longjmp would be a language with fewer undefined behaviors).


> Jump as a concrete implementation is easy to understand (it's just math on a program counter).

This doesn't take into account what happens to memory on the stack if you jump into or out of the scope of a local variable. You can say that memory and initialization is not the implementation's problem and rely on the compiler to emit instructions that correctly adjust or initialize the stack before and after jumps. Then, yes, you get a very simple jump implementation.

But you also get an architecture that must either do stack analysis before it can safely run code, or you get an unsafe architecture. Those are both valid choices (the JVM does the former and native architectures do the latter), but there are real trade-offs with them.

A third way, that WASM does, is to say that control flow isn't simple, but that you get safe stack management for free with it.


Possibly, but when I talk about "low-level stuff", I usually imagine very concrete things that are suitable for physical hardware implementation. Things like what the simplest RISC-V chip is designed to do. Unlimited jumps definitely are one of those things. Notions of stack frames or block structured languages (limitations on jumps) etc. definitely aren't among them. So I couldn't possibly ever consider something like WASM to be "low-level", because it places constraints on your code that no reasonable physical CPU would ever enforce.

[EDIT, from a deleted comment of yours:

> The RISC-V chip is in the same family as the PDP-11 architecture. The Turing machine itself requires no JMP instruction, so it is possible to build a CPU without one. Has anyone done so? In general no, because the PDP-11 had profound impact on the world of physical computing. But one does see it from time to time; GLSL, for example, has no `goto` instruction because the graphics card shader hardware it was built to operate on does its magic by running exactly the same instructions on multiple data; jump would imply the ability for one of the instruction flows to be different from the others in its processing cell, which the hardware cannot support while maintaining the throughput it must to accomplish its task.

I get what you're trying to say here, but sort of a guiding principle for me here for many years has been what I call "the principle of minimal total complexity". While you can keep making the CPU simpler in many cases, if that causes your program to become excessively long in the sense that the total size of the description of your program AND the device it's running on starts actually increasing, that's a place you don't want to design yourself into in practice. In case of sequential machines, I don't consider removal of features that makes programs unnecessarily long "making the machine even more low-level", that's just "making the machine dumber" to me. Feel free to look at the Oberon system (both HW and SW) to see what I have in mind by that -- that seems to be about as minimal a complete system as I can imagine in practice.]


To be clear, I deleted that comment because it was inaccurate. GLSL has break, continue, and return, which are flow-control statements. It excludes goto, and my explanation for why it excludes goto is probably incorrect (but I don't have time today to rabbit-hole on why goto was excluded or why the SIMD architecture can support break and continue just fine while excluding goto).

> In case of sequential machines, I don't consider removal of features that makes programs unnecessarily long "making the machine even more low-level", that's just "making the machine dumber"

You may be interested to consider how incredibly complex the modern x86 architecture is to implement because it supports sequential program execution as a core invariant principle. As a result, modern computers (which strive to be faster than a PDP-11) have to do a massive amount of work to parallelize that sequential code because parallel execution is the only frontier of fast computation that remains. They literally rewrite the opcodes on the fly into something that can be SIMD'd and do branch prediction, where code is run speculatively just to discard the result. All to support the idea that deterministic flow control should be possible in 2022. It's a brilliant fantasy the chipset manufacturers have constructed for us so we don't have to reframe our thinking.

I think I get what you're saying, but in modern times calling jump "low level" (or, for that matter, calling branching in general low level) is a very "Do you think that's air you're breathing?" kind of position. I'm not aware of anyone seriously considering approaching the challenge of all this implementation complexity by throwing out fundamental assumptions of our PDP-descendant instruction sets and saying "Here's a new machine code, it's designed for parallel execution, the first assumption you must throw away is the order in which any of these instructions are executed."

But I suspect we're getting very close to that day.


Maybe it was inaccurate but I got what you were trying to say -- that "incoherent execution" is difficult on SIMD machines.

> You may be interested to consider how incredibly complex the modern x86 architecture is to implement because it supports sequential program execution as a core invariant principle. As a result, modern computers (which strive to be faster than a PDP-11) have to do a massive amount of work to parallelize that sequential code because parallel execution is the only frontier of fast computation that remains.

I'm aware what recent CPUs do with ISA instructions. That flies very badly in the face of the minimum complexity principle as well. The amount of physical resources dedicated these days to making your legacy code run just slightly faster is exceptionally wasteful.

> but in modern times calling jump "low level" (or, for that matter, calling branching in general low level) is a very "Do you think that's air your breathing?" kind of position

Well, it's low-level for the kind of minimum complexity system that I'd consider ideal. Not necessarily for the abominations forced on us as an accident of history.


You might be interested in a fascinating game called TIS-100 by Zachtronics. The game is a lot of things, but a core premise is that it imagines a computer from a time approximately parallel to the PDP-11 that took the form of small compute components that were connected to each other instead of a monolithic central processor. It's a fun game, and it sort of raises the question of which is the "abomination[s] forced on us as an accident of history." Because when you look around at most of the biological world, you see heavily distributed systems with some centralization, but computers (man-made things that they are) are heavily centralized, clock-locked, deterministic... And energy-intensive. And slow.

History is arbitrary but not random, and it's fun to think about how things might have been different if the first machines started embarrassingly parallel with follow-up work to consolidate the data instead of embarrassingly centralized with us now in the era of how to make the monoliths fast. It's interesting to think about what's "ideal" about a machine that supports arbitrary jumps (and the global address space that demands, and the sequential execution necessary to prevent decoherence, and the memory protection demanded because some addresses are not executable code and should never be executed, etc., etc.).


> and it sort of raises the question of which is the "abomination[s] forced on us as an accident of history."

What I meant by that is the legacy of AMD64 having thousands of instructions, many of them with arbitrary opcode encoding, half-assed SIMD ISA instead of a proper vector ISA, and the ability to emulate an 8086, all purely for reasons of backwards compatibility. If you started designing a computing ecosystem completely from scratch, surely you wouldn't end up with an AMD64-based IBM PC descendant as your best idea you could come up with?


This would be kind of a sad direction to go in if we didn't have any constraints, since we already have new architectures going in this direction anyway. When you take the crust of the ISA off, though, all processors look similar, and that's where the interesting bits lie.


I don't think web assembly is low level at all. Actually the name is confusing because it's really not much like an assembly language.


Compilers can easily lower tail recursion into loops, but not general tail calls between arbitrary (possibly unknown, indirect) functions.

The best they can do is use trampolines, which come with a high performance cost.
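
For contrast, a minimal TypeScript sketch of the easy case, self tail recursion lowered to a loop (gcd is just an arbitrary example):

  // The recursive call is the last thing that happens...
  function gcdRec(a: number, b: number): number {
    return b === 0 ? a : gcdRec(b, a % b); // tail call to itself
  }

  // ...so a compiler can mechanically rewrite it: reassign the
  // parameters and jump back to the top instead of growing the stack.
  function gcdLoop(a: number, b: number): number {
    while (b !== 0) {
      [a, b] = [b, a % b];
    }
    return a;
  }

Once the callee is a different function, or an indirect call through a function pointer, no such local rewrite exists.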


> Wasm should only care about two things: Security and Performance

For languages that rely heavily on tail calls (functional languages in particular), good performance can only be achieved by having tail calls (or general-purpose jump instructions) as part of the instruction set.

Excluding tail calls hurts performance for these languages as they have to resort to techniques like trampolines. As apignotti suggested, it's impossible for a compiler to generate efficient code for tail calls if the underlying instruction set does not support it.


Not a compiler expert, but don't a lot of VMs for languages with TCO have a special tail call instruction? I know Lua's VM does.

https://www.lua.org/source/5.4/lopcodes.h.html


Webassembly doesn't have a goto.

There are some special cased scoped branches but if you read the spec you'll see they are so specific that you can't branch back to the head.

(ex compiler dev)


WebAssembly is actually very high level. You need tail call support if you want to preserve the original structure of the program (which seems to be the intention). I think you can get around it with e.g. CPS, but then you'd get something more like traditional assembly.


There are two compilers involved, Source to Wasm and Wasm to native (usually JIT).

The second could easily optimize tail calls, but it is required not to do so to increase reliability (all implementations need to agree which calls are tail calls that should be optimized, and the simplest way to agree is to never do it); otherwise a Wasm module could work fine on one implementation and be unusable on another.

The first compiler cannot; the only possibility is to use whole-program transformations that significantly degrade performance.


An example of a tail-call that can easily be compiled to a loop is:

  let sum list =
    let rec loop a = function
      | [] -> a
      | x::xs -> loop (a+x) xs
    in
    loop 0 list
An example of a more tricky case which may be compiled to a loop is the following mutual recursion:

  let rec even_then b n =
    if n = 0 then b else odd_then (not b) (n - 1)
  and odd_then b n =
    if n = 1 then b else even_then (not b) (n - 1)
And this may be compiled either to a loop which uses a trampoline (or switch on the next function to run, which is a bit like a static trampoline) or to two separate functions that duplicate some code and are implemented as loops. A simple way this might get compiled is:

  fn even_then_or_odd_then(which_function, b1, n1, b2, n2) {
    loop {
      match which_function {
        Even => {
          if n1 == 0 { return b1 }
          else {
            b2 = !b1;
            n2 = n1 - 1;
            which_function = Odd;
          }
        },
        Odd => { /* similar */ }
      }
    }
  }
  
  fn even_then(b,n) {
    even_then_or_odd_then(Even, b, n, null, null)
  }
  
  ...
An example of a tail call that is hard to compile arises in a parser combinator library:

  let parse_both p1 p2 =
    fun buf pos success ->
      p1 buf pos (fun pos r ->
        p2 buf pos (fun pos r2 ->
          success pos (r, r2)))
  ;;
(As I’ve written it this design has some issues, but they aren’t so relevant). Here it is important to be able to tail-call a function that you take as an argument (e.g. you want the parser p2 to tail-call the success callback you give it, and if you parse an arbitrary-length list you want to do it tail-recursively). Speaking more generally, general tail calls are like a safe long-distance goto. Wasm doesn’t have goto, and so a whole-program transformation (or a trampoline[1], which may have performance issues) is required to support tail calls.

[1] A trampoline means that functions effectively return continuations, except typically they call their continuations directly (up to some max number of times) and only occasionally return them to the trampoline (which pops everything off the stack), which immediately calls the continuation again. But implementing continuation-passing style this way is pretty slow.
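
To make that concrete, a minimal TypeScript trampoline sketch (all names invented), simplified to the variant that returns a thunk on every call rather than only occasionally:

  type Step<T> = { done: true; value: T } | { done: false; next: () => Step<T> };

  // The driver loop: every would-be tail call bounces back here, so the
  // native stack never grows, at the cost of allocating a closure per call.
  function trampoline<T>(step: Step<T>): T {
    while (!step.done) step = step.next();
    return step.value;
  }

  function isEven(n: number): Step<boolean> {
    return n === 0 ? { done: true, value: true }
                   : { done: false, next: () => isOdd(n - 1) };
  }
  function isOdd(n: number): Step<boolean> {
    return n === 0 ? { done: true, value: false }
                   : { done: false, next: () => isEven(n - 1) };
  }

  // trampoline(isEven(1000000)) === true, with no stack overflow.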


The problem here is that for security, your interpreted code cannot do whatever it wishes to your memory. And for performance, you can't validate every small memory operation your code does.

The result is a fairly high-level VM that doesn't export enough power for your program to implement GC or exceptions without sacrificing a lot of performance. Tail calls is a different matter, and there is a cost-benefit analysis for it.


> We shouldn't have GC, Exceptions or Tail calls in WASM, as long as the compiler can provide them.

I’m not a compiler guy, but I’ve spent a lot of time working closely with runtimes, and can say that befriending GCs/RCs, exceptions, continuations, etc. across them was my least favorite part.

That said, wasm is already a potential target for different runtimes built with no common VM in mind, so interop headaches are inevitable either way.


I think it's Rich Hickey who said something to the effect that tail calls are so fundamental that the underlying platform should be providing them.


tail calls are so fundamental that it's trivial to build a call-push, return-pop stack calling protocol on top of them, but not the converse


Could you point me to some further info on this?


sorry, it's kind of ... basic?

the kind of generalized tail call I'm talking about is generally referred to as a continuation. it's easiest to think of it as a jump. so you can imagine that if we have jump and a stack register, we can push+jump and that's the same as 'call'. we can pop+jump and that's the same as 'return'. since 'call' and 'return' have side effects, we can't really use them to replace jump.

so that's really straightforward, but things do get a bit screwy. if we support arbitrary jumps, the frame storage (arguments + locals) can't necessarily go on a stack, because our frames might have arbitrary lifetimes, so we can't reclaim them in stack order.

so in Scheme, for example, we just pull in a GC, track references to these closures, and dust our hands off with a smirk. if that doesn't work for you then you have to adopt some kind of framework that lets you know explicitly when these can be released. reference counting is a poor choice here because these references between frames can easily be cyclic.
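
a toy illustration of that construction in TypeScript (the instruction set is invented): given only jump plus a stack, call and return fall out as two-instruction idioms.

  type Op =
    | { op: "jump"; to: number }
    | { op: "call"; to: number }   // sugar for: push pc+1, then jump
    | { op: "ret" }                // sugar for: pop, then jump
    | { op: "print"; msg: string }
    | { op: "halt" };

  function run(program: Op[]): void {
    const stack: number[] = []; // return addresses
    let pc = 0;
    while (true) {
      const ins = program[pc];
      switch (ins.op) {
        case "jump":  pc = ins.to; break;
        case "call":  stack.push(pc + 1); pc = ins.to; break;
        case "ret":   pc = stack.pop()!; break;
        case "print": console.log(ins.msg); pc++; break;
        case "halt":  return;
      }
    }
  }

  // run([{ op: "call", to: 3 }, { op: "print", msg: "back" }, { op: "halt" },
  //      { op: "print", msg: "in sub" }, { op: "ret" }]);

the converse direction fails because call and return each carry a stack side effect you can't opt out of.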



And yet Clojure doesn’t have general tail calls (it has recur which allows something like tail calls so long as they are non-mutually recursive) and seems to do ok.


Yes, it's in the context of Clojure; his point is that tail calls should be provided by the JVM itself, and they can then be added to Clojure.


Edit: this is wrong.

For posterity my original comment was:

“From what I understand, tail calls can always (?) be lowered to while loops[1], which are expressible in WASM.

1. https://en.wikipedia.org/wiki/Tail_call#Relation_to_the_whil...


This is possible, and trivial, when self-recursing: A -> A

If you have an A -> B, or A -> [indirect] call, that is not the case.


When you have a set of functions that may recurse into each other, you can convert them into an iterative one by doing a `while cond { switch function_selector { .. } }` block.

It's an ugly piece of code, but because of parsers there is a huge amount of know-how on it.


You really can't.

For example, what if those functions are in different modules or libraries? Tail calls are still supposed to work.


If they are statically linked, it's not a problem, and since AFAIK wasm doesn't support dynamic linking, the only problem is crossing security boundaries, which will probably be kept unsolved anyway. (IMO, this one is better left unsolved; it's too complicated a problem to require for interoperability.)


I am genuinely asking, is your position that a compiler cannot convert:

g(): a = 1+1; b = 2+a; print(b); return f()

into code that does not allocate stack space and just reuses the frame allocated for g()?


It depends on the calling convention used -- whether it is caller clean-up or callee clean-up. With caller clean-up, the stack pointer is left below the argument list when the function returns. With callee clean-up, it gets popped above the argument list. You can perform tail calls naturally if it's callee clean-up. (With caller clean-up, a compiler might hypothetically pull off a tail call if the callee's argument list takes less memory, but it's not something you could add as a language feature.)


Thanks, that explains the trouble I was having. I was thinking about it without any consideration to calling-convention.


Only tail recursion, not tail calls in general.


One need only look at the Clojure language to see how tail calls are not solved by a compiler.


Neat! This proposal caused me a lot of headaches, mechanizing its specification was the primary contribution of my Master's thesis a couple years ago[1]. I forgot until rereading it just now, but doing so caught a typo in the proposal specification[2], my extremely minor contribution to advancing WebAssembly.

Glad to see it finally moving forward after stalling for so long! Excellent work!

[1]: https://github.com/jacobmischka/uwm-masters-thesis/releases/... [2]: https://github.com/WebAssembly/tail-call/issues/10


That might be the shortest (in word count) Master's thesis I have ever seen!


Claude Shannon's masters thesis was quite short as well [0], around 25 pages!

[0] https://www.cs.virginia.edu/~evans/greatworks/shannon38.pdf


Looks typical for a master's thesis to me. Maybe you're thinking of a PhD thesis?


No, in my industry (mechanical engineering) Master (and also Bachelor) theses were always much, much longer. Longer lines, less vertical line spacing, many more pages. Lots of faffing about ('Introduction', 'State of the Art', 'Theoretical foundation', ...). Faculties urge supervisors and students to keep it below 100 pages (of relatively dense type).


Agreed on the formatting of mine being ridiculous with huge margins and line spacing, it wasn't my choice.


Ha well the mechanization was a nontrivial amount of work (for me at least) and was considered part of it too. If it's still short despite that, then welp I guess I got lucky somehow.


> tail-calls has been proposed as an extension to the WebAssembly standard.

Do you know why it wasn't in the standard to begin with?

Even ECMAScript 6 mandates PTC (proper tail call) - article from 2016 on Webkit.org no less - https://webkit.org/blog/6240/ecmascript-6-proper-tail-calls-...


The standard mandates it, and the V8 team implemented it, shipped it behind a flag, then unshipped it for reasons that initially seemed, and ultimately were proven to be, FUD: WebKit shipped PTC and the world didn’t fall down.

The actual reason it was withdrawn is that it would have required expensive changes to Microsoft’s Chakra (the calling conventions were incompatible).

Then Edge died... and Google didn’t add it back. Go figure.

Firefox also has problems implementing cross-realm tail calls. That could be spec’ed around, but there’s no will to do so.


These absolute bozos. A Web Assembler that can't do jumps. What a show.


My comment is about ES6 proper tail calls. WASM is another story


Graaagh, sorry.


It's not that much better. Apparently at least some of the idiosyncrasies in the design of wasm, such as the lack of regular unstructured branches, are due to the internal implementation details of V8.


Every optimizing JIT for JavaScript has the same limitations.


Then maybe we shouldn't be repurposing VMs written for JavaScript for that brand new thing? Ditch the baggage and do it right. Especially given that, if successful, wasm is basically the next JS, and will be around for decades.


I'm all for rewriting things from scratch. That's why I'm doing a new Wasm engine from scratch. But reusing TurboFan is how we took Wasm from concept to near-native performance, shipped in Chrome, in 2.5 years.


That's all well and good - but, again, the spec will be around for decades. Compromises made today to ship things faster means a lot of pain in those years ahead. I pity all the compiler writers who will have to implement relooper again and again.

But instead we'll probably do things like asm.js - that is, bless certain wasm patterns such that advanced VMs would be guaranteed to optimize them. And so everything will be way more complicated than it needs to be. Just like the JS stack today.


That's why multiloop is coming.


Has there been any movement on that (or funclets) recently? All the repos I can find haven't been updated in a while.


If I understand what cross-realm means (code security), then I'm not sure that can be done at all.


Firefox can’t eliminate tail calls across realms. The spec could make an exception for that scenario.


That seems reasonable. Tail call elimination, limited to within a realm, still seems pretty useful.


Thanks for clarifying. What is this PTC you mention?


“Proper tail calls”, i.e. tail calls as specified for ES6


I am unsure, but I can make a guess. The first release of Wasm has always been considered a Minimum Viable Product, with a strong emphasis on Minimum. Pretty much only features already part of the legacy asm.js "standard" made the cut.

The fact that WebKit had some level of support for tail calls on the JS side is exactly the reason we chose it as the right platform to invest in.


How is ECMAScript 6 (high level language) related to WebAssembly (low level target)?


Wasm is meant to run efficiently on the same VM as JavaScript (an ECMAScript implementation), so Wasm features are restricted by current browser VM architectures.

(It is probably the main reason why Wasm only has structured control flow)


Finally :) My coworker back in 2017 implemented the WebAssembly backend for the Go compiler[0], and noted at the time that WASM doesn't have any equivalent to setjmp/longjmp. As a result, Go's WebAssembly implementation actually emulates a register machine on top of the WASM stack machine. Quoting him (some parts omitted):

> For example its architecture is a stack machine instead of a register machine. This means that it isn't immediately suitable as simply yet another target at the last stage of the Go compiler next to x86 and friends.

> There might be an alternative: Emulate what we need. We may use the stack machine to emulate a register machine and hopefully do so in a reasonably performant way.

> WebAssembly has linear memory with load and store instructions, which is good. We would not use WebAssembly's call instruction at all and instead roll our own stack management and call mechanism. Stacks would live on that linear memory and be managed by the Go runtime. The stackpointer would be a global variable. All code would live in a single WebAssembly function. The toplevel would be a giant switch statement (or WebAssembly's br_table based equivalent) with one branch for each function. Each function would have another switch statement with one branch per SSA basic block.

> There are some details that I'm omitting here, but in the big picture this looks like a decent register machine to me. Of course the performance of this heavily depends on how well WebAssembly can transform these constructs into actual machine code.

[0] https://github.com/golang/go/issues/18892#issuecomment-30931...
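
For a feel of the scheme described above, here's a very loose TypeScript sketch (layout, constants and names all invented): one toplevel dispatcher stands in for the br_table, and the call stack lives in "linear memory" managed by the runtime rather than on the host stack.

  const mem = new Int32Array(1024); // emulated linear memory
  let sp = mem.length;              // emulated stack pointer (a global)

  const FN_MAIN = 0, FN_SQUARE = 1, DONE = -1;

  function dispatch(entry: number): void {
    let fn = entry;
    while (fn !== DONE) {
      switch (fn) {                 // the "br_table" over functions
        case FN_MAIN:
          sp--; mem[sp] = 7;        // push an argument on the emulated stack
          fn = FN_SQUARE;           // "call"
          break;
        case FN_SQUARE:
          mem[sp] = mem[sp] * mem[sp];
          console.log(mem[sp]);     // 49
          sp++;                     // pop the frame
          fn = DONE;                // "return"
          break;
      }
    }
  }

A real implementation would also keep return addresses on the emulated stack and a per-function switch over basic blocks; this only sketches the dispatch structure.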


LLVM and other compilers that use SSA but target a stack machine can run a stackification phase. Even without reordering instructions, it seems to work well in practice.

In Virgil I implemented this for both the JVM and Wasm. Here's the algorithm used for Wasm:

https://github.com/titzer/virgil/blob/master/aeneas/src/mach...


Wow sweet! How do you find cool coworkers like these?


This is wonderful news! Thank you so much for doing this, and doing it right - this is much better than my idea of making a bad MVP web browser to provide second implementations to push through proposals when other vendors drag their feet.

As an author of a functional compiler that targets wasm, this is a dream come true - y'all do really cool work.


I'm using Blazor (C#) WebAssembly and I'm really wishing it could do DOM manipulation. My favorite tool for that is Dart, so I'm working on marrying C# and Dart for my client solutions.


I was struck by how reactionary some of the responses to this comment are. Many people seemed to feel obliged to firmly question your preferences. Kind of odd, I don't normally notice that as much here on HN.


I expected it because over the years any Dart- or Microsoft-positive statements I've made have received pushback. Some people can't let go of what Microsoft did 20 years ago and assume they are still evil (with GitHub Copilot, for example). Others who are JavaScript programmers can't let go of the fact that a Google exec said at the outset that Dart was meant to replace JavaScript. They have fought it over the years.

I almost felt like I was trolling by combining the two, but I'm actually using Dart and Blazor to see how they can work together, because Blazor Wasm doesn't do DOM manipulation and Dart does it in such an easy-to-use fashion that the marriage feels natural to me.

I try not to attack back and have a bit of a sense of humor about it, but it does feel like suppression of thought on HN


Well personally I wasn't questioning the preference itself, I was curious about the purpose of the preference since I've seen people avoid JS/TS both because they don't like it and because it struggles to accomplish certain things. I was simply curious which it was (if either).


Genuine question: is the goal to get something that is not achievable with JS/TS, or is the goal to simply avoid JS/TS?


I really dislike JavaScript for a number of reasons. Dart gives me more distance from it than TypeScript. Of course, I can't avoid it, but I don't have to deal with many annoyances such as which 'this' is this? Should I put 'this' into a var called 'self' to be safe? That's just one example of how JavaScript and I don't get along. I understand some people have brains that think this way, but mine doesn't


The comment reads like you used JavaScript 10 years ago. For instance, just use arrow functions and this remains untouched.


I know, but that's just bizarre. One function has its own this and another doesn't, so if my arrow function gets big and I want to make it a regular function I may have to rewrite it. Ugh.

Edit: I'm thinking of cognitive load too. I like to eliminate load I don't need. Keeping track of this is not something I want to do with my life

https://drpicox.medium.com/reducing-programmers-cognitive-ov...


> so if my arrow function gets big and I want to make it a regular function I may have to rewrite it.

There's absolutely no reason to make an arrow function a regular function just because it goes past a certain number of lines.

It really seems like you should become more familiar before forming an opinion. There's valid criticism to be made, but those aren't it.


I sometimes refactor code for purely aesthetic reasons. To each his own.

http://www.synergeticapplications.com/ergonomics.htm

By the way, I feel like I'm at a bar and you're negging me


or, better yet, stop using this


What do you like about Dart over TypeScript? The type system in TypeScript is far superior to Dart's. Other than that, the languages are more-or-less the same.


Actually, I'm taking another look at TypeScript today. I liked it well enough when I last investigated it in 2016, but the code I wrote then was just a veneer over jQuery. I'm sure things have improved dramatically, which I am about to find out. Frankly, a Turing-complete type system [1] seems like overkill to me and a complication I don't need.

Still, Dart had everything I wanted in a front-end language in 2013. I think JS devs were short-sighted to reject it out of hand. It's a shame it never caught on by itself, rather than just as a Flutter language. I read that Google is using it in Fuchsia, so there is still a chance Dart will be big in the future!

[1] TypeScript and Turing Completeness: https://itnext.io/typescript-and-turing-completeness-ba8ded8...


I highly recommend TypeScript. I don't really like the JS runtime, but, purely from a language point of view, TS is my favorite.

Some cool features:

1) Type unions

  interface A { a: string; }
  interface B { b: string; }
  type C = A | B;
  const c: C = { a: 'a' };

2) Type assertions

  if ('a' in c) { /* compiler knows c is of type A here */ }
  function isA(x: C): x is A { return 'a' in x } // compiler knows the argument is A if this returns true

3) Nice helper types

  interface A { ... }

  Readonly<A> // Like A, but all properties are read-only
  Partial<A> // Like A, but all properties are optional
  Omit<A, 'someProp'> // Like A, but without 'someProp'


Interesting. I'll learn more today as I rewrite one of my Dart components into TypeScript. But I also dislike Node and Npm so I'll try Deno first.


TBH, TypeScript's type system really impressed me. It's the strongest type system of any language I regularly use (I haven't had time to unpack Rust yet, and I learned enough Haskell to decide Haskell didn't help me solve problems I had).


It’s an expressive type system, but IME it allows developers to go crazy on type interdependencies and general entanglement, so you can’t just go to the “header” and quickly figure out what your method or a return value really is, despite TS having structural typing.

E.g. look at this: https://github.com/telegraf/telegraf/blob/v4/src/telegram-ty...


That's definitely overwhelming. After reading it for 5 minutes, though: there is a type called Telegram, and it has a field called 'Opts' that has a lot of different fields. MakeExtra lets you take a type from 'Opts' and exclude certain parameters, probably to do something like "you can set the chat message contents, but you can't set the chat id".



Surely it's easier to stick to JavaScript The Good Parts than to introduce a whole new layer in the form of Blazor?


Blazor is not equal to JavaScript. Blazor is Microsoft's answer to React and Flutter. So far of the three, I like Blazor, but not the server version that uses SignalR to communicate with the client because it locks you in. Blazor Webassembly is awesome, but just needs a good DOM manipulation library. With Blazor Wasm on the client, I can use anything on the back end.


> I really dislike JavaScript for a number of reasons.

Might I suggest being less sensitive/that your reasons are overblown?


I don't like liver and horse radish either. Should I eat that because a lot of people like it?


Yes, let's keep using JavaScript for all eternity now. It's the COBOL of this century.


What's C# giving you that Dart isn't, OOI? My impression is they're fairly similar languages. (Does Dart not have a WebAssembly backend?)


Blazor is using C# to build web pages, which isn't new, but using C# on the client with a WebAssembly version of .NET is. It works well enough, but is limited to simple things like onClick="SomeCSharpFunction" and cannot do things like getting elements by id or class. That's where Dart shines. Most will use either JavaScript or TypeScript, but I like Dart, so I want to see how it is to combine Blazor WebAssembly with Dart.

It would be even more interesting if there were a Dart.NET


My point is: why C#, why Blazor? What's that doing that you couldn't just do in Dart itself?


For someone about to go knee-deep into Blazor very soon:

Do you have any gotchas that aren't really mentioned on the MS documentation site? Primarily related to performance, since it will be one of my main concerns.


If one of your main concerns is performance, then all I can say is don't use Blazor. WASM will never be able to compete with regular JS so long as it has to go through an interop layer, plus it has to send down and parse an entire interpreter before it can even start looking at your actual UI and business logic. Server Side has to wait for a server response for every single action you take, and any small drops can make the entire thing sluggish, or even buggy.

That's not to say it isn't cool and can't work for some projects where, say, development speed is more important, but performant it just ain't.


If you're talking site load time you'd have to be careful, because .NET is compiled to wasm, which doesn't leave a lot of room below the standard site size. It's hard to tell because even ordinary sites now are huge. I need to do some testing.


What's wrong with JS bindings? Surely DOM manipulation is slow enough for the WASM/JS overhead not to be noticeable.


You still need to go through a port with Blazor, but you can minimize those calls for the most part. DOM manipulation was slow as recently as a few years ago, but improvements to Blink have made it blazingly fast. I agree with Svelte that we no longer need a virtual DOM

https://svelte.dev/blog/virtual-dom-is-pure-overhead

Edit: I didn't answer your question. There's nothing wrong with JS bindings. Both Blazor and Dart use them, which could get awkward going from Blazor -> Dart (as JS) -> JavaScript lib. I'm considering TypeScript again because no port is needed to call JS libs.


How can I help this?!

I’ve been wanting tailcalls to be in wasm for years! I consider it a blocker for all sorts of uses I’d be interested in


You can use F# with WASM. I know F# converts recursive tail calls to loops, but I understand this is another case. But shouldn't F# have the same problem?


I have not been following WASM generally, but my understanding after reading this:

Unlike regular machine language, WASM does not allow you to control the call semantics. That is, if you emit x86 machine instructions you can handle TCE entirely in your own compiler; the machine itself does not care.

WASM has special call instructions; these perform basically a conventional stack-frame/call-stack approach, and you can't short-circuit it. The closest you can get, which works fine for self-recursive functions and maybe some detectable mutually recursive functions, is to convert tail calls into loops within the compiled WASM output. So recursive tail calls work fine in WASM if your compiler doesn't turn them into calls, but leaves them as a self-implemented (by the compiler) loop structure using whatever appropriate WASM instructions are available. This is fine for self-recursive functions, but puts the burden of implementing TCE for them on the compiler writer. Which means you'll get it for some languages that compile to WASM, but not all, and only for the limited cases they support (probably restricted to self-recursive and some mutually recursive circumstances).

What this proposal introduces is a new version of the call instruction that would be used in tail call positions. Then the WASM interpreters would do the work of actually combining the stack frames internally. Detecting that a call is a tail call is actually pretty straightforward for a compiler. Given that it is a tail call, it can emit (or not) the new call instruction and will get TCE "for free". It doesn't have to do a code transformation from recursive to looping, and it can be applied more broadly than just recursive circumstances.


F# compiles to CIL bytecode, which has a tail call instruction. I'm not sure what happens to that when CIL is compiled to wasm, but since there's no way to do a general translation, it probably just drops the tail part and makes it a regular call.


Self recursion is a special case that WASM backend targets can more or less efficiently resolve the same way a human can convert a recursive function into a non-recursive one: explicitly put arguments onto your own simulation of a stack and perform the computation in a loop.

Unfortunately self recursion is only a subset of tail calling.



I'm working on a relational stream processor, and this is pretty critical to making it work well. A runtime standard with support for network connections would be nice too, but let's not get greedy.


> support for network connections

I wish daily for that as well, but stay tuned: we _might_ have found a half-satisfactory solution for that ;-)


pointer to a draft?


I want threads more than tail calls. WASM has processes with limited shared memory, but not real threads. This is a headache for porting high-performance games to the web.


Tangential but what's the status of garbage collection and DOM manipulation in WASM? Are we ever getting those? I understand it's a high-value technology without them, but I'm interested in writing full apps in say, OCaml (so I'm glad to hear that WASM is getting TCE!).


GC is making a lot of progress. There are VM and toolchain prototypes. You can compile Java and Dart to wasm on those today and it generally works and is pretty fast. (There is also a Kotlin prototype but I have less information about it.) Most of the big spec questions have also been resolved.

DOM manipulation hasn't changed - you still need to call into JS to do those. Ideas like WebIDL bindings have been proposed over the years but haven't shown enough benefit. JS is better for DOM-heavy code, while wasm excels at computation-heavy code. But you can write bindings from wasm (maybe someone already has for OCaml?), which is what toolchains do today - not as fast as JS, but often good enough.


I will add to this that we are in a healthy design loop that is tightening in on what I feel is a reasonable final design that is implemented in at least one high-performance engine, V8. AFAIK there are Igalia folks trying to get a parallel implementation in JSC to meet the 2-engine bar.

GC will not directly make DOM manipulation easier, but it will make it easier to integrate DOM references into a managed language that compiles to WASM, since it obviates the need for indirections through tables. This feature alone majorly opens up capabilities for WASM on the Web!


> JS is better for DOM-heavy code

If you only look at performance, yeah. Performance isn't the only thing we get from WASM, and it would be nice to get the other benefits without having to sacrifice it.

(That said, the marshaling needed for DOM access isn't that relevant, I wouldn't say this is a high-priority problem and would prefer people to focus on GC instead, like they are doing.)


Kotlin 1.7.0 shipped with a WASM compiler recently that currently requires some experimental flags in Chrome to enable features related to garbage collection, threads, and memory management. So, it looks like this is progressing nicely. Once this lands, I imagine there will be a whole range of languages that will be able to make good use of this.

As for DOM manipulation, that seems to work well enough but I'm sure there are further improvements coming.


While I'm also looking forward to TCE and GC for WASM, you can build full apps in OCaml now via js_of_ocaml.


I'm aware of js_of_ocaml and Reason, but I want it to run closer to native speed and use multithreading eventually.


Have you looked into ReasonML by any chance?


I lost track of it since the Reason/ReScript split.


That sounds super nice! Maybe this will turn out to be a nice way of compiling languages which make heavy use of tail calls to the browser (I'm thinking Haskell, PureScript, Erlang, OCaml). For PureScript specifically this means no MonadRec needed, which is quite exciting! (if and when a wasm backend is made, ofc)


I have to resist the urge to implement Scheme in this, right now, but I hope someone else has a chance to try this out for Scheme soon, and help inform the standard before things are finalized.


Why do we need an explicit construct for this as opposed to e.g. having the compiler replace the tail call with a goto back to the beginning of the function?


This was discussed elsewhere in the thread as well - the main problem is TCO support between multiple functions - e.g. when visiting a tree with nodes of various types, such that you have visitSumNode(&total) -> visitChildren(&total) -> visitMinusNode(&total) -> visitChildren(&total) -> ...

Since WebAssembly has a concept of functions, and doesn't allow jumps outside of the current function for security reasons, a compiler can't eliminate tail calls across functions (unless it can inline all of those function calls, which isn't always possible, nor always best for performance).
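
A hypothetical TypeScript sketch of that visitor shape (node kinds invented): every visit ends in a tail call into a different function, so no single-function loop rewrite applies.

  type AstNode = { kind: "sum" | "minus"; child?: AstNode };

  function visitSumNode(n: AstNode, total: { v: number }): void {
    total.v += 1;
    if (n.child) return visitNode(n.child, total);   // tail call
  }
  function visitMinusNode(n: AstNode, total: { v: number }): void {
    total.v -= 1;
    if (n.child) return visitNode(n.child, total);   // tail call
  }
  function visitNode(n: AstNode, total: { v: number }): void {
    return n.kind === "sum" ? visitSumNode(n, total)
                            : visitMinusNode(n, total);
  }

Without tail-call elimination, a deep chain of nodes costs one stack frame per node; with return_call-style instructions the whole walk runs in constant stack space.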


Would this mean that relooper etc. are no longer needed since you can just have all your basic blocks as functions and use tail calls to jump between them?


The industry doesn’t care about functional programming; are tail calls really necessary?


Thanks for doing this!


Thank you for this!


Please don't let Apple block this proposal like they did for browsers. We don't have tail calls because Apple decided they couldn't be bothered to implement it.


Safari has tail-call optimisation.

We don’t have it elsewhere mostly because Google:

1. Agreed to implement it [it’s in ES6]

2. Implemented and shipped it behind a flag

3. Unshipped it.

4. Proposed something else [ https://github.com/tc39/proposal-ptc-syntax ]

Prior to that, Firefox had proposed a carve out for cross-realm calls, but then they didn’t bother implementing anything.

While Apple is against Syntactic Tail Calls, they’re mainly just opposed to versions of it that would remove/unrequire the tail-call optimisation they already do: https://github.com/tc39/ecma262/issues/535

For the version of it that is backwards compatible, they wouldn’t need to do anything other than ignore the syntax. Their main concern is that it "could add confusion with very little benefit."


Safari is the only mainstream browser with proper tail calls implemented. Was there some historical point where Apple blocked it and so others followed along, but then Apple turned around and did it anyway?


JavaScriptCore did not block tail calls. In fact, they were one of the biggest proponents of it because of their Shadow Chicken approach.


ELI5: Tail-calls?


A tail call is a function call occurring in the final position of a function:

  void foo(...) {
    ...
    bar(x,y,z); // <= a tail call
  }
If the function has a return value (rather than void as above):

  int foo(...) {
    ...
    return bar(x,y,z); // <= still a tail call
  }
In the way most languages are compiled, function calls generate a new entry on the call stack (a stack frame). This is necessary for all non-tail calls in order to handle the bookkeeping around what to do when the call finishes and how the caller resumes.

With tail calls, that additional stack frame has no real value (outside, maybe, debugging information to give you a stack trace but traces can be collected other ways). Tail call elimination (or tail call optimization) will reuse the current stack frame rather than construct a new one. This reduces the memory overhead (you aren't constructing unnecessary stack frames) and gives some performance improvement (less bookkeeping overhead, and the function call becomes a simple jump). These two functions can, in principle, get compiled to the same thing if you have TCE:

  // uint is shorthand for unsigned int; the two versions get
  // different names here only so they can coexist in one file
  uint factorial_loop(uint n) {
    uint acc = 1;
    for (; n > 0; n--) {
      acc *= n;
    }
    return acc;
  }
  uint factorial_rec(uint n, uint acc = 1) {
    if (n == 0) return acc;
    return factorial_rec(n - 1, acc * n);
  }
But while that's a recursive example, tail calls and tail call elimination (TCE) aren't just for recursive cases. My first two examples, though they are just sketches, show a non-recursive example of tail calls. With full TCE (it isn't uncommon to have TCE only apply to self-recursive cases) those examples would also have tail call elimination performed.


A contrived example: https://godbolt.org/z/17T5MzvGY

The compiler is able to optimize the function calls in the return statement. Note how the calls are compiled into a jmp instruction, instead of a call instruction. This means that it doesn't need a new stack frame for each call. Even if the "x" is very big, it won't blow the stack.

The most basic application of tail call optimization is when you optimize a recursive function. It essentially turns the recursion into a loop; in fact, that's how you write loops in certain functional languages. But TCO is not just for that. It can also be used when one function calls a different function, like in the first example I linked. In this case the gotos can model any state machine, not just a self-loop.
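
A small made-up TypeScript example of the state-machine case: two scanner states that tail-call each other as they advance through a string.

  // Counts words: inWord / betweenWords are the two states of a DFA,
  // and every transition is a (possibly mutual) tail call.
  function inWord(s: string, i: number, words: number): number {
    if (i === s.length) return words;
    return s[i] === " " ? betweenWords(s, i + 1, words)
                        : inWord(s, i + 1, words);
  }
  function betweenWords(s: string, i: number, words: number): number {
    if (i === s.length) return words;
    return s[i] === " " ? betweenWords(s, i + 1, words)
                        : inWord(s, i + 1, words + 1);
  }

  // betweenWords("tail calls are jumps", 0, 0) === 4

With TCE every transition is a jump; without it, the input length is bounded by the stack depth (and note most JS engines other than JavaScriptCore won't eliminate these calls, so treat this purely as an illustration).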



Tail calls are super important for non-C like control flow, and it's great that it might be added.

But what I really think wasm should focus on is to get near native performance (say, <50% overhead). Until that happens, the whole endeavor seems pointless to me.


Estimates vary (because benchmarks and use cases vary), but the overhead is generally below 50%. That number was seen here,

https://www.usenix.org/system/files/atc19-jangda.pdf

(I assume that's what you refer to?) It's a good measurement, but it's from 2019, and it's just on 2 wasm engines. There are other estimates, like here:

https://kripken.github.io/blog/wasm/2020/07/27/wasmboxc.html

That tries to measure the fundamental overhead of wasm's sandboxing as opposed to a specific wasm VM, and it finds just 14%.


That seems very promising. If it really is that good, wasm has a lot more potential in my eyes.

But e.g. sharp/vips ended up with a much worse result: https://www.libvips.org/2020/09/01/libvips-for-webassembly.h...

It may just be a matter of waiting for SIMD and threads, though.


“proposal to move WASM towards a control-flow model friendlier to tail-call optimization, which would bring it more in line with physical hardware.“

Perhaps it is high time to bring the CPU hardware closer to the current WASM design.

Might simplify a lot of issues related to pipelining, cache-busting, and Spectre-like mitigations, not to mention the crazy varied breakout of legacy microcode into subfunctions (Intel, I am looking at you as well).

But what do I know, I just play with Unicorn engine.


Meanwhile, the new Spectre-BTI variant just now embroils both AMD and Intel in the media.

Naysayers (downvoters) seem sure that a JavaScript engine having this new tail call would not be impacted. Of course, I am talking about the generated microcode, not the JavaScript LIR bytecode.

https://arstechnica.com/information-technology/2022/07/intel...


For those who specialize in emitting platform-dependent native machine code translated from macro-bytecode to MIR to LIR, the following paper outlines the pitfalls of tail calls:

https://www3.cs.stonybrook.edu/~mikepo/papers/devil.ndss15.p...

Speaking in Firefox-ese, within its JavaScript engine, IonMonkey taking JavaScript bytecode down to Mid-level Intermediate Representation (MIR), then OdinMonkey (TraceMonkey) translates to Low-level Intermediate Representation (LIR), then for WarpMonkey (NanoJIT) to translate into native machine code.

OdinMonkey and WarpMonkey should not be handling tailcalls.


Why not focus on features that would unlock better integration of actual mainstream languages - Java, C#, Python etc.?

Let functional programmers discuss monoids on their endofunctor forums or something.


Actually even C++ can benefit from this.

In LLVM, C++ coroutines compile down to functions with the "musttail" attribute, which right now is not possible to express in Wasm [1].

It is also useful for efficiently implementing stuff like interpreters, where you use a tail call as a way to jump directly to the code of the next instruction (Wasm has only structured control flow, but you can see a tail call as a sort of jump/goto instruction)

[1]: https://discourse.llvm.org/t/supporting-coroutines-without-t...
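
To illustrate the interpreter pattern, a TypeScript sketch (opcodes and names invented) where each opcode handler ends by tail-calling the handler of the next instruction; in the LLVM/Wasm setting each of these returns would become a musttail call or a return_call_indirect:

  type State = { code: number[]; pc: number; acc: number };

  const handlers: ((st: State) => number)[] = [
    st => { st.acc += st.code[st.pc + 1]; st.pc += 2; return next(st); }, // 0: ADD imm
    st => { st.acc *= st.code[st.pc + 1]; st.pc += 2; return next(st); }, // 1: MUL imm
    st => st.acc,                                                         // 2: HALT
  ];

  function next(st: State): number {
    return handlers[st.code[st.pc]](st); // tail call into the next handler
  }

  // next({ code: [0, 4, 1, 10, 2], pc: 0, acc: 0 }) === 40

Dispatch becomes a jump straight to the next handler instead of a loop around a switch.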


Recursion has been a part of programming and programming languages for a very long time, and not just the functional ones. See ALGOL for recursion from very early on.


It isn't one or the other. Work on Wasm GC is ongoing and separate from this.


Recursion can indeed be useful when working with trees or graphs. However, I don't like using recursion instead of a loop, because you make the reader unroll your code in their head to understand what it really does. If you need to add two arrays of numbers, use a loop or array addition, but don't use recursion as a replacement for a loop.

Sadly the article uses a poor example: a useless factorial function. I have never needed such a function in production code, and if I had, I would write it using a loop. Using bad examples like this might create the impression that recursion is not useful in real code (which is wrong). It would have been better to use an example that looks like real code and that would become less readable without recursion.


> If you need to add two arrays of numbers, use loop or array addition

There are plenty of algorithms that make more sense when expressed using recursion. Iterating over a list of numbers generally isn't one of them. But walking a tree is a good example.


More specifically: if you don't want to recursively iterate a tree, you have to hold a list of prior nodes to return to anyway... which is no different from weaving that information into the stack. Non-recursive iteration in this case literally requires you to emulate the recursion.
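
In TypeScript terms (tree shape invented), the explicit-stack version makes that plain: the array of pending nodes plays exactly the role the call stack plays in the recursive version.

  type Tree = { value: number; children: Tree[] };

  function sumTree(root: Tree): number {
    const pending: Tree[] = [root]; // the emulated call stack
    let total = 0;
    while (pending.length > 0) {
      const node = pending.pop()!;
      total += node.value;
      pending.push(...node.children); // "frames" to come back to later
    }
    return total;
  }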


Those algorithms usually use non-tail recursion. Most uses of tail recursion are trivial to rewrite as loops.


> Most tail recursion uses usually is trivial to rewrite in a loops.

Except for tail calls that aren't self-calls. Which for code I write is actually fairly frequent.


Readability is really not a relevant factor for WebAssembly. Having tail-call support is critical for the languages that compile to Wasm and use tail calls themselves.



