“Async” / event loop systems will usually be more efficient because 1) they don’t have to store unnecessary processor state to memory between handling events and 2) they are associated with cooperative event handlers and the assumption of a cooperative system provides more opportunities for optimization.
Ironically, it is not true that "async" (stackless) event loop systems don't store unnecessary processor state to memory.
Those systems store the entire processing state in future object memory. It is the same state that threaded systems store.
Stackful context switching systems (which include some kinds of efficient threads - I would count Linux kernel internal threads among these) store that same state in the stack and context objects when context switching, so in principle they store about the same amount of state. But some async, stackless systems do a bunch of extra work unwinding and restoring entire call stacks whenever an async task pauses in await, and therefore take more CPU time to context switch than stackful systems do.
> It is the same state that threaded systems store.
Nope. Synchronous task switching systems additionally have to store CPU register state. In the cooperative case you need to store the caller-saved registers, in the pre-emptive case you need to store all the registers.
Async systems simply don’t have to do that extra work.
> But some async, stackless systems do a bunch of extra work unwinding and restoring entire call stacks whenever an async task pauses in await
Just the really bad ones. In any competent system a callback is stored ready to handle the event. No “unwinding” necessary.
A cooperative synchronous task switching system (i.e. fiber-based) need only save the exact same information as an async-based one (i.e. stackless coroutines): at a minimum a context pointer and an instruction pointer, plus any live registers (which might be none).
You only need to save caller-saved registers if your context switching routine uses the conventional ABI, but that's not a requirement.
> You only need to save caller-saved registers if your context switching routine uses the conventional ABI, but that's not a requirement.
If you request a task switch from C or any high level language that has the concept of caller-saved registers, and the compiler has no knowledge of your task switching system (the vast majority of cases), you will be forced to pay an extra cost. Is there a practical system in common use that is able to elide register saves that you’re referring to? Or is your point essentially that you don’t have to save caller-saved/live registers in the theoretical case that you have no caller-saved/live registers?
You don’t need explicit compiler help. At least with C this can be done entirely with a library. A task_switch() call can conform to the standard C ABI, requiring no compiler support, and do the switching (in assembly), without Duff’s device. This is, for example, how kernels written in C do their task switching.
Likely the same can be said for Rust and nearly any language, since they all have ABIs that task_switch() can conform to, so that it looks like a normal function call.
Oh, I have written my own share of userspace C context switching libraries, I know all the gory details :). For example, see my minimalist [1] stackful coroutine library: the full context switching logic is three inline asm instructions (99% of the complexity in that code is there to transparently support throwing exceptions across coroutine boundaries with no overhead in the happy path).
You need compiler help for the custom calling convention support and possibly to optimize away the context switching overhead for stackful coroutines, which is something that compilers can already do for stackless coroutines.
Duff’s device is just a way to simulate stackless coroutines (i.e. async/await or whatever) in plain C, in a way that the compiler can still optimize quite well.
> the full context switching logic is three inline asm instructions
You tell the compiler that you clobber the caller-saved registers (GPD_CLOBBERS), so in terms of cost it’s not just three asm instructions. Since these are caller-saved registers they will be live at every point in your code, even if your task switch routine is inlined. You have to consider the code the compiler generates to preserve the caller-saved registers before invoking your instruction sequence when evaluating total cost. This is an additional cost that is not necessary in callback style.
Caller-saved registers (aka. "volatile registers") are only saved when they are live in the caller at the point of a function call, and they are not always live. Code generation tends to prefer callee-saved registers instead at these points, precisely so they don't need to be saved there. Whether callee-saved registers are live at the inline task switch depends on whether they have been saved already in the function prologue, and if they are in use as temporaries. Not many registers are live at every point, typically just the stack and frame pointers.
Both types of code (async and stackful) have to save live state, whatever it is, across context switches, whether that's spilling registers to the stack in the stackful case, or into the future object across "await" in the async case. However, typically the code generator has more leeway to decide which spills are optimal in the stackful case, and the spills are to the hot stack, so low cost. Unless integrated with the code generator, async/await spills tend to be unconditional stores and loads, thus on average more expensive.
You're right about potentially redundant saves at stackful context switch. (Though if you control the compiler, as you should if you are comparing best implementations of both kinds, you can avoid truly redundant saves)
However, in practice few of the callee-saved registers are really redundant. If the caller doesn't use them, its caller or some ancestor further up the chain usually does. If any do, they are genuine live state rather than redundant. There are cases you can construct where no ancestor uses a register, or an ancestor uses one when it would be better not to, so that in theory it would be better not to use it and not to save it on context switch. But I think this is rare in real code.
You must compare this against the various extra state storage, and memory allocations, in async/await: for example storing results in future objects in some implementations, spilling live state to an object when the stackful compiler would have used a register or the hot stack, and the way async/await implementations tend to allocate, fill, and later free a separate future object for each level of await in the call stack. All that extra storing is not free. Also, when comparing against the best of stackful, how many await implementations compile to pure continuation jumps without a return to an event loop function just to call the next async handler, and how many allow await results to be transferred directly in registers from generator to consumer, without being stored in the allocated future object?
I would summarise the difference between async/await and stackful-cooperative as follows: the former has considerable memory-op overheads, but they are diffused throughout the code, so the context switch itself looks simple. It's an illusion, though, just like the "small asm" stackful context switch is an illusion due to clobbered live registers. The overhead is still there either way, and I think it's usually slightly higher in the async/await version. But async/await does have the advantage of not needing a fixed-size "large enough for anything" stack to be preallocated per context, which it replaces with multiple, ongoing smaller allocations and frees per context instead.
It would be interesting to see an async/await transform applied to the Linux kernel, to see if it ended up faster or slower.
I see the confusion now, I intended all of my arguments regarding caller-saved to actually refer to callee-saved registers. Hopefully you understand that you can never avoid preserving callee-saved registers with a cooperative task switching system and that this is not a necessary cost in an async/callback system.
> For example storing results in future objects in some implementations, spilling live state to an object when the stackful compiler would have used a register or the hot stack,
Regarding spill of live registers to heap state in the async case vs stack state in the sync case: contemporary compilers are very good at keeping working state in registers and eliding redundant loads/stores to memory, as long as they can prove that that memory is not referenced outside of that control flow. This is true whether the source of truth is on the stack or the heap. This is due in part to the C11 memory model, which was somewhat implicit before it was standardized. In other words, an optimizing compiler does not treat all loads/stores as “volatile *”. Furthermore, the heap state relevant to the current task would be as “hot” as the stack (in terms of being in cache) since it’s likely recently used. Given that, I am skeptical of the argument that using heap state is slower than stack state due to increased register spillage. Again, just to be clear, this entire case is a separate discussion from the unavoidable added cost of preserving callee-saved registers in the synchronous case.
Where heap state does have a cost is in allocation. Allocating a stack variable is essentially free, while allocating heap memory can be expensive due to fragmentation and handling concurrent allocations. This cost is usually amortized since allocation is usually done once per task instantiation. Contrast this with register preservation, which is done on every task switch.
This only adds to the confusion. Which one is the caller and the callee? The executor and the coroutine? Or the coroutine and the task-switch code?
In my implementation, at the task switching point, there are no callee-saved registers: all registers are clobbered and any live register must be saved (also there is no difference between coroutine and non-coroutine code; control transfer is completely symmetrical), so calling from the executor into the coroutine and from the coroutine into the executor (or even between coroutines) runs the same code.
At the limit, both async (i.e. stackless coroutines, i.e. non-first-class continuations) and fibers (i.e. stackful coroutines, i.e. first-class continuations) can be translated mechanically into CPS form (i.e. purely tail-calling, never-returning functions). The only difference is the number of stack frames that can be live at a specific time: exactly one for stackless coroutines, unbounded for stackful ones. So any optimization that can be done on the former can also be done on the latter, and a stackful coroutine that always yields from top level would compile down to exactly the same machine code as a stackless coroutine.
Thanks for the great contribution to the thread. Pretty much my thoughts.
I do believe that having to allocate a large stack is a downside, but again, with compiler help it should be possible, at least in theory, to compile stackful coroutines whose stack usage is known and bounded (i.e. all task switches happen at top level or in non-recursive inlined functions) down to exactly the same stack usage as stackless coroutines.
Sibling has an amazing long response, but tl;dr, the live registers that are clobbered are exactly the same as those that would need to be saved in the async case.
I see the confusion now. I wrote caller-saved when I meant callee-saved. In general you don’t need to preserve callee-saved registers if you don’t use them during normal control flow, but in the “sync” case you always have to save callee-saved registers on task switch. In the async case, you simply return to the executor.
> ...and the assumption of a cooperative system provides more opportunities for optimization
Also many more opportunities for blocking and starvation. Especially when people try to reinvent schedulers without understanding how they work in the first place.