Stackswap coroutines, neither stackful nor stackless

throwaway81523 · on Oct 8, 2022

I guess this is cool. It works by copying the current coroutine's stack into a memory buffer when the coroutine yields, and copying the next coroutine's memory buffer into the main stack, rather than having separate stacks for each coroutine and juggling the hardware stack pointer. Obviously the copying adds overhead but for small coroutines it will not be too bad. In return you get rid of the need to pre-allocate memory for each stack, and if a coroutine never blocks, it runs like a normal subroutine, no stack juggling.

It would be great if the post included some suggested applications. I also found the benchmark graphs hard to read. A table with numbers would have been easier.

This is mostly of interest for very low memory systems, right? Otherwise it's easier to use conventional stackful coroutines or multitasking. Even the traditional tiny Forth multitaskers were stackful, I think.

In this environment the stack size of a coroutine should be about constant at the yield points, or else maybe you want to allocate the copy buffers on the heap? Also if the coroutines don't persist, you may need a way to free their storage. Alternatively, maybe you can GC the memory buffer once in a while, by collapsing out unused parts.

nikita_d · on Oct 8, 2022

Stack buffers are allocated in the heap (by the "default" rr.c implementation of usched::s_alloc()). Allocating them on the stack, as you seem to be proposing, is a really clever idea: you can do something like rr_thread_init(&startfunc, arg, alloca(BUFFERSIZE)). The buffer is freed when the coroutine is terminated (rr_done()).

gavinray · on Oct 8, 2022

  > "I also found the benchmark graphs hard to read. A table with numbers would have been easier."

This is my only complaint too, very interesting stuff.

nikita_d · on Oct 8, 2022

I added the tables to the post.

throwaway81523 · on Oct 10, 2022

Thanks, it would also be good to have some descriptions of what the numbers are? What do you do about saving the registers at yield points? Maybe this is most easily done with compiler support.

I had been envisioning copying the stack images to a memory array rather than using alloca, though maybe alloca would also work. I see in your benchmarks you tested this on a quite large system. I had imagined something much smaller, like an Arduino. I might try porting your benchmark to GHC, Elixir, and maybe even Forth.

JonChesterfield · on Oct 8, 2022

Kind of interesting. Allocate a stack for the coroutine on context switch and memcpy the current stack into it.

Pointers into the stack get invalidated on context switch which kills most of the advantage of a stackful coroutine. Alloc+copy per context switch makes context switching more expensive which kills the rest of the advantage. It does seem simpler than dealing with unknown size stacks though.

I see the attraction of allocating the stack on first yield but not on every yield.

fzzzy · on Oct 8, 2022

This is how Python's greenlet works. It works very well.