Thanks. It would also be good to have some description of what the numbers in the benchmark mean. What do you do about saving the registers at yield points? Maybe that is most easily done with compiler support.
I had been envisioning copying the stack images to a memory array rather than using alloca, though maybe alloca would also work. I see in your benchmarks that you tested this on quite a large system. I had imagined something much smaller, like an Arduino. I might try porting your benchmark to GHC, Elixir, and maybe even Forth.
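To make the register question concrete, here is a rough sketch of one way to get registers saved at a yield point without compiler support. It is not the stack-copying scheme I described and it is not your implementation. It swaps in POSIX `ucontext`, with the coroutine's stack living in a plain static array. All the names (`coro_start`, `coro_resume`, `coro_yield`, `counter`) and the 64 KiB stack size are made up for illustration, and it assumes a system that still ships `<ucontext.h>` (Linux with glibc does, an Arduino would not).

```c
#include <ucontext.h>

/* Hypothetical sketch: every name and size here is an assumption. */
enum { CORO_STACK_BYTES = 64 * 1024 };

static char coro_stack[CORO_STACK_BYTES]; /* the coroutine stack is just an array */
static ucontext_t main_ctx, coro_ctx;
static int yielded_value;

/* Yield point: swapcontext() stores the current register state in coro_ctx
 * and resumes whatever was last saved in main_ctx. Because the yield is an
 * ordinary function call, the compiler has already spilled any caller-saved
 * registers it still needs. */
static void coro_yield(int value) {
    yielded_value = value;
    swapcontext(&coro_ctx, &main_ctx);
}

/* Coroutine body: yields 10, 20, 30, then returns. */
static void counter(void) {
    for (int i = 1; i <= 3; i++)
        coro_yield(i * 10);
    /* Returning here continues at uc_link, i.e. main_ctx. */
}

/* Prepare the coroutine so that it runs counter() on coro_stack. */
static void coro_start(void) {
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = coro_stack;
    coro_ctx.uc_stack.ss_size = sizeof coro_stack;
    coro_ctx.uc_link = &main_ctx;
    makecontext(&coro_ctx, counter, 0);
}

/* Run the coroutine until its next yield and return the yielded value. */
static int coro_resume(void) {
    swapcontext(&main_ctx, &coro_ctx);
    return yielded_value;
}
```

Calling `coro_start()` once and then `coro_resume()` three times should hand back 10, 20 and 30 in turn. A stack-copying version would presumably keep the same yield/resume shape but `memcpy` the live part of the stack into the array at each yield, which is where the register-saving question gets harder.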