I'm aware that Rust's threads have higher overhead because they're system threads, but what about green/user threads? Is there intrinsic overhead to userspace threads that async doesn't have? I think every "task"/thread needs its own call stack space, and you're paying scheduling overhead no matter what. Is there literature on the topic?
The short answer is that an async function in Rust (and C++ and C#) is rewritten by the compiler into a state machine (a struct, plus an enum to discriminate the current state, plus a union to store the local variables that live across suspension points). You could do it by hand, though of course it's tedious and error-prone, which is why almost nobody writes software like that.
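To make that concrete, here is a rough sketch of what "doing it by hand" could look like for something like `async fn add_after_yield(a, b) { yield_once().await; a + b }`. The names `AddAfterYield`, `NoopWake` and `run_to_completion` are made up for illustration, and the code rustc actually generates will differ in layout and detail; this only shows the general shape (an enum of states, with the live locals stored in each variant).

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hand-written state machine: one variant per suspension point,
/// each variant holding the locals that are still live at that point.
enum AddAfterYield {
    Start { a: u32, b: u32 },
    Yielded { a: u32, b: u32 },
    Done,
}

impl Future for AddAfterYield {
    type Output = u32;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        // All fields are plain integers, so the type is Unpin and
        // we can get a normal mutable reference.
        let this = self.get_mut();
        match *this {
            AddAfterYield::Start { a, b } => {
                // Suspend once: record the next state, ask to be polled again.
                *this = AddAfterYield::Yielded { a, b };
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            AddAfterYield::Yielded { a, b } => {
                *this = AddAfterYield::Done;
                Poll::Ready(a + b)
            }
            AddAfterYield::Done => panic!("polled after completion"),
        }
    }
}

/// A waker that does nothing; enough for a busy-polling toy driver.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy stand-in for a runtime: polls in a loop until the future is ready.
/// Returns the output and the number of polls it took.
fn run_to_completion<F: Future + Unpin>(mut fut: F) -> (F::Output, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(value) = Pin::new(&mut fut).poll(&mut cx) {
            return (value, polls);
        }
    }
}
```

Calling `run_to_completion(AddAfterYield::Start { a: 2, b: 3 })` drives the machine through its states. Note there is no separate stack per task here: the whole "task" is one enum value whose size is known at compile time.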
For the compiler, the state machine code is amenable to the usual optimizations. You can find examples online where a chain of async functions is optimized down to a single call in main.
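A small sketch of why such chains are easy on the optimizer: awaiting a nested async fn is just a nested `poll` call on one combined value, with no stack switch in between. The functions `leaf`, `middle`, `outer` and the helper `poll_once` are hypothetical names for this example. It does not prove what the optimizer emits (check the assembly for that); it only shows that a chain that never suspends finishes in a single poll.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// A chain of async functions, none of which ever actually suspends.
async fn leaf() -> u32 {
    1
}

async fn middle() -> u32 {
    leaf().await + 1
}

async fn outer() -> u32 {
    middle().await + 1
}

/// A waker that does nothing; we only poll once, so it is never needed.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future exactly once and returns whatever that poll produced.
fn poll_once<F: Future>(fut: F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    // Pin the future on the stack: no heap allocation involved.
    let mut fut = pin!(fut);
    fut.as_mut().poll(&mut cx)
}
```

`poll_once(outer())` returns `Poll::Ready(3)`: the three nested futures live in one stack-allocated value and complete in one ordinary function call.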
In contrast, green threads and the like are very similar to kernel threads. The difference is that they are lighter because there is less baggage to carry around. But composing green threads still requires a context switch, albeit an arguably light one.
As far as I understand, green/user threads were on the table for the Rust team initially, but were ruled out because having green threads also means having a runtime to manage them, which Rust tries to avoid.
What I do not understand is where the principal difference from async lies: async has no built-in runtime, but you have to bring your own for it to work anyway, in the form of, e.g., Tokio. Which Rust goal conflicts with a similar solution for green threads?
The article is specifically complaining about how the design of Rust's async ecosystem forces one to use wasteful Arc<Mutex<T>> and the like in places where the data has no reason to move across threads.
Meanwhile, a green-thread system can happily use local state without atomics.
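A minimal sketch of the difference, using only std. The two functions `requires_send` and `single_threaded` are hypothetical stand-ins for spawn APIs: the first mimics the `Send + 'static` bound that a work-stealing spawn such as `tokio::spawn` puts on its argument, the second mimics a thread-per-core or local spawn that has no such bound.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for a multi-threaded spawn: the task may migrate
// between worker threads, so everything it captures must be Send.
fn requires_send<T: Send + 'static>(_captured: T) {}

// Hypothetical stand-in for a single-threaded spawn: the task stays on
// the thread that created it, so no Send bound is needed.
fn single_threaded<T: 'static>(_captured: T) {}

/// Increments one counter of each kind and returns both values.
fn bump_both() -> (u32, u32) {
    // Shared state that satisfies Send: atomic refcount plus a lock,
    // even if the data never actually leaves this thread.
    let shared = Arc::new(Mutex::new(0u32));
    requires_send(Arc::clone(&shared));
    *shared.lock().unwrap() += 1;

    // Thread-local state: plain refcount plus a runtime borrow check,
    // no atomics involved.
    let local = Rc::new(RefCell::new(0u32));
    single_threaded(Rc::clone(&local));
    // requires_send(Rc::clone(&local)); // would not compile: Rc is not Send
    *local.borrow_mut() += 1;

    let shared_value = *shared.lock().unwrap();
    let local_value = *local.borrow();
    (shared_value, local_value)
}
```

Both counters end up at 1, but only the `Rc<RefCell<T>>` version avoids atomic operations and locking, and it is only accepted where the spawn API does not demand `Send`.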
I personally want Tokio and Glommio to have a baby that inherits good parts of both parents.