Maybe? It depends on how you use them. Any asynchronous operation will require memory, and that memory has to go somewhere. In CPS it goes on the heap; in the threaded style you have a stack you can put it on. This will probably waste some memory, though not as much as you might think. On the other hand, you can also reclaim the stack memory as soon as the operation finishes. The CPS technique creates garbage that accumulates until you do a GC cycle, but the whole principle of fast concurrent GC is to let garbage accumulate! I'm really not sure which will end up consuming more memory, particularly if you design your threads to primarily use stack allocation, which is admittedly hard in a language like Java.
Green threads by definition cannot use the OS stack and must allocate their stack memory on the heap. Although this memory can be reused, we know from Go that, to avoid performance bottlenecks (at least for Go code), it is better to allocate the stack as a single contiguous block and copy it to a bigger block when the thread outgrows its current stack size. But then the whole stack space is pinned to the thread and cannot be reused.
For Java it may still be possible not to allocate the whole stack as a single chunk and instead use smaller chunks, say one per few frames. But I really doubt that this can reduce memory pressure compared with CPS in real applications, especially given how good GC has become in Java.
So in Java we know a few things about the stack that are not true for other languages. We know nothing on the stack is a pointer into a Java stack frame, and nothing on the heap points into a Java stack frame. These facts allow us to mount virtual threads onto carrier threads by copying portions of the stack to and from the heap. This normally takes less memory than you’d expect because, although you might have a pretty deep stack, most of the work will happen in just a few stack frames, so the rest can be left on the heap until the stack unwinds to that point.
The big advantage of this over CPS is that you can take existing blocking code, run it on a virtual thread, and get all the advantages: there is no function colouring limiting what you can call (give or take a couple of restrictions related to calling native code).
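To make the "existing blocking code" point concrete, here is a minimal sketch, assuming JDK 21 or later (where virtual threads are a final feature). The class name `VirtualThreadSketch`, the method `blockingFetch`, and the `Thread.sleep` call standing in for blocking I/O are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class VirtualThreadSketch {

    // Plain blocking code, written with no knowledge of virtual threads.
    // Thread.sleep stands in for a blocking I/O call (hypothetical workload).
    public static int blockingFetch(int id) {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return id * 2;
    }

    // Runs `tasks` blocking calls, one virtual thread per call, and sums the results.
    public static int runAll(int tasks) {
        // Leaving the try block calls close(), which waits for submitted tasks.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int id = i;
                Callable<Integer> task = () -> blockingFetch(id);
                futures.add(executor.submit(task));
            }
            int sum = 0;
            for (Future<Integer> f : futures) {
                sum += f.get(); // blocks the caller, not a carrier thread per task
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(1_000));
    }
}
```

Note that `blockingFetch` has no async annotation, callback parameter, or future return type; the only thing that changed is which executor runs it.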
I like CPS precisely because it requires color-annotating the code, so you know what can and cannot do IO! Sure, it decreases flexibility, but it makes reasoning about and maintaining the code easier.
Thread stacks are not OS-level objects; at least on Linux you just malloc or anon-mmap some memory and pass that to clone() or your own green thread implementation.
The question is whether the unused portion of the stack can be used for anything else. With native threads the answer is no, and the same goes for Go's green threads. Time will tell if Java can pull off the trick of sharing unused stack space, but I am sceptical.
With POSIX threads the stack size defaults to something like 1MB or 2MB depending on the platform, but it's not allocated up front -- the stack grows as needed up to that maximum.
The main difference, then, between allocating stack chunks on the heap as needed and stacks grown by the virtual memory subsystem comes down to virtual memory management. If you can use huge pages for your heap, then allocating stack chunks on the heap will be cheaper than traditional stacks.
In CPS the state of the program is captured in the continuation, which is a closure, which is allocated on the heap, and in any ancillary data structures pointed to by it.
However, there's generally only a very small number of such closures -- typically only one -- and they are generally one-time use only. That means they can be freed as soon as they tail-call out. Hello Rust.
I've written hand-coded CPS in C, but I take your point. The issue is that while every function exits via tail-calls, so the stack footprint is small, every continuation is a closure that has to be allocated somewhere, and that somewhere is generally going to have to be the heap.
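For anyone following along, here is a rough sketch of the same idea in Java rather than C, just to show where the closure comes from. The names (`CpsSketch`, `addCps`, `doubleCps`) are invented for illustration. Also note that Java does not guarantee tail-call elimination, so unlike the hand-coded C case the stack still grows here; the sketch only illustrates the allocation side.

```java
import java.util.function.IntConsumer;

public class CpsSketch {

    // Direct style: the intermediate `sum` lives in this method's stack frame.
    public static int addThenDouble(int a, int b) {
        int sum = a + b;
        return sum * 2;
    }

    // CPS building blocks: instead of returning, each passes its result to `k`.
    public static void addCps(int a, int b, IntConsumer k) {
        k.accept(a + b);
    }

    public static void doubleCps(int x, IntConsumer k) {
        k.accept(x * 2);
    }

    // The lambda captures `k`, so it is a closure object. In general that
    // object is heap-allocated, although a JIT may sometimes optimize it away.
    public static void addThenDoubleCps(int a, int b, IntConsumer k) {
        addCps(a, b, sum -> doubleCps(sum, k));
    }

    public static void main(String[] args) {
        System.out.println(addThenDouble(3, 4));
        addThenDoubleCps(3, 4, result -> System.out.println(result));
    }
}
```

The "rest of the computation" after the addition is the lambda `sum -> doubleCps(sum, k)`: it is used exactly once, which is the one-shot property mentioned above.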