Re: 100k Threads? (2002) (iu.edu)
48 points by fulafel on Sept 23, 2023 | hide | past | favorite | 66 comments


This 400k-thread demo from 20+ years ago fits into the context of the new Java 21 virtual threads discussion: many people seem to think you can run only a few thousand OS threads, or that they use a lot of memory.


You can't imagine how much I hate these discussions: they make me feel old. Old, because I so vividly remember the Big News of Java 1.2 switching from green threads to native threads.

I'm actually not that old, and I didn't use Java at the time, but that news was my introduction to the term "thread", so it kind of stayed with me. A tiny part of my brain still files "threads" under "the thing Java once didn't use natively", and wow, that part feels old now.


You act like Java isn't copying what every other language has had for years or decades. Green threads are not threads; they're just an async system in the existing runtime, sharing one or more real threads as workers. They do that because creating real threads requires OS features like permission checking, memory management, CPU scheduling, etc. Pick the right abstraction for the work.


Also, green/virtual threads are overrated, IMHO. Just make system threads, it's not nearly as bad as people think, and it is a lot easier. You will likely want to cap max stack size (so that unintentional large stacks cause crashes instead of using up unreasonable amounts of memory) and perhaps tune a few other things.


The big benefit, I think, is in libraries. I don't want to deal with 10 different ways of doing async in 10 different libraries, nor think about how to coordinate between them.


Is every other language Go and Elixir?

Also, there are many different characteristics of any such system: M:N is not the impressive part at all.


Seeing what Java virtual threads can do raises the question: are OS threads done the wrong way? Why can't the OS do what Java is doing?


The significant difference between virtual and real threads is the stack. Real threads have to support programs written in languages which assume that the stack is a large, contiguous block of memory which they can use without cooperation with the operating system. In practice, that means a large slice of address space has to be allocated for each thread, along with page table entries, and at least a page of physical memory. Whereas virtual threads are deeply integrated with the compiler and runtime which uses them, and don't need to support such programs, meaning they can do more subtle things, with growable stacks, segmented stacks, moving frames off and on the platform thread stack, etc.

I suppose you could implement these more subtle things with real threads, but it would require a richer interface between the program and the operating system, which would need to be supported in compilers and language runtimes. FreeBSD did something like this with kernel scheduled entities, but it never caught on.

As an aside, i wouldn't call virtual threads cooperatively scheduled. To me, that means that user code has to be written with scheduling in mind, to make sure that it hits yield points where the thread can enter the scheduler. Virtual threads don't require that, because the JVM can ensure the necessary entries to the scheduler, since it controls interpretation and compilation. That said, i believe that in the current implementation of virtual threads on the JVM, it is possible that a tight loop that doesn't call any methods will not be rescheduled; i'm hazy about the details of this, because surely this interacts with safepoints. But maybe it's only very slightly cooperatively scheduled!


Just thinking aloud here... So functions already (typically) have a bit of prologue for setting up the stackframe, where they juggle rsp and rbp etc. I wonder if you could change the convention for that into something that basically does RequestStackFrame(SIZE) for some known SIZE. RequestStackFrame could then either grow the stack linearly or get memory in some entirely different region depending on what is needed.

Though doing such a function call for every tiny little function is a bit excessive of course, I'm thinking maybe the compiler should track for every function how much stack it will need in the worst case (usage for the function itself + max of usage for the functions it calls). Functions that use memory below some threshold just grow linearly, while those that use above it will go to the trouble of requesting memory. Now this idea assumes that function calls happen in a DAG, it breaks if functions can call themselves or each other in a cycle. But that's actually a good realization maybe. Like a recursive function should perhaps always explicitly request stack space. Security and robustness benefits!

What about libraries? Ideally library functions would be annotated with how much stack space they need (as a side note, I have long thought that C headers are inadequate for specifying how a function should be called; not annotating stack space requirements is one deficiency here). Absent such annotations, I suppose we need to go with the status quo of making some guess.


Roughly what you're discussing has been implemented in gcc (behind a flag) for a long time: https://gcc.gnu.org/wiki/SplitStacks


Interesting! It does seem to be missing the part about tracking stack usage at compile time though, unless I'm missing something. The more I think about it, the more that seems like the key part, and dynamically growing the stack being more of a fallback thing.


> That said, i believe that in the current implementation of virtual threads on the JVM, it is possible that a tight loop that doesn't call any methods will not be rescheduled;

Right now, virtual threads that don't enter any blocking code won't be descheduled by the JVM.


Oh, I thought they could also be scheduled at safepoints? Is that not in there yet, or not going in now?


That's the plan, but it's not there yet.


Virtual threads, or green threads, use a cooperative threading/coroutine model on top of the preemptive threads provided by the OS.

In addition to Java virtual threads, similar systems exist in Go goroutines, Rust async and Haskell green threads for example.

There are upsides and downsides to cooperative userspace threads. An erratic or malicious green thread can hog all the CPU time and starve other threads for example.

An operating system cannot trust userspace to behave; starvation due to a misbehaving thread is unacceptable.

So the OS cannot implement green threads alone; it needs userspace cooperation for the switching.


In Erlang you have preemptive green threads. They don't block* and having a million on one medium server is not unheard of.

* I managed to block a whole Erlang VM one day; I spoke with Joe Armstrong on chat, but they won't fix this, as it's a pretty niche use case.


I'm probably not the only one curious to hear more about this :)


Make an ordered_bag ets table and try to insert several thousand records in one ets call. It will block the whole VM (spinning one core at 100%, no matter how many you have) for several seconds. It does this because an ordered bag needs to find each key in the table as it inserts it, resulting in n^2 complexity, and this needs to be done atomically. So for 1,000 keys you have 1,000,000 comparisons during a VM-wide lock. Solution: don't insert so many records in one call into an ets table.


I’m more curious to understand how userspace threads can be preemptive. :)

I can think of a few ways:

- The VM just doesn’t JIT and can decide to stop executing a thread by just not interpreting the next piece of bytecode and switching to another green thread instead (this would be pretty slow due to the lack of JIT)

- The VM JITs, but inserts a preamble before every function call saying “Before executing this function, should I switch to another green thread first?”, and thus, so long as you call functions frequently enough, you “preempt” yourself. This is how Go does it, and it’s a well known thing in Go that if you never call a function for a while (like just doing a really huge for loop), the current goroutine doesn’t yield execution and hogs the whole OS thread.


> - The VM JITs, but inserts a preamble before every function call saying “Before executing this function, should I switch to another green thread first?”, and thus, so long as you call functions frequently enough, you “preempt” yourself. This is how Go does it, and it’s a well known thing in Go that if you never call a function for a while (like just doing a really huge for loop), the current goroutine doesn’t yield execution and hogs the whole OS thread.

I'm not sure of the implementation details, but this hasn't been true for a while in Go. As of Go 1.14, goroutines are asynchronously preemptible, so loops without function calls no longer deadlock the scheduler or GC: https://go.dev/doc/go1.14#runtime


The docs there hint at how it's done in Go and how it could be done in Erlang: the runtime monitors how long a given goroutine has been running without yielding to the scheduler, and uses a signal handler to interrupt code that has exceeded a 10ms quota of continuous usage.

https://medium.com/a-journey-with-go/go-asynchronous-preempt...


Interesting! My info is way out of date then. Thanks for the link.


In principle you can also set a timer that raises a signal and then switch to a different coroutine when it fires. But it is a big can of worms and won't perform great.


IIRC the Erlang VM does the second (schedule on function call). Since Erlang is a functional language and loops are done via recursion, this works out fine.


Yes, on every function call; some internal functions implemented in native code are also instrumented with those checks. On average a thread is switched out after about 2,000 "reductions", as they are called. There's also an optimization where, if you send something to another thread and you are waiting for a reply, the VM switches instantly to the other thread, which makes some message sends equivalent to a simple function call.


All else being equal, the one thing Java can do that the OS can’t is avoid paying for context switches. There is a cost to crossing the user space/kernel space boundary that you can avoid with user space lightweight threads.


Both of course pay something, but virtual threads may pay less.

Is this significant for virtual thread use cases? OS threads are cheap since there's no virtual memory context switch.


False. The context will need to be switched regardless of where you're doing it. (Userspace or kernel.)

The cost of crossing the userspace/kernel boundary happens on any system call and has nothing to do with context switching.

(Modern OSes typically context switch during system calls for I/O, exactly the same as userspace threads.)


The page tables change in a kernel context switch, whereas they don't in a user context switch. Depending on your CPU and OS, this may or may not matter; e.g., modern CPUs typically have caches that are tagged with the context (e.g. PCID on Intel), but the Spectre flavor of the day may require that you simply flush the entire L1 cache on context switch anyway if you're unlucky.

The biggest cost of a context switch, though, is typically that your CPU goes to do something else, which reduces the efficiency of any caches (including branch prediction etc.). This is the same whether you have kernel or userspace context switches, assuming the threads you're switching between touch different code and/or data.


Well, page tables don't change on a thread context switch, which is what's relevant for this discussion.


Why do page tables need to change in a context switch between two threads? AFAIK the kernel is mapped in every process, just not accessible in user mode.


Green threads can be switched between in a single time slice without ever doing a single syscall. This is trivially true: you can make a simple coroutine implementation in C that can switch back and forth between routines thousands of times in a single timeslice.


Technically true, but nobody ever does this because it is a massive waste of CPU cycles.

Both OS and "green" threads are meant to switch when doing i/o.


Userspace IO is a thing though.

Even with traditional kernel side IO, you might not want to attempt a syscall if the socket is known not to be ready.

Also when implementing purely userspace synchronization primitives, being able to swap to a different context purely in userspace is nice.


User/kernel space switch is a different kind of switch. Also when you do it cooperatively, context switches can be a bit more efficient as the runtime has better knowledge about the set of live registers at the switch point.


The "different kind of switch" you talk about doesn't exist on any OS of the past 30 years.


I'm talking about raising a process privilege level, via whatever syscall mechanism. Which is a completely different code path than a process/thread context switch and, among other things, doesn't require going through the scheduler.


Java's Virtual Threads (as any other green threads/fibers/etc) are scheduled cooperatively. Effectively a fiber is a tiny data structure which contains a closure which has to be executed next and some metadata.


They're not quite the same kind of thread. But some OSes do have support for lightweight threads, e.g. NT. Also, OSes today do handle as many threads as most applications ever need.


Let's go even further:

Why do we even waste resources on an OS when we can write e.g. a UEFI application, run it "directly", and let it manage resources without OS overhead?

Nowadays runtimes like Java's are doing more and more tasks that were previously managed by OSes.


Sure, unikernels are a thing.

But being able to run a full OS is convenient, and often still faster overall, as there are orders of magnitude more people working on optimizing general-purpose OSes than specialized unikernels.


By the time you got done writing device drivers, etc. you would just have another OS.

https://xkcd.com/927/

Also, "without OS overhead": things sitting in memory don't necessarily detract from performance, but not having hardware acceleration for lack of drivers certainly can.

The only overhead guaranteed during execution by the OS is context switching, and if you want to avoid that you’re basically writing an RTOS for an embedded system.


> 20+ years ago fits into the context ... they use a lot of memory

To be fair, including the stack frame and anything the kernel needs to track them, it is a lot of RAM... by the memory sizes of 20+ years ago. Today, even mid-grade consumer computers come with tens of GB of RAM.


And, thankfully, newer computers have virtual memory and page tables, which means that allocating a large stack only costs one page of physical memory until you use it.


When will we see more dynamically allocated stacks? I'm still getting stack overflow errors on small problems running on my 64GB system, in various languages. This feels like how strings are allocated in old languages (like COBOL?), just fixed sizes for everything.


Stacks dynamically grow, but there is usually a default upper bound (typically 8MB) to catch runaway recursion and other bugs before it starts trashing memory. If you really need 64GB stacks you can raise that bound. But do you really need huge stacks?


> Stacks dynamically grow

I think that stacks grow using pagefaults, so you might still run out of address space.

I.e., you can't have two programs running where (arbitrarily) one of them consumes (nearly) all of your memory on a system where address space size equals memory size.


This was also the case in 2002.


You can have lots of threads (they are just thread stacks holding execution context) but that doesn't mean they will run well. You also need lots of execution units on which they can run. Even then, if any thread can sleep or wake up at any random time, you will have a lot of jitter in your thread response latency. So such a large number of threads works against efficient, predictable performance. It is not about that at all; it is about programming style/habits.


You don't need more execution units than for green threads. Typically in these zillion-thread use cases, threads wake up for IO, queue activity, etc., and are not spinning doing CPU-bound computation, so it's not a hard problem for the OS thread scheduler.


Just a few weeks ago I did a one-million-process spawn demo in Elixir...

https://www.youtube.com/watch?v=yxyYKnashR0


Your point about the terminal being the bottleneck is interesting btw. I recently did some large imports into postgres in .net, and was running in debug mode, which had many queries logged to the debug console. Removing that logging led to about a 10,000x speedup. I wonder how much terminal output like this costs in terms of lost efficiency :).


Today (but not in 2002), if I needed a server with 100k concurrent activities, I'd look to async frameworks: futures/callbacks but a small number of threads. Akka, reactive streams, etc.

Is a huge number of threads as useful today as in 2002?


If you could easily spin up hundreds of thousands of threads you wouldn't need any of those frameworks, callbacks or userland threads. You could use the OS level abstraction and not have your language/framework implement its own different abstraction.


There is a certain amount of Stockholm syndrome built into the current acceptance of async programming paradigms.

Linear code is far easier to reason about, but that is easy to forget once someone has become accustomed to thinking in async constructs.

The thing is, there are only so many things a person can mentally juggle. Removing the consideration of a whole dimension of issues means the developer can take on more useful complexity.


Naive threads and async are for fundamentally different tasks:

Threads are for working in parallel, splitting a single compute-intensive task into many.

Async is for waiting in parallel for disk I/O results, database queries, network responses, etc.

The former are for doing more work. The latter is for efficiently waiting for others to do their work without using a bunch of excess thread context resources.

You don't reduce complexity by ignoring the difference between two quite distinct use cases. That's false simplicity.


> without using a bunch of excess thread context resources

If threads are cheap, there is no point in making this distinction, which greatly simplifies the language. And they can be; we just need programming languages designed for that.


Some people, when confronted with a problem, think, “I know, I’ll use threads,” and then two they hav erpoblesms. – Ned Batchelder


Hear, hear. The elephant in the room in just about any discussion about threads/async is Python. Threads simply don't give you parallelism in Python because of the GIL.

Instead of accepting this fact and using another language if threads are needed, Python has been doing various async-style things for years and pretending it's good enough, including what can only be described as cooperative multitasking, that horror from the early 90s.


Python is working on removing the GIL (I am not sure about the timeframe) and will have subinterpreters; the latter will come in 3.13 and helps with the thread problem.


Non-blocking I/O channels use fewer threads (native or green) than tasks. Some programming languages ARE designed to handle this: they use async/await.

You could argue that Go does this more elegantly with goroutines, but this further highlights the point that threads are not a silver bullet for development models.


The point is you don't need native threads. You can use lightweight (green) threads.

So if a green thread is waiting for I/O, another can use the CPU just fine.

Yes there's context switching but it does scale and code becomes simpler to write and reason about.


If I can make good use of hardware without resorting to async programming, I would prefer to. Green threads allow that.

For starters, debugging async stacktraces is a nightmare in many languages/runtimes.

There's also the cognitive load that async programming adds to humans.

And there's function coloring. It tends to spread and "infect" the codebase.

Also most languages have to duplicate their APIs to support both linear and async calls.

And then there's driver support. In languages I have seen, async support came with the requirement of specialized or at least adapted drivers.


A 2P EPYC is 96 cores x 2 threads per core x 2 sockets => 384 logical CPUs.

So 100k threads means ~260 threads per logical CPU, which is not so much then...


And with EPYC Genoa-X there's over 1GB of L3 cache per socket, so the 800MB of kernel stacks mentioned (probably larger now that we're talking 64-bit) can fit in cache.


And with 2P Bergamo it gets down to ~195 threads per logical CPU...



