Very good summary. I would only add that thread/fiber stacks typically reserve only 1-2MB of VM, and only a small amount (one or two pages) is committed initially. With 32-bit address spaces this is a minor distinction, because the size of the address space is what matters most, but with 64-bit address spaces it can mean that you aren't putting as much pressure on the VMM as one might think. If all of the threads in a process only touch the default two pages, and the system page size is 4K, then 10,000 threads only commit about 80MB of VM for their stacks. Of course, that's a lot of "ifs", and it completely ignores the effect of that many threads on the per-process VM mappings and CPU caches.
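A minimal sketch of that reserve-versus-commit point, assuming a typical Linux/glibc setup (compile with -pthread; the 2MB figure, the thread count, and how many pages each idle thread actually touches are assumptions, and you may need to raise process/thread limits to run it):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 10000

    /* Each thread just parks itself, touching only a little of its stack. */
    static void *worker(void *arg)
    {
        (void)arg;
        pause();
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 2 * 1024 * 1024);  /* reserve 2MB of VM per stack */

        for (int i = 0; i < NTHREADS; i++) {
            pthread_t t;
            if (pthread_create(&t, &attr, worker, NULL) != 0) {
                perror("pthread_create");
                break;
            }
            pthread_detach(t);
        }

        /* VmSize now includes ~20GB of stack reservations, but VmRSS only grows
           by the few pages each thread actually touched. */
        printf("inspect /proc/%d/status (VmSize vs VmRSS)\n", (int)getpid());
        pause();
        return 0;
    }

The reservations show up in VmSize; only the touched pages show up in VmRSS, which is the gap the comment above is describing.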
I keep getting comments about this, but it's not that simple. First, unless you're overcommitting, those stacks consume more resources than one might think at first blush. Second, the apps I am familiar with (on the Java side) get pretty deep stacks (you can tell from the stack traces in exceptions!). Third, if you want millions of stacks in a 64-bit address space, I suspect you'll want larger pages once much more than a handful of 4KB pages per stack gets used (otherwise the page tables get large), and then you're consuming large amounts of RAM anyway. I may be wrong about the third point, but I suspect the depth to which the stacks are actually used makes a big difference.
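To put rough numbers on the third point (back-of-the-envelope only, assuming x86-64 4-level paging, where each leaf page-table page is 4KB and maps 2MB of address space, and assuming the stacks are separate mappings that don't share leaf tables):

    1,000,000 stacks, 4KB pages: at least one 4KB leaf page table each  -> roughly 4GB of page tables
    1,000,000 stacks, 2MB pages: no leaf tables, but 2MB committed each -> roughly 2TB of RAM

Either way, a million deep-ish stacks is expensive; the only cheap case is the one where they stay nearly empty.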
The point is that thread-per-client is never going to scale better than C10K no matter what, and that the work we see in this space is all about finding the right balance between the ease of programming of thread-per-client and the efficiency of callback hell (the typical C10K style).
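To make the ease-of-programming side concrete, here's roughly what thread-per-client looks like (a toy echo server; the port and the lack of error handling are just for illustration):

    /* thread-per-client: one blocking thread, and one stack, per connection */
    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    static void *serve(void *arg)
    {
        int fd = (int)(intptr_t)arg;
        char buf[512];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)   /* blocks; the stack is parked here */
            write(fd, buf, n);
        close(fd);
        return NULL;
    }

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(7777),
                                    .sin_addr.s_addr = htonl(INADDR_ANY) };
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 128);

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            pthread_t t;
            pthread_create(&t, NULL, serve, (void *)(intptr_t)cfd);  /* a new stack per client */
            pthread_detach(t);
        }
    }

Every client gets a stack to park its control flow on, which is exactly what the memory discussion above is about.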
On any computer larger than tiny, C10K designs will wring out orders of magnitude better scaling than typical thread-per-client apps, no matter how much you optimize the hardware for the latter.
Thread-per-client can only land in the same ballpark as C10K when the stack space actually used is tiny and comparable to the explicit per-connection state the C10K alternative keeps, and even then the additional kernel resources consumed by those per-client threads will dwarf what the C10K case (a thread per CPU) needs, adding cache pressure. In practice, thread-per-client is never that efficient, because the whole point of it is that it makes it trivial to employ layered libraries of synchronous APIs.
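For contrast, here is the same toy as a C10K-style event loop, sketched with epoll (Linux-specific, single thread, error handling elided); any per-connection state has to live outside of a stack:

    /* C10K-style: one event loop (per CPU), explicit state instead of stacks */
    #include <unistd.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(7777),
                                    .sin_addr.s_addr = htonl(INADDR_ANY) };
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 128);

        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
        epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

        struct epoll_event events[64];
        for (;;) {
            int n = epoll_wait(ep, events, 64, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == lfd) {                       /* new connection: register it */
                    int cfd = accept(lfd, NULL, NULL);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                    epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
                } else {                               /* readable: echo or close */
                    char buf[512];
                    ssize_t r = read(fd, buf, sizeof(buf));
                    if (r <= 0)
                        close(fd);
                    else
                        write(fd, buf, r);
                }
            }
        }
    }

Even in this toy form, anything beyond the fd has to be carried through the loop by hand, which is where the callback hell starts once the per-connection logic gets more involved than an echo.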
Now, I'm not saying that the alternatives to thread-per-client are easy, but we should be doing a lot better at building new libraries that make async I/O easier.