Very good summary. I would only add that thread/fiber stacks typically reserve only 1-2MB of VM, and only a small amount (one or two pages) is committed initially. With 32-bit address spaces this is a minor distinction, because the size of the address space is what matters most, but with 64-bit address spaces it can mean that you aren't putting as much pressure on the VMM as one might think. If all of the threads in a process only touch the default two pages, and the system page size is 4K, then 10,000 threads only commit about 80MB of VM for their stacks. Of course, that's a lot of "ifs", and it completely ignores the effect of that many threads on the per-process VM mappings and CPU caches.
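A minimal sketch of that reserve-versus-commit point, assuming a typical Linux/glibc setup (compile with -pthread; the 2MB figure, the thread count, and how many pages each idle thread actually touches are assumptions, and you may need to raise process/thread limits to run it):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NTHREADS 10000

    /* Each thread just parks itself, touching only a little of its stack. */
    static void *worker(void *arg)
    {
        (void)arg;
        pause();
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 2 * 1024 * 1024);  /* reserve 2MB of VM per stack */

        for (int i = 0; i < NTHREADS; i++) {
            pthread_t t;
            if (pthread_create(&t, &attr, worker, NULL) != 0) {
                perror("pthread_create");
                break;
            }
            pthread_detach(t);
        }

        /* VmSize now includes ~20GB of stack reservations, but VmRSS only grows
           by the few pages each thread actually touched. */
        printf("inspect /proc/%d/status (VmSize vs VmRSS)\n", (int)getpid());
        pause();
        return 0;
    }

The reservations show up in VmSize; only the touched pages show up in VmRSS, which is the gap the comment above is describing.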
I keep getting comments about this, but it's not that simple. First, unless you're overcommitting, those stacks consume more resources than one might think at first blush. Second, the apps I am familiar with (on the Java side) get pretty deep stacks (you can tell from the stack traces in exceptions!). Third, if you want millions of stacks in a 64-bit address space, I suspect you'll want larger pages once much more than a handful of 4KB pages per stack gets used (otherwise the page tables get large), and then you're consuming large amounts of RAM anyway. I may be wrong about the third point, but I suspect the depth to which the stacks are actually used makes a big difference.
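To put rough numbers on the third point (back-of-the-envelope only, assuming x86-64 4-level paging, where each leaf page-table page is 4KB and maps 2MB of address space, and assuming the stacks are separate mappings that don't share leaf tables):

    1,000,000 stacks, 4KB pages: at least one 4KB leaf page table each  -> roughly 4GB of page tables
    1,000,000 stacks, 2MB pages: no leaf tables, but 2MB committed each -> roughly 2TB of RAM

Either way, a million deep-ish stacks is expensive; the only cheap case is the one where they stay nearly empty.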
The point is that thread-per-client is never going to scale better than C10K no matter what, and that the work we see in this space is all about finding the right balance between the ease of programming of thread-per-client and the efficiency of callback hell (the typical C10K style).
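To make the ease-of-programming side concrete, here's roughly what thread-per-client looks like (a toy echo server; the port and the lack of error handling are just for illustration):

    /* thread-per-client: one blocking thread, and one stack, per connection */
    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    static void *serve(void *arg)
    {
        int fd = (int)(intptr_t)arg;
        char buf[512];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)   /* blocks; the stack is parked here */
            write(fd, buf, n);
        close(fd);
        return NULL;
    }

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(7777),
                                    .sin_addr.s_addr = htonl(INADDR_ANY) };
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 128);

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            pthread_t t;
            pthread_create(&t, NULL, serve, (void *)(intptr_t)cfd);  /* a new stack per client */
            pthread_detach(t);
        }
    }

Every client gets a stack to park its control flow on, which is exactly what the memory discussion above is about.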
On any computer larger than tiny, C10K designs will wring out orders of magnitude better scaling than typical thread-per-client apps, no matter how much you optimize the hardware for the latter.
Thread-per-client can only land in the same ballpark as C10K when the stack space actually used is tiny and comparable to the explicit per-connection state the C10K alternative keeps, and even then the additional kernel resources consumed by those per-client threads will dwarf what the C10K case (a thread per CPU) needs, adding cache pressure. In practice, thread-per-client is never that efficient, because the whole point of it is that it makes it trivial to employ layered libraries of synchronous APIs.
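For contrast, here is the same toy as a C10K-style event loop, sketched with epoll (Linux-specific, single thread, error handling elided); any per-connection state has to live outside of a stack:

    /* C10K-style: one event loop (per CPU), explicit state instead of stacks */
    #include <unistd.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(7777),
                                    .sin_addr.s_addr = htonl(INADDR_ANY) };
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 128);

        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
        epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

        struct epoll_event events[64];
        for (;;) {
            int n = epoll_wait(ep, events, 64, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == lfd) {                       /* new connection: register it */
                    int cfd = accept(lfd, NULL, NULL);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                    epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
                } else {                               /* readable: echo or close */
                    char buf[512];
                    ssize_t r = read(fd, buf, sizeof(buf));
                    if (r <= 0)
                        close(fd);
                    else
                        write(fd, buf, r);
                }
            }
        }
    }

Even in this toy form, anything beyond the fd has to be carried through the loop by hand, which is where the callback hell starts once the per-connection logic gets more involved than an echo.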
Now, I'm not saying that the alternatives to thread-per-client are easy, but we should be doing a lot better at building new libraries that make async I/O easier.