I wonder to what extent this is a statistical artifact
rather than a genuine improvement. For example, if you
measure performance as the mean time to serve each kilobyte,
then the proposed scheduler does not increase performance.
That's true, but what was actually measured was the mean time to serve the entire request, and in nearly any usage pattern that's the data point you'd care about most. You want users browsing your pages to get pages and small images very quickly; waiting a couple of seconds for larger images to finish, or minutes for large file downloads, is exactly the behavior a user would expect.
What this scheduling algorithm avoids is the case where several people are downloading large files and other more casual users are having a worse browsing experience because of that.
That's definitely part of it: they are heavyweight threads or processes (no matter how you configure it), rather than lightweight Erlang processes. I would imagine the bigger problem here, however, is context switching. Because Apache (and most other servers) use OS threading or forking, the entire state of the current process (say) has to be put aside and then restored later, maybe after only a few milliseconds, because the OS isn't going to presume anything about the state of that process. In Erlang's concurrency model, the scheduler simply runs some fixed budget of function calls (20 or so, I think; anyway, some static number) for one process and then moves on to the next.
Because the functions are, uh, functional, there is almost no state to save and restore at a context switch. And because things like iteration are done with recursion, the count-the-calls-then-switch approach never gets hung up inside some loop somewhere. The faster the OS tries to keep switching between processes, by contrast, the more overhead it introduces.
BTW, this is why, in my opinion, you don't see a lot of other systems duplicating this sort of concurrency model: some language "limitations" (recursion only, for example) are exactly what make it possible.
Most apache, erlang, or kernel hackers would know way more about this than I do, of course.
In a good OS (I believe Linux qualifies), you only need to save the contents of the registers. Each process has its own address space, each thread has its own stack, and so there's no need to muck around with memory beyond that. Stacks aren't actually copied around; the processor simply restores %esp and %ebp from the saved process data structure.
If I remember my OS design course correctly, the big performance hit is the switch from user mode to kernel mode. I'm not sure *why* that's a big hit, but it seems to be a slow operation on most processors.
You can use user-mode threading libraries in C/C++, but Apache doesn't. Perhaps that's why it's slow. (The main reason it doesn't is probably that user-mode threading blocks the whole process when one thread performs IO, which obviously doesn't work well in an I/O bound application like a webserver.)
There are other C/C++ webservers - like Lighttpd - that use poll/epoll for I/O. These should run even faster than YAWS/Erlang; anyone have any benchmarks to compare them?
It's not so clear to me that a PhD is much use at a start-up
in general; you'd probably learn more relevant technical
skills at a regular programming job, and the mental habits
you're likely to learn in graduate school will actively hurt
you at a start-up. Unless you are lucky or careful about who
you work with and on what.