Well, I didn't know it, so I tried it on my laptop, but sadly my observations don't reflect that claim. Injecting with 2000 concurrent connections results in a very fluctuating load, oscillating between 2000 and 31000 requests/s (column ^h/s) for a small object like mime.types :
And after that it totally stops responding until I restart it. On a 404 it's more like 37000 requests/s. This is with 2 threads.
I looked at the code and saw a select() in use, which limits the number of concurrent connections to ~512 on recent libcs (1 fd for the socket, 1 fd for the file, 1024 fds total with the default FD_SETSIZE). Reducing the number of concurrent connections seemed to help a bit (it delayed the hang a little).
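For reference, here is a minimal illustration (generic code, not taken from filed) of why select() caps the number of usable file descriptors at FD_SETSIZE:

/* Illustration only: why select() caps concurrency. FD_SETSIZE is
 * 1024 on glibc, and fd numbers grow with every open connection and
 * file, so at roughly 2 fds per transfer you hit the wall around
 * ~512 concurrent transfers. */
#include <stdio.h>
#include <sys/select.h>

static int watch_fd(fd_set *set, int fd)
{
    if (fd >= FD_SETSIZE) {
        /* FD_SET() on such an fd is undefined behaviour: it writes
         * past the end of the fd_set bitmap. poll()/epoll don't have
         * this limit. */
        fprintf(stderr, "fd %d >= FD_SETSIZE (%d), cannot select() on it\n",
                fd, FD_SETSIZE);
        return -1;
    }
    FD_SET(fd, set);
    return 0;
}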
Also, it doesn't seem to support keep-alive, so we can't get more performance on the client side by reducing the kernel-side work in the TCP stack.
I'm getting the same level of performance out of a single process on thttpd.
Doing the same test with haproxy gives me 72000 requests/s in close mode while delivering a small object with the errorfile trick, at only 80% CPU (my load generator reached its limit), and it reaches 88k requests/s using the cache. So I guess there's still room for improvement, since it's possible to achieve twice the performance with 80% of a single thread on the same machine. I don't know how other servers compare, but I definitely think that some might get better results.
select() should definitely never get used -- it's only even compiled into the executable if you enable the non-default "FILED_NONBLOCK_HTTP" option, which you almost certainly don't want. Did you compile filed that way? It sounds like you grep'd the source for select() but didn't notice that it was #ifdef'd out.
The number of concurrent connections is limited by your resource limits and the fact that file descriptors are cached. You can tune the cache, but if you do you should also raise your resource limits; this is documented in the man page available here: http://filed.rkeene.org/fossil/home?name=Manual
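For illustration, a generic sketch (not filed's code; the function name is mine) of raising the per-process fd limit from within a program before opening many connections and cached files:

/* Generic sketch: bump the soft RLIMIT_NOFILE up to the hard limit. */
#include <stdio.h>
#include <sys/resource.h>

static int raise_nofile_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;   /* soft limit up to the hard limit */

    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    printf("RLIMIT_NOFILE now %llu\n", (unsigned long long)rl.rlim_cur);
    return 0;
}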
Using only 2 threads (way fewer than the default) is also very sub-optimal, since most of those threads will be waiting for the kernel to deliver the file. On average filed makes about 2 system calls before asking the kernel to send the file to the client -- one read() of the HTTP request and one write() of the HTTP response header, followed by the sendfile() of the contents. There's no reason to limit yourself to 2 threads other than to limit I/O, since additional threads do not significantly increase CPU utilization.
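As a rough sketch of that per-request sequence (simplified, not filed's actual code; error handling and request parsing are glossed over):

/* Simplified sketch of the pattern described above: one read() for
 * the request, one write() for the response header, then sendfile()
 * hands the body to the kernel. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/sendfile.h>
#include <unistd.h>

static void serve_one_request(int client_fd, int file_fd, off_t file_size)
{
    char req[8192];
    char hdr[256];
    off_t off = 0;

    /* 1. read the HTTP request (assume it arrives in one read) */
    if (read(client_fd, req, sizeof(req)) <= 0)
        return;

    /* 2. write the HTTP response header */
    int hdr_len = snprintf(hdr, sizeof(hdr),
                           "HTTP/1.1 200 OK\r\n"
                           "Content-Length: %lld\r\n"
                           "Connection: close\r\n\r\n",
                           (long long)file_size);
    if (write(client_fd, hdr, hdr_len) != hdr_len)
        return;

    /* 3. let the kernel push the file contents; blocks until sent */
    while (off < file_size)
        if (sendfile(client_fd, file_fd, &off, file_size - off) <= 0)
            break;
}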
Also, thanks for the link to the code -- I found the issue with keep-alive (in fact there are two such issues). The first one is that you're looking for the Connection header with the Keep-Alive value, but that is only needed for HTTP/1.0. In HTTP/1.1 keep-alive is the default, so you will not always get the header; it depends on the browsers, proxies, etc. The second is that checking for headers and values this way is very unreliable, because the Connection header can contain other tokens, like "TE" to indicate that the TE request header field is present and must not be forwarded to the next hop. In that case the Connection header tokens are delimited by commas and you don't know whether your check will still match. One could argue that all these cases are not very common in the field, but it's the difference between being spec-compliant and thus interoperable, and working most of the time :-)
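To make it concrete, here's a rough sketch (not meant as a drop-in for filed) of a spec-compliant keep-alive decision, treating Connection as a comma-separated, case-insensitive token list:

/* HTTP/1.1 defaults to persistent unless "close" is present;
 * HTTP/1.0 is persistent only if "keep-alive" is present. */
#include <string.h>
#include <strings.h>

static int conn_has_token(const char *conn_hdr, const char *token)
{
    const char *p = conn_hdr;
    size_t tlen = strlen(token);

    while (p && *p) {
        /* skip delimiters and optional whitespace */
        while (*p == ' ' || *p == '\t' || *p == ',')
            p++;
        const char *end = p;
        while (*end && *end != ',' && *end != ' ' && *end != '\t')
            end++;
        if ((size_t)(end - p) == tlen && strncasecmp(p, token, tlen) == 0)
            return 1;
        p = strchr(end, ',');
    }
    return 0;
}

/* http_minor: 0 for HTTP/1.0, 1 for HTTP/1.1; conn_hdr may be NULL. */
static int want_keepalive(int http_minor, const char *conn_hdr)
{
    if (http_minor >= 1)
        return !(conn_hdr && conn_has_token(conn_hdr, "close"));
    return conn_hdr && conn_has_token(conn_hdr, "keep-alive");
}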
So I could inject in keep-alive mode at 100 concurrent connections, but it was very slow (2200 req/s at 12% CPU), idling on I-don't-know-what. And with 1000 threads (still 100 concurrent connections), it immediately ate all my memory and got killed by the OOM killer :
Out of memory: Kill process 7134 (filed) score 779 or sacrifice child
Killed process 7134 (filed) total-vm:7680976kB, anon-rss:7140908kB, file-rss:1988kB, shmem-rss:100kB
oom_reaper: reaped process 7134 (filed), now anon-rss:7143588kB, file-rss:1980kB, shmem-rss:100kB
7 GB for 100 connections is a bit excessive in my opinion :-)
There are definitely still a number of issues to be addressed before it can be used in production; you need a more robust architecture and request parser first.
So you need as many threads as the number of files you're delivering in parallel? That reminds me of the late 90s, when everyone started to discover how easy it was to deal with blocking I/O using threads, until they discovered that threads become very slow because you suffer from context switches all the time. As a rule of thumb, you should never run more threads than you have CPU cores, or you'll get a taste of those context switches.
And it's really hard to scale with a model requiring one thread per connection. You'll hardly sustain one million concurrent connections this way, and that can definitely happen with large objects.
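For contrast, a minimal sketch (generic code, not haproxy's) of the event-driven alternative, where a single thread multiplexes all connections with epoll instead of dedicating a blocking thread to each of them:

#define _GNU_SOURCE          /* for accept4() */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 256

/* Hypothetical application-specific handler: a real server would parse
 * the request here and send the response with non-blocking I/O. */
static void handle_request(int fd, int ep)
{
    char buf[4096];
    if (read(fd, buf, sizeof(buf)) <= 0) {
        epoll_ctl(ep, EPOLL_CTL_DEL, fd, NULL);
        close(fd);
    }
}

static void event_loop(int listen_fd)
{
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    struct epoll_event events[MAX_EVENTS];

    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        int n = epoll_wait(ep, events, MAX_EVENTS, -1);

        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;

            if (fd == listen_fd) {
                /* new connection: register it, don't spawn a thread */
                int conn = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
                if (conn >= 0) {
                    struct epoll_event cev = { .events = EPOLLIN,
                                               .data.fd = conn };
                    epoll_ctl(ep, EPOLL_CTL_ADD, conn, &cev);
                }
            } else {
                /* readable connection: serve it without blocking */
                handle_request(fd, ep);
            }
        }
    }
}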
As you're the author and you're promoting your own work, may I ask what sort of testing supports this claim?