It's great to see companies sharing what they've learned from experience about system architecture. This sort of thing is often very difficult to plan out; the only real way to get it right is through experimentation. First-hand descriptions like this are a great resource if you're setting something up for the first time.
I like that they posted their unicorn config file too!
When the Unicorn master starts, it loads our app into memory. As soon as it’s ready to serve requests it forks 16 workers. Those workers then select() on the socket, only serving requests they’re capable of handling. In this way the kernel handles the load balancing for us.
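For anyone who hasn't seen the pattern, here's a minimal sketch of a preforking accept loop in plain Ruby -- just the shape of the thing, not Unicorn's actual code:

    require "socket"

    server = TCPServer.new(8080)     # one listening socket, opened before forking
    # (a real master would load the app into memory here)

    16.times do
      fork do                        # each worker inherits the same socket
        loop do
          IO.select([server])        # sleep until a connection is ready
          begin
            client = server.accept_nonblock
          rescue IO::WaitReadable
            next                     # another worker won the race; go back to sleep
          end
          client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
          client.close
        end
      end
    end

    Process.waitall                  # the master just supervises its workers

The kernel decides which sleeping worker gets woken for each connection, which is the "kernel handles the load balancing" part.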
Wasn't there a heated debate here just the other day about the prefork model?
Has any mainstream web server ever been pure demand-forked? Apache has been connection pooled since the '90s; the second edition of Unix Network Programming used it as a case study.
Exactly. Hence the "pre" in prefork. A "forking" web server (one fork per request) would be incredibly inefficient, whereas prefork is only... slightly inefficient. :-)
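For contrast, a truly demand-forked server looks roughly like this (a sketch -- a fork() paid on every single request, which is the thing the "pre" avoids):

    require "socket"

    server = TCPServer.new(8080)
    loop do
      client = server.accept
      pid = fork do                  # one brand-new process per request
        client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        client.close
      end
      client.close                   # parent drops its copy of the connection
      Process.detach(pid)            # reap the child in the background
    end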
Ho ho ho. Back in the day, people used to run httpd off of inetd. I was real smug back then when I replaced that at our site with NCSA httpd, which did indeed fork on demand. Think this was '93 or thereabouts.
NCSA httpd was released at the end of '93. Apache was the first mainstream server to prefork a connection pool; it's not demand-forked. There wasn't much advantage to running NCSA standalone vs. out of inetd.
I believe Apache was '95; it came out when I was working at an ISP, and I graduated from high school in '94.
Except that Apache (and Passenger) don't just start up N worker processes - they can dynamically allocate them as the need arises. Perhaps this is less important for a site like github, which can plan exactly how many to use, but if you have more than one site on the same server, it's really convenient, and, IMO, makes Passenger the best choice for deploying Rails for most people.
So it appears as if this setup replicates a small portion of what Apache does without all the bells and whistles.
The argument encompassed two aspects of the model:
1) "Threads are out -- processes are better than threads."
2) Process-per-connection architectures.
The first is demonstrably false -- for instance, look at Erlang, which maps lightweight Erlang processes to operating system threads, providing SMP scalability at a low cost without running into Github's issues with mongrel "thread-killing". For something more broadly used, look at Servlets and the Servlet 3.0 support for async, comet-style, event-based request handling. Each request is handled on a thread as necessary. Inter-thread communication (where necessary) is cheap, and this scales just fine.
If you use the fork() model and a long-running request blocks the entire process, comet is basically a non-starter. This is why people are interested (and implement) lightweight threads, coroutines, and restartable request implementations.
For a conceptual challenge, consider how you would implement a live web chat system that can scale up to a considerable number of clients, with extremely low resource usage and instant message distribution (no polling). Locally, we implemented this with async servlet support and M:N scheduled Scala actors -- a blocking HTTP request doesn't hold a thread or process hostage in the web server or the application, and we can scale up to an enormous number of live clients on one machine.
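As a rough Ruby analogue of the same idea (a sketch, not our implementation), here's a minimal EventMachine broadcast chat -- every connection is just an object in one event loop, so an idle client holds neither a thread nor a process:

    require "eventmachine"

    module Chat
      @clients = []
      def self.clients; @clients; end

      def post_init                  # client connected
        Chat.clients << self
      end

      def receive_data(data)         # push to everyone immediately -- no polling
        Chat.clients.each { |c| c.send_data(data) }
      end

      def unbind                     # client disconnected
        Chat.clients.delete(self)
      end
    end

    EM.run { EM.start_server("0.0.0.0", 9000, Chat) }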
The second was merely a lack of understanding of the model. In implementations where fork() is used as an alternative to threads and multiple connections are not handled per sub-process, you quickly run into scaling issues with subprocess memory utilization. In cases where subprocesses handle multiple connections via an event mechanism, you're just using fork() instead of threads.
I'm wondering how many servers github is using to keep up with load -- If each 8 core 16 gigabyte (!!!) server can actually only handle 16 concurrent requests via a pool of 16 workers, that's an incredibly poor (and expensive) scaling model.
Each box can handle plenty more than 16 workers, but we currently don't need that many, so it's pointless to run them. Our frontend machines generally sit about 60% idle. We wanted to have plenty of headroom on our new setup.
Also, our Ruby application code spends very little time blocking on external resources (compared to time spent in Ruby execution), so having async execution on these wouldn't give us that much more efficiency. It's true that running Ruby code is generally more expensive (dollar-wise) per given task than other languages, but we made a conscious decision to make that tradeoff for the productivity gains that Ruby affords us. We use Erlang and EventMachine in other pieces of the architecture where they make sense.
The simple fact is that Unicorn works better for us (and our specific use case) than anything else we've tried, and so we're using it.
> Each box can handle plenty more than 16 workers, but we currently don't need that many, so it's pointless to run them. Our frontend machines generally sit about 60% idle. We wanted to have plenty of headroom on our new setup.
At those numbers, it sounds like you'd max out at roughly 32 workers (so 32 concurrent requests), or am I missing something important?
> It's true that running Ruby code is generally more expensive (dollar-wise) per given task than other languages, but we made a conscious decision to make that tradeoff for the productivity gains that Ruby affords us.
It's really easy to measure performance, but productivity claims seem difficult to support and highly susceptible to confirmation bias.
I think that Ruby, Scala, and Clojure (just for instance) would stack up against each other very well in terms of productivity, but I wouldn't know how to prove it.
The github developers are great ruby developers, and I'd be very surprised if they would be as productive in the short term in a different language. I do not see a business case for switching platforms for them.
You seem hung up on the number of concurrent requests. An 8 core machine can only do 8 things at once, no matter if you are using threads or processes.
> You seem hung up on the number of concurrent requests. An 8 core machine can only do 8 things at once, no matter if you are using threads or processes.
You can't assume that the threads are 100% CPU bound. If they were, you would have a point -- 8 cores, 8 CPU-bound processes, no processing time left over.
In reality, webapp threads will sleep -- waiting on the network, waiting on disk, waiting on the database. When they sleep, the processor has nothing to do. If the OS can schedule another thread while one sleeps, work can move forward. That is where event-based and hybrid thread/event-based architectures excel.
If you can run 500 threads to completion in 10ms, then you can serve 500 requests within 10ms.
If you can run 16 (or 32) processes to completion in 10ms, then you can only serve 16 (or 32) requests within 10ms.
I think we all understand the difference between threads and processes and where context switching fits in.
In Linux 2.6, there's not a lot of difference between switching between processes and threads.
From the github dude: "Our Ruby application code spends very little time blocking on external resources." That is why using threads in this particular case won't help.
> I think we all understand the difference between threads and processes and where context switching fits in.
You might, but I don't think that's the general case.
> From the github dude: "Our Ruby application code spends very little time blocking on external resources." That is why using threads in this particular case won't help.
It hits a database, loads content from memcached, and waits on HTTP requests to complete, does it not?
I'd be surprised if the usage modeled a pure event-driven sendfile() based static-only file server (such as lighttpd), for instance (where threads vs. processes is moot).
Moreover, a fair amount of effort has been expended on a complex architecture to push only a very specific type of request to Unicorn.
Sure, now that HN has decided that enough time has elapsed to allow me to reply. :-)
Actually, antonovka did an excellent job explaining it above. The most important aspect is that threads are "lighter weight" than processes. They use less memory and are quicker to context switch (usually). The result of this lighter weight is that you can spawn more threads than you could processes, on the same hardware. And when you're using one thread/process per connection, that means more concurrent connections on the same hardware. So, if Unicorn used its exact same architecture, but replaced the worker processes with worker threads, you could scale much better.
Secondly, haproxy is written in C, which generally means it's going to perform much better and use a lot less memory than a Ruby webserver. This translates, once again, to more output from the same amount of hardware.
That's why I was pretty surprised to see that github would choose both Ruby and pre-fork over C and threads (or in the case of haproxy, async, which scales even better).
I think you missed the point. The point was not that github stopped using haproxy, but that it no longer needs to use it.
Assuming haproxy can work with unix sockets, github could configure nginx/unicorn to use it, but with Unicorn their current workload doesn't require load-balancing acrobatics beyond what the unix socket already provides.
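The Unicorn side of that is only a few lines of config anyway -- a minimal sketch (paths illustrative), with nginx's upstream pointed at the same socket path:

    # config/unicorn.rb
    worker_processes 16
    listen "/tmp/app.unicorn.sock"   # nginx proxies to this unix socket
    preload_app true                 # load the app once in the master, then fork
    timeout 30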
Similarly, many rails sites use thin/nginx or mongrel/nginx and do not even have a workload that necessitates haproxy.
Plus, haproxy adds another layer of complexity, configuration, and management, which is nice to avoid if you can.
> Actually, antonovka did an excellent job explaining it above. The most important aspect is that threads are "lighter weight" than processes. They use less memory and are quicker to context switch (usually).
Processes and threads have very similar execution models under most Unixes, from what I understand. Threads don't use all that much less memory, either, given a copy-on-write friendly environment (e.g., not Ruby MRI). Perhaps you're confusing threads vs. processes under Windows or Java with threads vs. processes under Unix?
> The result of this lighter weight is that you can spawn more threads than you could processes, on the same hardware. And when you're using one thread/process per connection, that means more concurrent connections on the same hardware. So, if Unicorn used its exact same architecture, but replaced the worker processes with worker threads, you could scale much better.
You're crazy. If Unicorn were 1.) somehow able to take advantage of native threads (it's not), and 2.) moved to a thread-per-connection model instead of a process-per-connection model, it would have basically zero practical impact on the efficiency with which it processes requests. It's already sharing a great deal of its base memory footprint thanks to preloading in the master and fork, so processes don't have significant memory overhead. It's using the kernel to balance requests between processes, so it's not like there's a bunch of IPC between master and worker processes that could be removed with threads.
Say you're running 8 process-per-connection backends on eight cores and each process is 100% CPU bound processing requests. You have a load of 8.0, machine is 100% utilized. If you then change this to a single process with 8 native-thread-per-connection workers, absolutely nothing will change. The load will still be 8.0. You can start more threads to do work but it will have the near exact same effect as starting the same number of processes.
Process-per-connection doesn't fall down under these levels of concurrency. It does eventually - you can't use process-per-connection to solve C10K problems, for instance. But we're talking about Ruby backends, which are always managed as multiple processes due purely to the way Ruby web apps are written (not efficient, not async). Requests execute within a giant request-sized Mutex lock.
And even if threads were considerably more efficient than processes, you still don't want to run a lot of them because each consumes network resources, like database/memcached connections. Using native threads would not let Rails apps spawn thousands (or even hundreds) of thread-per-connection workers. A high concurrency threading model works for Apache (and Varnish and Squid) because the work performed in each thread is fairly simple and doesn't require the kind of network interaction and resource use that an app backend does.
Basically, this notion that threads (even native threads) would be a considerable improvement to Unicorn's design is just all wrong.
> Secondly, haproxy is written in C, which generally means it's going to perform much better and use a lot less memory than a Ruby webserver. This translates, once again, to more output from the same amount of hardware.
Your world seems considerably more simplistic than mine. I'm not even sure how haproxy and unicorn can be compared in any useful way. Unicorn is not a proxy. The master process does not do userland TCP balancing or anything like that. Unicorn is designed to run and manage single-threaded Ruby backends efficiently; HAproxy is a goddam high availability TCP proxy.
> That's why I was pretty surprised to see that github would choose both Ruby and pre-fork over C and threads (or in the case of haproxy, async, which scales even better).
I think you're confusing the types of constraints you have when you're building high concurrency, special purpose web servers and intermediaries (like nginx and haproxy) with the types of constraints you have when you're building a container for executing logic-heavy requests in an interpreted language. Nginx and HAproxy have excellent designs for what they do, and so does Unicorn. They're different because they do different things.
The 16 worker processes only handle dynamic requests, not static. With proper HTTP caching, ESI, etc., it's quite likely that amount of concurrency is more than enough for their needs.
Say that when all the workers are being utilized, the average dynamic request takes 50ms (which is on the high end). That means that each box can handle 320 dynamic requests per second. Which is a decent amount, and I'd be surprised if they see that much traffic.
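Spelling out that arithmetic (numbers purely illustrative):

    workers          = 16
    avg_request_time = 0.050             # 50ms per dynamic request
    workers / avg_request_time           # => 320 requests/sec for the box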
If they are using http caching and ESI properly, the average request time would be significantly lower, and the requests per second would probably be > 500/s on that box.
> If they are using http caching and ESI properly, the average request time would be significantly lower, and the requests per second would probably be > 500/s on that box.
500 req/s is pretty abysmal for an 8 core, 16 gigabyte server. I realize that some of that has to do with Ruby performance, but yikes -- that is frighteningly bad scaling, regardless of the cause.
A number like 500 req/s is irrelevant without an understanding of what the site has to do. For a simple site that does simple database lookups, I'd expect more than that. For a site like GitHub that has to do database lookups and pull large amounts of data from Git repositories, it's an entirely different story. Without knowing the split between cache hits and misses, the number is even more irrelevant. You're arguing in abstracts, whereas I am constrained to actually working with real life. I'd love to see some examples of the type of sites you're running and the solutions you employ wherein 500 req/s of dynamic requests on an 8 core machine are considered "frighteningly bad scaling."
How can you say it's bad without knowing what each request is doing?
Github isn't a microbenchmark.
A concrete example:
I can say that one of my rails sites handles 850 dynamic requests per second running on a single small 1 CPU server. That's because all that particular request does is look up 4k of data from memcached and return it. (i.e. http://www.tanga.com/feeds/current_deal.xml)
However, as a general rule, I know that each small server can handle about 30-50 pages per second, because each page takes a lot of data crunching to generate (and because I haven't bothered to make it as efficient as possible; it's fast enough as is).
If all I was doing was returning a small bit of text that didn't require many lookups or much calculation, then sure, an 8 core CPU with rails could probably do 4-5000 requests per second easily.
"Look at Erlang, which maps lightweight erlang processes to operating system threads, providing SMP scalability at a low cost..."
Erlang can use native threads, but they do not map to its lightweight processes. Erlang's lightweight processes are basically green threads. Native threads are used for SMP scheduling only. So on a four-core system, you might have thousands of Erlang lightweight processes running on just four OS threads in one OS process.
> Erlang can use native threads, but they do not map to its lightweight processes.
That's not what I said; I said: "[Erlang] maps lightweight erlang processes to operating system threads."
Erlang processes are green threads mapped to OS threads. They're called processes in erlang, but they're M:N scheduled threads.
> So on a four-core system, you might have thousands of Erlang lightweight processes running on just four OS threads in one OS process.
Or you could have thousands of Erlang lightweight processes mapped to 185 OS threads (not that it will, but it could). The fact that they're green threads is irrelevant in terms of scalability -- they're still threads, but even less resource intensive than a direct mapping to OS threads. M:N scheduled green threading is also very hard to do in the general case, which is why you don't see more of it (see the abandoned attempts to implement generic M:N OS:green thread scheduling in FreeBSD, for instance).
You could also be using Scala, where a react-based actor doesn't hold a thread while it's waiting; only if it blocks inside its react loop does it consume a whole thread.
The Erlang model is to utilize poll/select/whatever together with a language that contains its own scheduler.
If you execute arbitrary python/ruby/tcl/php code, there's a chance it will block. If you execute arbitrary Erlang code, the chance of that is very low. This means that you can handle all kinds of long running calculations in the same OS process, while you continue to handle incoming requests.
The fact that Erlang farms out a few threads is a later optimization, added to the language several years back in order to take advantage of SMP. However, it only creates a number of threads to match the number of processors, IIRC. After that it shouldn't spawn any more, so I don't think you would particularly call it an example of a 'threaded model'. Erlang's big advantage is the internal scheduler, and a systematic approach to writing code that will never block the Erlang OS process.
Solaris 2-8 implemented a general purpose M:N thread scheduler mapping user-space threads to kernel threads. This is the exact same solution Erlang implements for the less general purpose use-case of actor messaging.
Is Solaris' M:N thread implementation somehow not 'threaded'? If it is threaded (and it is), then how is Erlang's implementation not also an argument for threads?
To elucidate rather than rely on Erlang as an example, the argument for threads over processes:
1) Extremely low-cost alternative to IPC. Threads allow (but do not require the use of) shared mutable state. This is a much, much cheaper way to communicate between concurrent entities. You can achieve the same effect with processes and shared memory, but it's significantly more complex to implement and subject to the disadvantages listed below.
2) Extremely low memory footprint. A thread costs a stack plus minor OS bookkeeping. A fork(2) can leverage COW pages, but almost invariably the number of non-shared pages will be significantly higher than with a thread. If an operation blocks a thread, it's cheap to create more. If an operation blocks a process, you'll hit resource constraints far more quickly trying to fork() more.
Of course, leveraging shared memory and other operating system tools, you can turn a fork(2) implementation into a thread alternative -- but then, you'd have ... threads. This is what Linux's clone(2) syscall was intended for -- a thread-implementation friendly fork(2) alternative -- and the pthread library was built on top of it.
... for a while, you could call setuid() in a Linux "thread" and the new uid would be a thread-local change, because the thread was actually a 'process'.
This is effectively not an argument against fork(2) (although it's an expensive route to the same solution offered by threads), but rather against scaling models that will block an entire OS process for the sake of a single request.
> leveraging shared memory and other operating system tools, you can turn a fork(2) implementation into a thread alternative -- but then, you'd have ... threads
There's an important difference: in the threading model, all memory is shared by default, whereas in the processes + shm model, you need to explicitly allocate and access shared memory regions. Because sharing mutable state is such an important design consideration, being explicit about what is being shared improves reliability by reducing the chance of unintentional sharing. A related benefit is that processes are protected from one another by the virtual memory mechanism: it is much more feasible for a service to recover from a crashed process than to recover from a crashed thread (which typically takes down the entire process).
Threads are a 'nice extra' for Erlang. They're definitely a plus, no arguments there, but in your M:N mapping, N is fixed at the number of CPUs in the system. So if you have one CPU, Erlang doesn't really use threads at all, and yet it is still very scalable. The scalability on our one-CPU system comes from being able to use a nice, compact, select/poll based system and keep everything in one OS process, without spawning additional threads or OS processes.
So, threads are really just an optimized case of launching N Erlang VMs, where N is the number of processors, and splitting the work between them. They were a late addition to the language too - this only happened several years ago, if I recall correctly.
In other words, threads are a nice boost to Erlang, but not really the secret to its success.
> In other words, threads are a nice boost to Erlang, but not really the secret to its success.
I think you missed my meaning. On FreeBSD 4.x and Linux 2.0.x, threads were implemented entirely in user-space. They didn't allow for true concurrent execution of multiple threads on multiple CPUs, but they were still very much threads.
Likewise, erlang's lightweight processes are exactly the same -- threads. The fact that they can be modeled to arbitrary numbers of operating system threads is irrelevant to the nature of the model -- one in which execution of a thread can proceed independently of others, implemented via context switching of execution state across those threads, able to access shared state.
However, in the fork model as described in the original article, processes are the only form of concurrency. A blocked request will, in turn, block a process, and unlike threads, far fewer processes may be run concurrently due to their significantly increased resource constraints. Furthermore, those processes are much more limited in their ability to implement low-cost inter-thread communication via shared mutable state.
If the processes actually relied on their own internal M:N scheduled green threads, then at least that part of the concurrency problem would be (mostly) solved, and fewer processes would be required. IPC is still an issue, and of course, there's the multi-decade demonstration of the high difficulty in implementing a 'performant' general-purpose M:N scheduled thread system.
> Likewise, erlang's lightweight processes are exactly the same -- threads. The fact that they can be modeled to arbitrary numbers of operating system threads is irrelevant to the nature of the model -- one in which execution of a thread can proceed independently of others, implemented via context switching of execution state across those threads, able to access shared state.
From Wikipedia:
> The Erlang virtual machine has what might be called 'green processes' - they are like operating system processes (they do not share state like threads do) but are implemented within the Erlang Run Time System (erts). These are sometimes (erroneously) cited as 'green threads'.
Going back to Ruby, sure, threads might be theoretically better than processes, but in practical terms, other things are the real bottleneck, so it ends up not really mattering that much.
> The Erlang virtual machine has what might be called 'green processes' - they are like operating system processes (they do not share state like threads do) but are implemented within the Erlang Run Time System (erts). These are sometimes (erroneously) cited as 'green threads'.
Erlang processes' implementations share internal mutable state, despite the fact that the inter-process messaging API available via the runtime does not expose this. They have to -- internal cooperative context switching can't be implemented without shared mutable state.
They're threads.
> Going back to Ruby, sure, threads might be theoretically better than processes, but in practical terms, other things are the real bottleneck, so it ends up not really mattering that much.
I'm not sure I follow. Is this something like "ruby is really, really slow, so no need to worry about concurrency" ?
I don't understand why people insist on architectures where otherwise-independent processes share a single socket.
You're already running a reverse proxy in front of them! There's no reason each Unicorn couldn't be listening on a different port. Does that third layer of local load-balancing between the HTTP proxy and the event-driven app server actually get you anything?
It's a general result in queueing theory that you want one global queue as early as possible in the system. This is because requests don't take exactly the average service time to clear backends: it's a distribution. The more workers that can pull from a queue, the less the worst-case service times affect the average service time.
One queue per worker with no global queue is the worst configuration and should be avoided if at all possible. Anyone who's run large reverse proxy installs knows this pain well.
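You can see the effect with a toy simulation (a hypothetical sketch, not a model of anyone's real setup): same arrivals and service times, with jobs either pulled from one shared queue or dealt round-robin into per-worker queues:

    srand(1)
    n, workers = 50_000, 4
    mean_svc   = 0.95 * 0.05 + 0.05 * 1.0            # mostly 50ms, occasionally 1s
    service    = Array.new(n) { rand < 0.05 ? 1.0 : 0.05 }

    # Poisson-ish arrivals at ~90% utilization across the workers.
    t = 0.0
    arrivals = Array.new(n) { t += -Math.log(1 - rand) * mean_svc / (workers * 0.9) }

    def mean_wait(arrivals, service, workers, shared:)
      free  = Array.new(workers, 0.0)                # when each worker next goes idle
      waits = arrivals.zip(service).each_with_index.map do |(a, s), j|
        i       = shared ? free.index(free.min) : j % workers
        start   = [a, free[i]].max
        free[i] = start + s
        start - a
      end
      waits.sum / waits.size
    end

    puts "one shared queue:  %.3fs mean wait" % mean_wait(arrivals, service, workers, shared: true)
    puts "per-worker queues: %.3fs mean wait" % mean_wait(arrivals, service, workers, shared: false)

The shared queue comes out far ahead, because an occasional slow request only delays the one worker it's running on, not every job that happened to be dealt into that worker's queue.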
The ideal system would be for the balancer machine(s) to hold the requests, and for backends to pull them in a sort of ping/pong fashion. I gather fuzed runs in a pattern like this, though I've not used it.
Telecom folks have analyzed this stuff in detail for the better part of a century. There's a lot of theory out there, and it's surprisingly practical and applicable to real-world web applications.
You read me correctly, I don't see why load balancing should be taking place down at the app server level.
Aren't you going to have multiple machines each with their own blessing of unicorns? You're still going to have to use some kind of load balancer in front of independent sockets.
Replacing N ports with one simplifies configuration.
I never understood the complex HAProxy in front of Apache in front of Nginx in front of Mongrel type setups that seem to be popular in the Rails world. Why not just use Unicorn? What value is GitHub getting from having Nginx in front?
Because Ruby 1.8 threading sucks, you pay a large memory price (i.e., a process) for each concurrent request in flight. A fronting proxy allows your backends to write out the response as fast as possible and move on to another request while the proxy spoon-feeds the response to slow clients.
Also, nginx is going to be more efficient at serving static files, though most larger apps will have broken such requests out to a separate set of domains, likely serviced by a CDN.
It seems to me like the Passenger guys could easily add an option so that on a "touch tmp/restart.txt" a set of new worker processes is started before the old ones are killed off. I imagine this would make this slowness a thing of the past. For the record, my apps experience this momentary queueing and slowness on restart (5 seconds max).
To clarify, my understanding is that while rails is restarting, Passenger will queue all the requests that come in and begin processing them as soon as rails is ready. But yeah, zero downtime, it's pretty awesome.
Yeah. We load balance across 5 Apache/Passengers and I do rolling deploys (all in Capistrano) by removing a Passenger from load balancing, updating the app, restarting Apache, and adding it back into load balancing with a 10 second delay between each. We tried the Passenger touch restart.txt and that didn't go well at all when we were under load.
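In plain-Ruby pseudo-form, that rolling loop is just (host names and helper commands hypothetical -- the real thing is a set of Capistrano tasks):

    HOSTS = %w[app1 app2 app3 app4 app5]            # hypothetical host names

    HOSTS.each do |host|
      system("ssh lb remove-from-pool #{host}")      # hypothetical LB helper
      system("ssh #{host} 'cd /var/www/app && git pull'")
      system("ssh #{host} 'sudo apachectl graceful'")
      system("ssh lb add-to-pool #{host}")           # hypothetical LB helper
      sleep 10                                       # settle before the next host
    end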
He has been kind enough to share quite a lot about his setup lately. In fact, it looks like it's on the HN homepage right now: http://news.ycombinator.com/item?id=872301
Also, if you're using Monit to watch the memory bloat of your passenger processes, when you kill one process it tends to do the same stalling thing on all your processes until the ONE you killed is back up.