We've used both round robin and least outstanding connections (LOC) in AWS and found many really weird phenomena over the years.
In general, LOC is more forgiving for uneven workloads, especially for Ruby services. If a slow request hits, that machine gets less traffic than the others and gets some room to recover and burn off its queue.
Where we found weird problems was at the extremes: very low concurrency and exceptionally high concurrency.
Low concurrency meant that one host consistently had 2 connections and another consistently had 3. That's 50% more load on one host. I believe AWS has fixed this and now tie-breakers are resolved randomly, but it was kinda painful to deal with. Our solution was to scale vertically and run fewer hosts. 19 on one host vs 20 on another was less jarring.
The other extreme was 1k+ websocket connections per host. When we autoscaled, it would add a new host. That host would have 0 connections and so the next 1k connections would all go to the new host. For that system, we changed it back to round robin.
> Low concurrency meant that one host consistently had 2 connections and another consistently had 3. That's 50% more load on one host. I believe AWS has fixed this and now tie-breakers are resolved randomly, but it was kinda painful to deal with. Our solution was to scale vertically and run fewer hosts. 19 on one host vs 20 on another was less jarring.
I assume you're saying that this happened over a period much longer than the lifetime of a connection? If all of the connections are super long lived, I'm not really sure what else the load balancer could do, other than forcibly close one and make the client swap to the other server, but naively that doesn't really seem like something I'd want a load balancer to do.
Even more basic, wouldn't that just make the split 3:2 instead of 2:3 (aka, the same thing)? I don't think we can give each server 2.5 connections (though that would be cool!) ;P.
Yeah, I guess what I meant is that forcing one connection to swap back and forth at some interval would make it 1:1 on "average" over a longer period of time, but it seems presumptuous to think that would perform better, since the cost of having to re-establish a connection each time and then potentially communicate any "state" that would be needed to resume from whatever context was being executed before doesn't seem like a good tradeoff (not to mention the huge increase in complexity of the logic the client would need to have to handle this).
Alternate the incoming connections from one server to the other. That gives the one with two connections some time to clear its queue while the other queue may be getting longer.
The problem was that it was the same one machine that always had the extra 50% traffic. The eventual fix from Amazon was to randomly select in a tie breaker so that each machine had, on average, 2.1 concurrent requests (assuming 10 machines).
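That fix is easy to picture. A minimal sketch (names and data structures are mine, not AWS's): instead of always taking the first backend at the minimum count, collect everything tied at the minimum and pick one of those at random.

    import random

    # Hypothetical in-flight counts per backend, not AWS's actual data model.
    connections = {"host-a": 2, "host-b": 2, "host-c": 3}

    def pick_first_at_minimum(conns):
        # Old behaviour: min() always returns the first tied backend,
        # so the same host keeps winning every tie.
        return min(conns, key=conns.get)

    def pick_random_tiebreak(conns):
        # Fixed behaviour: choose uniformly among all backends tied at the minimum.
        lowest = min(conns.values())
        candidates = [b for b, c in conns.items() if c == lowest]
        return random.choice(candidates)

    print(pick_first_at_minimum(connections))   # always "host-a"
    print(pick_random_tiebreak(connections))    # "host-a" or "host-b"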
I can't seem to find the article, but I remember some FAANG - FB I think - having an excellent blog post about a very hard-to-track-down bug in their LB where performance fell off a cliff. They eventually tracked it down to an LRU and realised that the least loaded server was actually the least hot, and performance was substantially better if they ran a small number of servers at capacity rather than a large number of servers all at lower utilization.
Funny you mention that. When I worked there I identified a bug in an internal load balancer where new requests were being sent to the most overloaded hosts. It was far worse than if load balancing had been completely random and meant tens of thousands of servers were mostly sitting idle. The few lines of code to fix that bug likely amounted to the biggest financial impact I’ve had on a company in my whole career. Unfortunately, I didn’t have time to quantify the fix in monetary terms because I was in the midst of a reorg and the bug was totally unrelated to my main project. Kind of hilarious in retrospect to think about how such a giant problem went unnoticed for so long.
By LRU you mean caching/preloading being terrible (and there being a lot of latency) because there wasn’t enough load to allow things to properly preload?
For sufficiently high volume services where loads can be uneven, we had a lot of success with thermostat adaptive balancing. The backend node reports how hot it is to the LB and gets a relative weight. If it can take more connections, it raises its relative weight, as it takes on more load, it lowers its relative weight.
This worked well on a fleet of a couple thousand servers handling thousands of requests per second each. I can't recall why, but lower volume and few servers caused uneven load, and we went back to least connections for those clusters.
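Roughly the shape of that scheme, as a sketch under my own assumptions rather than the actual implementation: each backend periodically adjusts an advertised weight based on how hot it is, and the balancer makes a weighted random pick.

    import random

    class Backend:
        def __init__(self, name):
            self.name = name
            self.weight = 1.0  # relative weight, adjusted by the backend itself

        def report(self, utilization):
            # "Thermostat" feedback: as the node heats up (utilization -> 1.0),
            # it lowers its advertised weight; as it cools down, it raises it.
            self.weight = max(0.05, 1.0 - utilization)

    def pick(backends):
        # Weighted random choice proportional to each backend's advertised weight.
        return random.choices(backends, weights=[b.weight for b in backends], k=1)[0]

    pool = [Backend("a"), Backend("b"), Backend("c")]
    pool[0].report(0.9)   # hot node, takes little new traffic
    pool[1].report(0.3)
    pool[2].report(0.1)   # coolest node, takes the most
    print(pick(pool).name)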
> The backend node reports how hot it is to the LB and gets a relative weight. If it can take more connections, it raises its relative weight, as it takes on more load, it lowers its relative weight.
Some load balancers support readiness checks. When a backend is over-worked it can fail readiness to excuse itself from more work. We implement one variation of this by looking at load metrics (cpu, disk, mem, net) [0]. readiness is binary (ok / !ok) and not weighted (ex: no. of conns). The backend, however, is free to load shed (reject already sent requests) as it sees fit.
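As a rough illustration of that pattern (the endpoint name and threshold are assumptions, not the actual setup): the backend exposes a readiness endpoint that starts failing when local load crosses a cut-off, so the LB stops handing it new work until it cools down.

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    CPU_THRESHOLD = 0.85  # hypothetical cut-off per core

    class Readiness(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/ready":
                self.send_response(404)
                self.end_headers()
                return
            # Binary ok / !ok: 1-minute load average per core (Unix only).
            load_per_core = os.getloadavg()[0] / os.cpu_count()
            if load_per_core < CPU_THRESHOLD:
                self.send_response(200)   # keep sending traffic
            else:
                self.send_response(503)   # backend excuses itself for a while
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8081), Readiness).serve_forever()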
Load averages can be very useful, but it's problematic to combine them with something like readiness checks. I must point out the corresponding comment in loadavg.c which is required reading:
The parent's concept of reporting "hotness" is really powerful. It lets a central authority (the load balancer in this scenario) make an informed decision about where to send traffic: the coolest backend. In the case where all systems are actually quite hot, it probably makes sense in most scenarios to continue sending traffic as best-effort.
In the case of readiness checks, nacking a check removes the backend from the pool of eligible new-traffic recipients. At least this is true for systems I've encountered. The solution you propose is problematic as it's possible for all systems to decline traffic resulting in no available backends -- rarely a desirable state -- which the load balancer usually can't do much about.
A good system will allow you to decouple the signals (loadavg, connections, latency, CPU, etc.) and the health decision (readiness checks) and the interpretation of that decision (load balancer policy) to provide the best of both these worlds.
Now, there are many scenarios where simply excusing yourself works fine -- processing expensive batch jobs, for example -- so this can work nicely, but for typical production traffic scenarios I'd advise against this approach.
presumably you want your service to be up. If you can smooth the load out to other nodes, your service stays up. If you have a bad strategy, you start losing capacity as nodes get removed.
Let's say each node can handle 5k rps and you have 10 nodes. You can handle 50k rps. If you are receiving 40k rps, a good strategy will put each node at 80% capacity. A bad strategy will knock a node out, reducing your total capacity, putting extra pressure on the rest of the system, causing more failures, and more pressure. This is a cascading failure (often loosely called a thundering herd).
At some point, your only option is load shedding. But with a bad LB strategy, you start load shedding much much earlier than you should. This is a bad experience for customers that is avoidable with good LB strategies.
Fair. Our solution is to make sure a biased load balancer (like in TFA) isn't sending the workload towards a select few machines while others may not be as overworked, in as simple a way as possible.
We run load balancers in a fail open mode. As in, if every backend is excusing itself, then none are excused.
But as you point out, load balancing is a hairy beast unto itself.
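The fail-open rule itself is small enough to show directly (a sketch, not their actual LB code): readiness failures are only honoured while at least one backend is still passing.

    def eligible_backends(backends, ready):
        """backends: list of names; ready: name -> bool from readiness checks."""
        passing = [b for b in backends if ready.get(b, False)]
        # Fail open: if every backend is excusing itself, ignore readiness
        # and keep the whole pool eligible rather than serving nothing.
        return passing if passing else list(backends)

    print(eligible_backends(["a", "b", "c"], {"a": False, "b": False, "c": False}))
    # -> ['a', 'b', 'c']  (if all are excused, none are excused)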
in our case, the idea was not to shed all load (bool on/off), but to lower load slowly and to smooth traffic. Another confounding factor was that there were multiple services running on a single node and any of the services may be the one needing to push back a bit. We were able to have each service control the load onto the given node this way -- and if all nodes reported they were overloaded, they would all end up with the same weight meaning we kept processing as much as possible. If we just pulled them from the lb, you could crash the entire cluster during a period of overload.
> lower volume and few servers caused uneven load, and we went back to least connections for those clusters.
The more traffic you’re handling, the more the central limit theorem applies as you are summing the behavior of lots of random events drawn from various random distributions, and the more the system behavior regresses to the mean.
I think of it as analogous to how when an engine is idling it tends to pop and judder, but as you rev it up the smoother it sounds until at sustained high revs it can make almost a pure note. All the variations are smoothed out and the system runs much more evenly and predictably, but when it drops to idle speed the subtle variation in the fuel air mix and the pressure in the exhaust and inlet every stroke makes it chug and sputter chaotically.
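A toy simulation makes the smoothing visible (the numbers and cost distribution are arbitrary): randomly assign requests with heavy-tailed costs to servers, and the spread of per-server load relative to the mean shrinks as volume grows.

    import random, statistics

    def relative_spread(num_requests, num_servers=10):
        # Random assignment of requests with heavy-ish tailed cost.
        load = [0.0] * num_servers
        for _ in range(num_requests):
            load[random.randrange(num_servers)] += random.paretovariate(2.5)
        return (max(load) - min(load)) / statistics.mean(load)

    for n in (100, 1_000, 10_000, 100_000):
        print(n, round(relative_spread(n), 3))
    # The relative spread drops as volume grows: the more random events you sum,
    # the closer every server's total sits to the mean.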
To be clear: by "hot", you literally mean physical temperature? That's rather ingenious, I like it! I wonder if you could also measure power consumption.
As an aside, I worked on path tracing[1] long ago, and there were various schemes to avoid the "clumpiness" you'd get by using purely random numbers.
This included regular stratified sampling[2] as well as using quasi-random low-discrepancy sequences[3] like the Sobol sequence[4], which would "space out" the random samples.
Probably not terribly relevant for load balancing, but I did find the quasi-random stuff fascinating. How to make something "random enough" without being "too random".
edit: Though now I got a shower thought moment, could stratified sampling be used as a variant of the "two choices", even if each load-balancer worked independently? That is, each load-balancer stratifies the resources, and keeps a separate index to the current bin. Then for each request pick a random/suitable resource from the current bin and increment the bin index, wrapping when needed. Hmm...
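For what it's worth, one way that shower thought could look in code (everything here is my own guess at the details, nothing from the article): partition the backends into bins, cycle through the bins, and pick randomly within the current bin, so consecutive picks can't clump onto the same few hosts.

    import random

    class StratifiedPicker:
        """Hypothetical stratified picker: cycles through fixed bins of backends,
        choosing randomly within the current bin."""

        def __init__(self, backends, num_bins):
            random.shuffle(backends)
            self.bins = [backends[i::num_bins] for i in range(num_bins)]
            self.current = 0

        def pick(self):
            chosen = random.choice(self.bins[self.current])
            self.current = (self.current + 1) % len(self.bins)  # advance, wrap
            return chosen

    picker = StratifiedPicker([f"host-{i}" for i in range(12)], num_bins=4)
    print([picker.pick() for _ in range(8)])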
If every request took exactly the same amount of cpu time and wall time and same amount of ram and other resources, then random load-balancing is just fine for a sufficiently large number of machines. But therein lies the problem. Every request's cost is different. And they have noisy neighbor consequences to other requests on the same machine etc. Hence, any naive load-balancing will lead to hotspots. If every request/response contained a X-Health: header that is understood by the load-balancer to decide how much headroom is available and if every request also had deadlines, then load-balancer can do interesting things with it to avoid hotspots, drive near-uniform utilization, avoid bad tail-latencies and provide near-ideal back pressure to upstream.
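There is no standard header like that as far as I know, so the header name, the 0-100 headroom scale, and the smoothing below are all made up, but the feedback loop could look roughly like this: backends attach a headroom score to every response, and the balancer folds it into its routing weights.

    import random

    class FeedbackBalancer:
        """Hypothetical X-Health feedback loop; not an existing standard."""

        def __init__(self, backends, alpha=0.2):
            self.headroom = {b: 100.0 for b in backends}  # optimistic start
            self.alpha = alpha

        def on_response(self, backend, response_headers):
            reported = float(response_headers.get("X-Health", 50))
            # Smooth with an EWMA so one noisy sample doesn't swing routing.
            old = self.headroom[backend]
            self.headroom[backend] = (1 - self.alpha) * old + self.alpha * reported

        def pick(self):
            backends = list(self.headroom)
            weights = [max(self.headroom[b], 1.0) for b in backends]
            return random.choices(backends, weights=weights, k=1)[0]

    lb = FeedbackBalancer(["a", "b"])
    lb.on_response("a", {"X-Health": "10"})   # "a" reports almost no headroom
    print(lb.pick())                          # mostly "b" from now on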
> If every request took exactly the same amount of cpu time and wall time and same amount of ram and other resources, then random load-balancing is just fine for a sufficiently large number of machines.
The article points out random load balancing scales sub-linearly with increase in capacity?
> If we distribute a set of items randomly across a set of servers (e.g. by hashing, or by randomly selecting a server), the average number of items on each server is num_items / num_servers. It is easy to assume this means each server has close to the same number of items. However, since we are selecting servers at random, they will have different numbers of items, and the imbalance can be important.
> ...
> This is a classic balls in bins problem... the summary is that the imbalance depends on the expected number of items per server (that is, num_items / num_servers). This means workload is more balanced with fewer servers, or with more items. This means that dividing a set of items over more servers makes the distribution more unfair, which is a reason we can get worse than linear scaling of a distributed system.
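The quoted effect is easy to reproduce with a toy simulation (mine, not the article's code): spread the same number of items over more and more servers and watch the busiest server drift further above the mean.

    import random
    from collections import Counter

    def imbalance(num_items, num_servers, trials=100):
        # Average ratio of the busiest server's item count to the mean count.
        ratios = []
        for _ in range(trials):
            counts = Counter(random.randrange(num_servers) for _ in range(num_items))
            ratios.append(max(counts.values()) / (num_items / num_servers))
        return sum(ratios) / trials

    # Same 10,000 items over more and more servers: the imbalance gets worse.
    for servers in (10, 100, 1000):
        print(servers, round(imbalance(10_000, servers), 2))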
The client doesn’t know anything about your request implementation, so sending that via a header doesn’t make sense, unless it’s added by some middleware that does.
The load balancer could keep track of how long requests for a specific path tend to take, and what servers are currently handling and use that information to be smarter about routing requests. I wouldn’t be surprised if that already exists in things like haproxy or nginx.
There are issues where people stuff data into the URL, like ids or keywords for SEO. But that's possible to work around.
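I'm not aware of a stock haproxy or nginx feature that tracks this per path, so the sketch below is purely illustrative: keep an exponentially weighted moving average of response time per (path, server) and weight routing toward servers that have recently been fast for that path.

    import random
    from collections import defaultdict

    class PathAwareBalancer:
        """Illustrative only: EWMA of response time per (path, server)."""

        def __init__(self, servers, alpha=0.3):
            self.servers = servers
            self.alpha = alpha
            self.ewma = defaultdict(lambda: 0.1)  # seconds; optimistic default

        def record(self, path, server, seconds):
            key = (path, server)
            self.ewma[key] = (1 - self.alpha) * self.ewma[key] + self.alpha * seconds

        def pick(self, path):
            # Weight inversely to the expected latency for this path.
            weights = [1.0 / self.ewma[(path, s)] for s in self.servers]
            return random.choices(self.servers, weights=weights, k=1)[0]

    lb = PathAwareBalancer(["s1", "s2"])
    lb.record("/report", "s1", 2.0)   # s1 has been slow for /report
    print(lb.pick("/report"))         # usually s2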
There are other techniques that are optimal like Parallel Depth First scheduling from Blelloch.
Since then there has been a lot of work for distributed work-stealing.
Boggles my mind that at "cloud-scale" people still use round robin and assume uniform workloads and that at the server level there is uniform load as well.
I wish load balancers would default to one of these optimal algorithms. When I call a "sort" function from the standard library, 99% of the time I don't have to go read papers to pick which sorting algorithm to use. The default algorithm is good for most cases. EDIT: probably because many of them are adaptive algorithms.
Seems many load balancer providers missed that memo.
The issue is that the spread of actual cost of requests is usually extreme and opaque, and system load/capacity information is often very different between nodes.
And there is no standard way I’m aware of to usefully report back or communicate this data to the LB.
Probably because this would require both the load balancer and the webserver to support a custom protocol. My first guess is that some service meshes have an API that you could use for this.
All things which have too few bins under random are at risk of asymmetry. Some LB situations take some approximation to load/responsiveness, and back off the counts as a function of observed pace of response. TCP RTT will do, size of the pending Queue for async responses too, time in queue, there's lots of measures. Your liveness check to the backends probably provides this simply by how responsive to heartbeat it is.
Given tasks do not run in identical time, there's going to be variance come what may.
I would suggest a pool of 3 should tend to be more stable. But your platform people may be recommending that 3 and below need to be sized to work as 1, and only 4 and above can start to assume a minimum cohort (2?) exists.
Sharding is not LB. It's just a hash approximation to grouping things. Maybe you're telling a story about using a random() to emulate how far out of 50/50 a 2 part sharding story can get?
Routers doing LB may well do simple prefix triage. You may be stuck on one side of the LB because of who you are, and nothing's going to change it.
(I don't do this for a living, wiser people may tell you I'm wrong)
It sounds like part of the issue is that people think random == evenly (but unpredictably) distributed on a specific (and relatively short) timescale.
For a truly random source, the odds of getting the same number in the next roll are identical to getting it on any prior roll. For example, flipping a coin twice is as likely to get you heads then heads as it is to get you heads then tails.
Eventually things even out. But if it's truly random, when that happens should also be completely unpredictable. It might end up being after the service is deprecated, for instance. You might end up never getting a specific 'desired' answer. For instance, if the test is flipping a coin and we do it 10 times, there is a small chance we never get tails and get 10 heads in a row from a truly random source.
That is working as expected.
Usually (at least with load balancing) they instead want a smoother distribution, where it’s less likely to get the same answers they just got, when they just got them.
So for example, they’d prefer that if they just got a heads they’re more likely to get a tails next (but it won’t always happen).
Notably, this isn’t really random. It’s also a really hard thing to do in practice without also getting all nodes in lockstep, being predictable to an outside attacker, or having internode communication. Maybe impossible, I don’t remember.
Typically network LB will look at the entire layer 4 tuple to overcome things like NAT clustering traffic from many users into one place. Not to say IP lb doesn't happen, but with NAT being de-facto almost nobody does it IME.
Worth noting the difference between an AWS Application Load Balancer (ALB) that is HTTP request aware, and Network Load Balancer (NLB) which is not, when load balancing HTTP traffic.
AWS ALB (and others I'm sure) can balance by "Least outstanding requests" [1] which means a server with the least in-flight HTTP requests (not keep-alive network connections!) to an app server will be chosen.
If the balancer operates on the network level (eg NLB) and it maintains keep-alive connections to servers, the balancing won't be as even from a HTTP request perspective because a keep-alive connection may or may not be processing a request right now and so the request will be routed based on number of TCP connections to app servers, not current HTTP request activity to them.
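The distinction comes down to what you count. A minimal sketch (not AWS code): in-flight requests are incremented on dispatch and decremented on completion, independent of how many keep-alive TCP connections happen to be open.

    import random

    class LeastOutstandingRequests:
        """Counts in-flight HTTP requests, not open keep-alive connections;
        an idle keep-alive connection contributes nothing here."""

        def __init__(self, servers):
            self.in_flight = {s: 0 for s in servers}

        def acquire(self):
            lowest = min(self.in_flight.values())
            tied = [s for s, n in self.in_flight.items() if n == lowest]
            server = random.choice(tied)      # random tie-break
            self.in_flight[server] += 1       # request dispatched
            return server

        def release(self, server):
            self.in_flight[server] -= 1       # response completed

    lb = LeastOutstandingRequests(["app-1", "app-2"])
    s = lb.acquire()
    # ... send the HTTP request over whatever connection is handy ...
    lb.release(s)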
It's surprising that "least requests" is still so much less commonly used than random or round-robin. It's a simple metric and yet it handles work-sharing remarkably well, without any need for servers to communicate their load back to the LB. (As you correctly say, the LB needs to be aware of idle/keep-alive connections in order to balance properly.)
[Also, to be pedantic, it should really be called "fewest requests" - but I've never seen a product call it that!]
The more unrealistic assumption is that every work item or request takes uniform processing. Things really blow up when you have a pile of p99 slow requests.
This. Load balancing based on count assumes every request is served in equal time and resources.
I think what the article misses highlighting is that this is about stateless load balancing which improves throughput at the LB, at the cost of potentially suboptimal distribution.
The other nit I have is calling hashing random; it's not, and a lot goes into selecting hash algorithms for LB that don't result in mass movement of requests every time a server is added or removed from the pool.
If the author wants stateless balanced load and assumes all requests are equal, why not use an approach like round robin?
Beyond that, you can get into more stateful methods that look at CPU use on servers, or any other metric you like to load balance on the resource you care about, ie memory or CPU vs number of open sockets.
Round Robin can get weird when you have common factors between the number of subrequests for a single request and the number of servers. I’ve seen everything from certain pages reliably blowing up in prod to weird caching bugs that don’t trigger in preproduction because the same box sees all of the requests.
Developers have a nasty habit of assuming a glitch or someone else changed something and if a reload fixes it they don’t note the pattern.
In load balancing they call this persistence. You are describing a case where you should persist to the server you land on.
In this scenario, unless each request is exactly the same you will end up with uneven load as some sessions persist longer than others.
This can still be stateless with use of a cookie for protocols that support it but is not the idealized "round robin and every request takes same time and resources".
The other alternative is a stateless app that is architected for round robin lb.
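For the cookie variant mentioned above, the balancer itself can stay stateless; a sketch with a made-up cookie name: set a cookie naming the chosen backend on the first response and honour it on later requests.

    import random

    SERVERS = ["app-1", "app-2", "app-3"]
    COOKIE = "lb_backend"   # hypothetical cookie name

    def route(request_cookies):
        # The routing decision lives in the client's cookie,
        # not in a session table on the load balancer.
        sticky = request_cookies.get(COOKIE)
        if sticky in SERVERS:            # returning client: honour the cookie
            return sticky, None
        chosen = random.choice(SERVERS)  # new client: pick and pin a backend
        return chosen, f"{COOKIE}={chosen}; Path=/; HttpOnly"

    backend, set_cookie = route({})              # first request
    backend_again, _ = route({COOKIE: backend})  # follow-up sticks to the same box
    print(backend, backend_again, set_cookie)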
Chained microservices exacerbate this, and people give you blank looks when you say it. Then you have to draw a picture, because if that also gets blank looks, then they have no idea what the Central Limit Theorem is.
So, the real problem isn’t noted until towards the end which is that these methods require keeping some state, thus making the requests more expensive.
It’s a good trade off when throughput is low and the impact of lumpy traffic is high. At higher volumes, random allows stateless load balancing.
Use the tool that makes sense for your project. Even at the same system I’ll have some services that operate at low volumes with high latency and others with high volume and low latency.
> DNS should perform round-robin in the order of the IPs in the UDP packet.
That only works if the clients choose the first IP address in the response to connect to. But IPv6, IIUC, specifies that clients instead always use the IPv6 address they have the longest stretch of bits in common with, compared to their own IPv6 address.
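That's the longest-matching-prefix rule from IPv6 default address selection (RFC 6724, if I have the number right). A small illustration of just that one rule, ignoring the other selection rules real resolvers apply first:

    import ipaddress

    def common_prefix_len(a, b):
        # Number of leading bits the two addresses share.
        xor = int(ipaddress.IPv6Address(a)) ^ int(ipaddress.IPv6Address(b))
        return 128 - xor.bit_length()

    def prefer(candidates, source):
        # Pick the candidate sharing the longest prefix with our own address.
        return max(candidates, key=lambda c: common_prefix_len(c, source))

    print(prefer(["2001:db8:1::10", "2001:db8:2::10"], source="2001:db8:1::99"))
    # -> 2001:db8:1::10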
There is an interactive visualization that demonstrates different load balancing techniques, but I cannot for the life of me find it again. Maybe someone here knows which page I'm talking about and could provide a link?