We've used both round robin and least outstanding connections (LOC) in AWS and found many really weird phenomena over the years.
In general, LOC is more forgiving for uneven workloads, especially for Ruby services. If a slow request hits, that machine gets less traffic than the others and gets some room to recover and burn off its queue.
Where we found weird problems was at the extremes: very low concurrency and exceptionally high concurrency.
Low concurrency meant that one host consistently had 2 connections and another consistently had 3. That's 50% more load on one host. I believe AWS has fixed this and now tie-breakers are resolved randomly, but it was kinda painful to deal with. Our solution was to scale vertically and run fewer hosts. 19 on one host vs 20 on another was less jarring.
The other extreme was 1k+ websocket connections per host. When we autoscaled, it would add a new host. That host would have 0 connections and so the next 1k connections would all go to the new host. For that system, we changed it back to round robin.
> Low concurrency meant that one host consistently had 2 connections and another consistently had 3. That's 50% more load on one host. I believe AWS has fixed this and now tie-breakers are resolved randomly, but it was kinda painful to deal with. Our solution was to scale vertically and run fewer hosts. 19 on one host vs 20 on another was less jarring.
I assume you're saying that this happened over a period much longer than the lifetime of a connection? If all of the connections are super long lived, I'm not really sure what else the load balancer could do, other than forcibly close one and make the client swap to the other server, but naively that doesn't really seem like something I'd want a load balancer to do.
Even more basic, wouldn't that just make the split 3:2 instead of 2:3 (aka, the same thing)? I don't think we can give each server 2.5 connections (though that would be cool!) ;P.
Yeah, I guess what I meant is that forcing one connection to swap back and forth at some interval would make it 1:1 on "average" over a longer period of time, but it seems presumptuous to think that would perform better, since the cost of having to re-establish a connection each time and then potentially communicate any "state" that would be needed to resume from whatever context was being executed before doesn't seem like a good tradeoff (not to mention the huge increase in complexity of the logic the client would need to have to handle this).
Alternate the incoming connections from one server to the other. That gives the one with two connections some time to clear its queue while the other queue may be getting longer.
The problem was that it was the same one machine that always had the extra 50% traffic. The eventual fix from Amazon was to randomly select in a tie breaker so that each machine had, on average, 2.1 concurrent requests (assuming 10 machines).
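That fix is easy to picture. A minimal sketch (names and data structures are mine, not AWS's): instead of always taking the first backend at the minimum count, collect everything tied at the minimum and pick one of those at random.

    import random

    # Hypothetical in-flight counts per backend, not AWS's actual data model.
    connections = {"host-a": 2, "host-b": 2, "host-c": 3}

    def pick_first_at_minimum(conns):
        # Old behaviour: min() always returns the first tied backend,
        # so the same host keeps winning every tie.
        return min(conns, key=conns.get)

    def pick_random_tiebreak(conns):
        # Fixed behaviour: choose uniformly among all backends tied at the minimum.
        lowest = min(conns.values())
        candidates = [b for b, c in conns.items() if c == lowest]
        return random.choice(candidates)

    print(pick_first_at_minimum(connections))   # always "host-a"
    print(pick_random_tiebreak(connections))    # "host-a" or "host-b"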
I can't seem to find the article, but I remember some FAANG - FB I think - having an excellent blog post about a very hard-to-track-down bug in their LB where performance fell off a cliff. They eventually tracked it down to an LRU and realised that the least loaded server was actually the least hot, and performance was substantially better if they ran a small number of servers at capacity rather than a large number of servers all at lower utilization.
Funny you mention that. When I worked there I identified a bug in an internal load balancer where new requests were being sent to the most overloaded hosts. It was far worse than if load balancing had been completely random and meant tens of thousands of servers were mostly sitting idle. The few lines of code to fix that bug likely amounted to the biggest financial impact I’ve had on a company in my whole career. Unfortunately, I didn’t have time to quantify the fix in monetary terms because I was in the midst of a reorg and the bug was totally unrelated to my main project. Kind of hilarious in retrospect to think about how such a giant problem went unnoticed for so long.
By LRU you mean caching/preloading being terrible (and there being a lot of latency) because there wasn’t enough load to allow things to properly preload?
For sufficiently high volume services where loads can be uneven, we had a lot of success with thermostat adaptive balancing. The backend node reports how hot it is to the LB and gets a relative weight. If it can take more connections, it raises its relative weight, as it takes on more load, it lowers its relative weight.
This worked well on a fleet of a couple thousand servers handling thousands of requests per second each. I can't recall why, but lower volume and few servers caused uneven load, and we went back to least connections for those clusters.
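Roughly the shape of that scheme, as a sketch under my own assumptions rather than the actual implementation: each backend periodically adjusts an advertised weight based on how hot it is, and the balancer makes a weighted random pick.

    import random

    class Backend:
        def __init__(self, name):
            self.name = name
            self.weight = 1.0  # relative weight, adjusted by the backend itself

        def report(self, utilization):
            # "Thermostat" feedback: as the node heats up (utilization -> 1.0),
            # it lowers its advertised weight; as it cools down, it raises it.
            self.weight = max(0.05, 1.0 - utilization)

    def pick(backends):
        # Weighted random choice proportional to each backend's advertised weight.
        return random.choices(backends, weights=[b.weight for b in backends], k=1)[0]

    pool = [Backend("a"), Backend("b"), Backend("c")]
    pool[0].report(0.9)   # hot node, takes little new traffic
    pool[1].report(0.3)
    pool[2].report(0.1)   # coolest node, takes the most
    print(pick(pool).name)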
> The backend node reports how hot it is to the LB and gets a relative weight. If it can take more connections, it raises its relative weight, as it takes on more load, it lowers its relative weight.
Some load balancers support readiness checks. When a backend is over-worked it can fail readiness to excuse itself from more work. We implement one variation of this by looking at load metrics (cpu, disk, mem, net) [0]. readiness is binary (ok / !ok) and not weighted (ex: no. of conns). The backend, however, is free to load shed (reject already sent requests) as it sees fit.
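As a rough illustration of that pattern (the endpoint name and threshold are assumptions, not the actual setup): the backend exposes a readiness endpoint that starts failing when local load crosses a cut-off, so the LB stops handing it new work until it cools down.

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer

    CPU_THRESHOLD = 0.85  # hypothetical cut-off per core

    class Readiness(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/ready":
                self.send_response(404)
                self.end_headers()
                return
            # Binary ok / !ok: 1-minute load average per core (Unix only).
            load_per_core = os.getloadavg()[0] / os.cpu_count()
            if load_per_core < CPU_THRESHOLD:
                self.send_response(200)   # keep sending traffic
            else:
                self.send_response(503)   # backend excuses itself for a while
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8081), Readiness).serve_forever()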
Load averages can be very useful, but it's problematic to combine them with something like readiness checks. I must point out the corresponding comment in loadavg.c which is required reading:
The parent's concept of reporting "hotness" is really powerful. It lets a central authority (the load balancer in this scenario) make an informed decision about where to send traffic: the coolest backend. In the case where all systems are actually quite hot, it probably makes sense in most scenarios to continue sending traffic as best-effort.
In the case of readiness checks, nacking a check removes the backend from the pool of eligible new-traffic recipients. At least this is true for systems I've encountered. The solution you propose is problematic as it's possible for all systems to decline traffic resulting in no available backends -- rarely a desirable state -- which the load balancer usually can't do much about.
A good system will allow you to decouple the signals (loadavg, connections, latency, CPU, etc.) and the health decision (readiness checks) and the interpretation of that decision (load balancer policy) to provide the best of both these worlds.
Now, there are many scenarios where simply excusing yourself works fine -- processing expensive batch jobs, for example -- so this can work nicely, but for typical production traffic scenarios I'd advise against this approach.
presumably you want your service to be up. If you can smooth the load out to other nodes, your service stays up. If you have a bad strategy, you start losing capacity as nodes get removed.
Let's say each node can handle 5k rps and you have 10 nodes. You can handle 50k rps. If you are receiving 40k rps, a good strategy will put each node at 80% capacity. A bad strategy will knock a node out, reducing your total capacity, putting extra pressure on the rest of the system, causing more failures, and more pressure. This is a cascading failure (often loosely called a thundering herd).
At some point, your only option is load shedding. But with a bad LB strategy, you start load shedding much much earlier than you should. This is a bad experience for customers that is avoidable with good LB strategies.
Fair. Our solution is to make sure a biased load balancer (like in TFA) isn't sending the workload towards a select few machines while others may not be as overworked, in as simple a way as possible.
We run load balancers in a fail open mode. As in, if every backend is excusing itself, then none are excused.
But as you point out, load balancing is a hairy beast unto itself.
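The fail-open rule itself is small enough to show directly (a sketch, not their actual LB code): readiness failures are only honoured while at least one backend is still passing.

    def eligible_backends(backends, ready):
        """backends: list of names; ready: name -> bool from readiness checks."""
        passing = [b for b in backends if ready.get(b, False)]
        # Fail open: if every backend is excusing itself, ignore readiness
        # and keep the whole pool eligible rather than serving nothing.
        return passing if passing else list(backends)

    print(eligible_backends(["a", "b", "c"], {"a": False, "b": False, "c": False}))
    # -> ['a', 'b', 'c']  (if all are excused, none are excused)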
in our case, the idea was not to shed all load (bool on/off), but to lower load slowly and to smooth traffic. Another confounding factor was that there were multiple services running on a single node and any of the services may be the one needing to push back a bit. We were able to have each service control the load onto the given node this way -- and if all nodes reported they were overloaded, they would all end up with the same weight meaning we kept processing as much as possible. If we just pulled them from the lb, you could crash the entire cluster during a period of overload.
> lower volume and few servers caused uneven load, and we went back to least connections for those clusters.
The more traffic you’re handling, the more the central limit theorem applies as you are summing the behavior of lots of random events drawn from various random distributions, and the more the system behavior regresses to the mean.
I think of it as analogous to how when an engine is idling it tends to pop and judder, but as you rev it up the smoother it sounds until at sustained high revs it can make almost a pure note. All the variations are smoothed out and the system runs much more evenly and predictably, but when it drops to idle speed the subtle variation in the fuel air mix and the pressure in the exhaust and inlet every stroke makes it chug and sputter chaotically.
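A toy simulation makes the smoothing visible (the numbers and cost distribution are arbitrary): randomly assign requests with heavy-tailed costs to servers, and the spread of per-server load relative to the mean shrinks as volume grows.

    import random, statistics

    def relative_spread(num_requests, num_servers=10):
        # Random assignment of requests with heavy-ish tailed cost.
        load = [0.0] * num_servers
        for _ in range(num_requests):
            load[random.randrange(num_servers)] += random.paretovariate(2.5)
        return (max(load) - min(load)) / statistics.mean(load)

    for n in (100, 1_000, 10_000, 100_000):
        print(n, round(relative_spread(n), 3))
    # The relative spread drops as volume grows: the more random events you sum,
    # the closer every server's total sits to the mean.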
To be clear: by "hot", you literally mean physical temperature? That's rather ingenious, I like it! I wonder if you could also measure power consumption.
As an aside, I worked on path tracing[1] long ago, and there were various schemes to avoid the "clumpiness" you'd get by using purely random numbers.
This included regular stratified sampling[2] as well as using quasi-random low-discrepancy sequences[3] like the Sobol sequence[4], which would "space out" the random samples.
Probably not terribly relevant for load balancing, but I did find the quasi-random stuff fascinating. How to make something "random enough" without being "too random".
edit: Though now I got a shower thought moment, could stratified sampling be used as a variant of the "two choices", even if each load-balancer worked independently? That is, each load-balancer stratifies the resources, and keeps a separate index to the current bin. Then for each request pick a random/suitable resource from the current bin and increment the bin index, wrapping when needed. Hmm...
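For what it's worth, one way that shower thought could look in code (everything here is my own guess at the details, nothing from the article): partition the backends into bins, cycle through the bins, and pick randomly within the current bin, so consecutive picks can't clump onto the same few hosts.

    import random

    class StratifiedPicker:
        """Hypothetical stratified picker: cycles through fixed bins of backends,
        choosing randomly within the current bin."""

        def __init__(self, backends, num_bins):
            random.shuffle(backends)
            self.bins = [backends[i::num_bins] for i in range(num_bins)]
            self.current = 0

        def pick(self):
            chosen = random.choice(self.bins[self.current])
            self.current = (self.current + 1) % len(self.bins)  # advance, wrap
            return chosen

    picker = StratifiedPicker([f"host-{i}" for i in range(12)], num_bins=4)
    print([picker.pick() for _ in range(8)])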
If every request took exactly the same amount of cpu time and wall time and same amount of ram and other resources, then random load-balancing is just fine for a sufficiently large number of machines. But therein lies the problem. Every request's cost is different. And they have noisy neighbor consequences to other requests on the same machine etc. Hence, any naive load-balancing will lead to hotspots. If every request/response contained a X-Health: header that is understood by the load-balancer to decide how much headroom is available and if every request also had deadlines, then load-balancer can do interesting things with it to avoid hotspots, drive near-uniform utilization, avoid bad tail-latencies and provide near-ideal back pressure to upstream.
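There is no standard header like that as far as I know, so the header name, the 0-100 headroom scale, and the smoothing below are all made up, but the feedback loop could look roughly like this: backends attach a headroom score to every response, and the balancer folds it into its routing weights.

    import random

    class FeedbackBalancer:
        """Hypothetical X-Health feedback loop; not an existing standard."""

        def __init__(self, backends, alpha=0.2):
            self.headroom = {b: 100.0 for b in backends}  # optimistic start
            self.alpha = alpha

        def on_response(self, backend, response_headers):
            reported = float(response_headers.get("X-Health", 50))
            # Smooth with an EWMA so one noisy sample doesn't swing routing.
            old = self.headroom[backend]
            self.headroom[backend] = (1 - self.alpha) * old + self.alpha * reported

        def pick(self):
            backends = list(self.headroom)
            weights = [max(self.headroom[b], 1.0) for b in backends]
            return random.choices(backends, weights=weights, k=1)[0]

    lb = FeedbackBalancer(["a", "b"])
    lb.on_response("a", {"X-Health": "10"})   # "a" reports almost no headroom
    print(lb.pick())                          # mostly "b" from now on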
> If every request took exactly the same amount of cpu time and wall time and same amount of ram and other resources, then random load-balancing is just fine for a sufficiently large number of machines.
The article points out random load balancing scales sub-linearly with increase in capacity?
> If we distribute a set of items randomly across a set of servers (e.g. by hashing, or by randomly selecting a server), the average number of items on each server is num_items / num_servers. It is easy to assume this means each server has close to the same number of items. However, since we are selecting servers at random, they will have different numbers of items, and the imbalance can be important.
> ...
> This is a classic balls in bins problem... the summary is that the imbalance depends on the expected number of items per server (that is, num_items / num_servers). This means workload is more balanced with fewer servers, or with more items. This means that dividing a set of items over more servers makes the distribution more unfair, which is a reason we can get worse than linear scaling of a distributed system.
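The quoted effect is easy to reproduce with a toy simulation (mine, not the article's code): spread the same number of items over more and more servers and watch the busiest server drift further above the mean.

    import random
    from collections import Counter

    def imbalance(num_items, num_servers, trials=100):
        # Average ratio of the busiest server's item count to the mean count.
        ratios = []
        for _ in range(trials):
            counts = Counter(random.randrange(num_servers) for _ in range(num_items))
            ratios.append(max(counts.values()) / (num_items / num_servers))
        return sum(ratios) / trials

    # Same 10,000 items over more and more servers: the imbalance gets worse.
    for servers in (10, 100, 1000):
        print(servers, round(imbalance(10_000, servers), 2))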
The client doesn’t know anything about your request implementation, so sending that via a header doesn’t make sense, unless it’s added by some middleware that does.
The load balancer could keep track of how long requests for a specific path tend to take, and what servers are currently handling and use that information to be smarter about routing requests. I wouldn’t be surprised if that already exists in things like haproxy or nginx.
There are issues where people stuff data into the URL, like ids or keywords for SEO. But that's possible to work around.
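I'm not aware of a stock haproxy or nginx feature that tracks this per path, so the sketch below is purely illustrative: keep an exponentially weighted moving average of response time per (path, server) and weight routing toward servers that have recently been fast for that path.

    import random
    from collections import defaultdict

    class PathAwareBalancer:
        """Illustrative only: EWMA of response time per (path, server)."""

        def __init__(self, servers, alpha=0.3):
            self.servers = servers
            self.alpha = alpha
            self.ewma = defaultdict(lambda: 0.1)  # seconds; optimistic default

        def record(self, path, server, seconds):
            key = (path, server)
            self.ewma[key] = (1 - self.alpha) * self.ewma[key] + self.alpha * seconds

        def pick(self, path):
            # Weight inversely to the expected latency for this path.
            weights = [1.0 / self.ewma[(path, s)] for s in self.servers]
            return random.choices(self.servers, weights=weights, k=1)[0]

    lb = PathAwareBalancer(["s1", "s2"])
    lb.record("/report", "s1", 2.0)   # s1 has been slow for /report
    print(lb.pick("/report"))         # usually s2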
There are other techniques that are optimal like Parallel Depth First scheduling from Blelloch.
Since then there has been a lot of work for distributed work-stealing.
Boggles my mind that at "cloud-scale" people still use round robin and assume uniform workloads and that at the server level there is uniform load as well.
I wish load balancers would default to one of these optimal algorithms. When I call a "sort" function from the standard library, 99% of the time I don't have to go read papers to pick which sorting algorithm to use. The default algorithm is good for most cases. EDIT: probably because many of them are adaptive algorithms.
Seems many load balancer providers missed that memo.
The issue is that the spread of actual cost of requests is usually extreme and opaque, and system load/capacity information is often very different between nodes.
And there is no standard way I’m aware of to usefully report back or communicate this data to the LB.
Probably because this would require both the load balancer and the webserver to support a custom protocol. My first guess is that some service meshes have an API that you could use for this.
All things which have too few bins under random are at risk of asymmetry. Some LB situations take some approximation to load/responsiveness, and back off the counts as a function of observed pace of response. TCP RTT will do, size of the pending Queue for async responses too, time in queue, there's lots of measures. Your liveness check to the backends probably provides this simply by how responsive to heartbeat it is.
Given tasks do not run in identical time, there's going to be variance come what may.
I would suggest a pool of 3 should tend to be more stable. But your platform people may be recommending that 3 and below need to be sized to work as 1, and only 4 and above can start to assume a minimum cohort (2?) exists.
Sharding is not LB. It's just a hash approximation to grouping things. Maybe you're telling a story about using a random() to emulate how far out of 50/50 a 2 part sharding story can get?
Routers doing LB may well do simple prefix triage. You may be stuck on one side of the LB because of who you are, and nothing's going to change it.
(I don't do this for a living, wiser people may tell you I'm wrong)
It sounds like part of the issue is that people think random == evenly (but unpredictably) distributed on a specific (and relatively short) timescale.
For a truly random source, the odds of getting the same number in the next roll are identical to getting it on any prior roll. For example, flipping a coin twice is as likely to get you heads then heads as it is to get you heads then tails.
Eventually things even out. But if it's truly random, when that happens should also be completely unpredictable. It might end up being after the service is deprecated, for instance. You might end up never getting a specific 'desired' answer. For instance, if the test is flipping a coin and we do it 10 times, there is a small chance we never get tails and get 10 heads in a row from a truly random source.
That is working as expected.
Usually (at least with load balancing) they instead want a smoother distribution, where it’s less likely to get the same answers they just got, when they just got them.
So for example, they’d prefer that if they just got a heads they’re more likely to get a tails next (but it won’t always happen).
Notably, this isn’t really random. It’s also a really hard thing to do in practice without also getting all nodes in lockstep, being predictable to an outside attacker, or having internode communication. Maybe impossible, I don’t remember.
Typically network LB will look at the entire layer 4 tuple to overcome things like NAT clustering traffic from many users into one place. Not to say IP lb doesn't happen, but with NAT being de-facto almost nobody does it IME.
Worth noting the difference between an AWS Application Load Balancer (ALB) that is HTTP request aware, and Network Load Balancer (NLB) which is not, when load balancing HTTP traffic.
AWS ALB (and others I'm sure) can balance by "Least outstanding requests" [1] which means a server with the least in-flight HTTP requests (not keep-alive network connections!) to an app server will be chosen.
If the balancer operates on the network level (eg NLB) and it maintains keep-alive connections to servers, the balancing won't be as even from a HTTP request perspective because a keep-alive connection may or may not be processing a request right now and so the request will be routed based on number of TCP connections to app servers, not current HTTP request activity to them.
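The distinction comes down to what you count. A minimal sketch (not AWS code): in-flight requests are incremented on dispatch and decremented on completion, independent of how many keep-alive TCP connections happen to be open.

    import random

    class LeastOutstandingRequests:
        """Counts in-flight HTTP requests, not open keep-alive connections;
        an idle keep-alive connection contributes nothing here."""

        def __init__(self, servers):
            self.in_flight = {s: 0 for s in servers}

        def acquire(self):
            lowest = min(self.in_flight.values())
            tied = [s for s, n in self.in_flight.items() if n == lowest]
            server = random.choice(tied)      # random tie-break
            self.in_flight[server] += 1       # request dispatched
            return server

        def release(self, server):
            self.in_flight[server] -= 1       # response completed

    lb = LeastOutstandingRequests(["app-1", "app-2"])
    s = lb.acquire()
    # ... send the HTTP request over whatever connection is handy ...
    lb.release(s)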
It's surprising that "least requests" is still so much less commonly used than random or round-robin. It's a simple metric and yet it handles work-sharing remarkably well, without any need for servers to communicate their load back to the LB. (As you correctly say, the LB needs to be aware of idle/keep-alive connections in order to balance properly.)
[Also, to be pedantic, it should really be called "fewest requests" - but I've never seen a product call it that!]
The more unrealistic assumption is that every work item or request takes uniform processing. Things really blow up when you have a pile of p99 slow requests.
This. Load balancing based on count assumes every request is served in equal time and resources.
I think what the article misses highlighting is that this is about stateless load balancing which improves throughput at the LB, at the cost of potentially suboptimal distribution.
The other nit I have is calling hashing random; it's not, and a lot goes into selecting hash algorithms for LB that don't result in mass movement of requests every time a server is added or removed from the pool.
If the author wants stateless balanced load and assumes all requests are equal, why not use an approach like round robin?
Beyond that, you can get into more stateful methods that look at CPU use on servers, or any other metric you like to load balance on the resource you care about, ie memory or CPU vs number of open sockets.
Round Robin can get weird when you have common factors between the number of subrequests for a single request and the number of servers. I’ve seen everything from certain pages reliably blowing up in prod to weird caching bugs that don’t trigger in preproduction because the same box sees all of the requests.
Developers have a nasty habit of assuming a glitch or someone else changed something and if a reload fixes it they don’t note the pattern.
In load balancing they call this persistence. You are describing a case where you should persist to the server you land on.
In this scenario, unless each request is exactly the same you will end up with uneven load as some sessions persist longer than others.
This can still be stateless with use of a cookie for protocols that support it but is not the idealized "round robin and every request takes same time and resources".
The other alternative is a stateless app that is architected for round robin lb.
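For the cookie variant mentioned above, the balancer itself can stay stateless; a sketch with a made-up cookie name: set a cookie naming the chosen backend on the first response and honour it on later requests.

    import random

    SERVERS = ["app-1", "app-2", "app-3"]
    COOKIE = "lb_backend"   # hypothetical cookie name

    def route(request_cookies):
        # The routing decision lives in the client's cookie,
        # not in a session table on the load balancer.
        sticky = request_cookies.get(COOKIE)
        if sticky in SERVERS:            # returning client: honour the cookie
            return sticky, None
        chosen = random.choice(SERVERS)  # new client: pick and pin a backend
        return chosen, f"{COOKIE}={chosen}; Path=/; HttpOnly"

    backend, set_cookie = route({})              # first request
    backend_again, _ = route({COOKIE: backend})  # follow-up sticks to the same box
    print(backend, backend_again, set_cookie)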
Chained microservices exacerbate this, and people give you blank looks when you say it. Then you have to draw a picture, because if that also gets blank looks, then they have no idea what the Central Limit Theorem is.
So, the real problem isn’t noted until towards the end which is that these methods require keeping some state, thus making the requests more expensive.
It’s a good trade off when throughput is low and the impact of lumpy traffic is high. At higher volumes, random allows stateless load balancing.
Use the tool that makes sense for your project. Even at the same system I’ll have some services that operate at low volumes with high latency and others with high volume and low latency.
> DNS should perform round-robin in the order of the IPs in the UDP packet.
That only works if the clients choose the first IP address in the response to connect to. But IPv6, IIUC, specifies that clients instead always use the IPv6 address they have the longest stretch of bits in common with, compared to their own IPv6 address.
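That's the longest-matching-prefix rule from IPv6 default address selection (RFC 6724, if I have the number right). A small illustration of just that one rule, ignoring the other selection rules real resolvers apply first:

    import ipaddress

    def common_prefix_len(a, b):
        # Number of leading bits the two addresses share.
        xor = int(ipaddress.IPv6Address(a)) ^ int(ipaddress.IPv6Address(b))
        return 128 - xor.bit_length()

    def prefer(candidates, source):
        # Pick the candidate sharing the longest prefix with our own address.
        return max(candidates, key=lambda c: common_prefix_len(c, source))

    print(prefer(["2001:db8:1::10", "2001:db8:2::10"], source="2001:db8:1::99"))
    # -> 2001:db8:1::10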
There is an interactive visualization that demonstrates different load balancing techniques, but I cannot for the life of me find it again. Maybe someone here knows which page I'm talking about and could provide a link?