We've used both round robin and least outstanding connections (LOC) in AWS and have run into plenty of really weird phenomena over the years.
In general, LOC is more forgiving for uneven workloads, especially for Ruby services. If a slow request hits, that machine receives less traffic than the others, which gives it some room to recover and burn off its queue.
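To make that concrete, here's a minimal sketch of the general least-outstanding-connections idea (not AWS's actual implementation; hostnames and counts are made up):

    def pick_host(in_flight):
        # Hypothetical least-outstanding-connections pick: route the next
        # request to the host with the fewest requests currently in flight.
        return min(in_flight, key=in_flight.get)

    # A host stuck on a slow request carries a higher in-flight count,
    # so new traffic drifts to the other hosts until its queue burns off.
    in_flight = {"host-a": 5, "host-b": 1, "host-c": 2}
    print(pick_host(in_flight))  # -> host-b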
Where we found weird problems was at the extremes: very low concurrency and exceptionally high concurrency.
Low concurrency meant that one host consistently had 2 connections and another consistently had 3. That's 50% more load on one host. I believe AWS has fixed this and now tie-breakers are resolved randomly, but it was kinda painful to deal with. Our solution was to scale vertically and run fewer hosts. 19 on one host vs 20 on another was less jarring.
The other extreme was 1k+ websocket connections per host. When we autoscaled, it would add a new host. That host would have 0 connections and so the next 1k connections would all go to the new host. For that system, we changed it back to round robin.
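That failure mode falls straight out of the least-connections rule: a freshly added host starts at zero outstanding connections, so it wins every pick until it catches up with the fleet, while round robin just cycles through targets regardless of existing counts. A rough illustration (hypothetical numbers, not the ALB algorithm):

    import itertools

    in_flight = {"old-1": 1000, "old-2": 1000, "new-1": 0}

    # Least outstanding connections: the new host wins every pick until it
    # has accumulated roughly as many connections as the existing hosts.
    print(min(in_flight, key=in_flight.get))   # -> new-1, every time

    # Round robin ignores the counts and cycles through the targets, so the
    # new host only picks up every third new connection.
    rotation = itertools.cycle(in_flight)
    print([next(rotation) for _ in range(3)])  # -> ['old-1', 'old-2', 'new-1']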
> Low concurrency meant that one host consistently had 2 connections and another consistently had 3. That's 50% more load on one host. I believe AWS has fixed this and now tie-breakers are resolved randomly, but it was kinda painful to deal with. Our solution was to scale vertically and run fewer hosts. 19 on one host vs 20 on another was less jarring.
I assume you're saying that this happened over a period much longer than the lifetime of a connection? If all of the connections are super long lived, I'm not really sure what else the load balancer could do, other than forcibly close one and make the client swap to the other server, but naively that doesn't really seem like something I'd want a load balancer to do.
Even more basic, wouldn't that just make the split 3:2 instead of 2:3 (aka, the same thing)? I don't think we can give each server 2.5 connections (though that would be cool!) ;P.
Yeah, I guess what I meant is that forcing one connection to swap back and forth at some interval would make it 1:1 "on average" over a longer period. But it seems presumptuous to think that would perform better: the cost of re-establishing a connection each time, plus communicating whatever "state" is needed to resume from the context that was executing before, doesn't seem like a good tradeoff (not to mention the huge increase in complexity of the client logic needed to handle it).
Alternate the incoming connections from one server to the other. That gives the one with two connections some time to clear its queue while the other queue may be getting longer.
The problem was that it was always the same machine that had the extra 50% traffic. The eventual fix from Amazon was to break ties randomly, so that each machine averaged about 2.1 concurrent requests (assuming 10 machines sharing 21 connections) instead of one machine permanently holding the extra one.
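In code terms, the difference is roughly the contrast below; this is a guess at the shape of the behaviour, not AWS's actual code:

    import random

    def pick_host_deterministic(in_flight):
        # Old behaviour as we observed it: ties always resolve to the same
        # host, so one machine permanently carries the extra connection.
        return min(in_flight, key=in_flight.get)

    def pick_host_random_tiebreak(in_flight):
        # Fixed behaviour: break ties randomly, so the extra connection
        # averages out across the fleet over time.
        fewest = min(in_flight.values())
        return random.choice([h for h, n in in_flight.items() if n == fewest])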
I can't seem to find the article, but I remember some FAANG - FB I think - having an excellent blog post about a very hard-to-track-down bug in their LB where performance fell off a cliff. They eventually tracked it down to an LRU and realised that the least loaded server was actually the least hot, and that performance was substantially better if they ran a small number of servers at capacity rather than a large number of servers at lower utilization.
Funny you mention that. When I worked there I identified a bug in an internal load balancer where new requests were being sent to the most overloaded hosts. It was far worse than if load balancing had been completely random and meant tens of thousands of servers were mostly sitting idle. The few lines of code to fix that bug likely amounted to the biggest financial impact I’ve had on a company in my whole career. Unfortunately, I didn’t have time to quantify the fix in monetary terms because I was in the midst of a reorg and the bug was totally unrelated to my main project. Kind of hilarious in retrospect to think about how such a giant problem went unnoticed for so long.
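That class of bug often comes down to a single inverted comparison, something like the contrast below (purely illustrative; I'm not claiming this is what the actual code looked like):

    def pick_host_buggy(load):
        # Inverted selection: new requests pile onto the busiest host.
        return max(load, key=load.get)

    def pick_host_intended(load):
        # Intended behaviour: route to the least loaded host.
        return min(load, key=load.get)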
By LRU you mean caching/preloading being terrible (and there being a lot of latency) because there wasn’t enough load to allow things to properly preload?