The most literal conclusion to draw from this story is that MRU connection pools shouldn’t be used for connections that traverse aggregated links.
Not just connections which traverse aggregated links. If you're load-balancing between multiple database replicas -- or between several S3 endpoints -- this sort of MRU connection pool will cause metastable load imbalances on the targets.
What I don't understand is why Facebook didn't simply fix their MRU pool: Switching from "most recent response received" to "most recent request sent" (out of the links which don't have a request already in progress, of course) would have flipped the effect from preferring overloaded links to avoiding them.
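For concreteness, a minimal Python sketch of that distinction (the "Conn" and "MRUPool" names are made up for illustration, not Facebook's actual code) -- the only difference between the two policies is which timestamp the pool sorts its idle connections by:

    import time
    from dataclasses import dataclass

    @dataclass
    class Conn:
        fd: int
        last_request_sent: float = 0.0
        last_response_recv: float = 0.0

    class MRUPool:
        # Toy idle-connection pool; `key` picks which timestamp ranks connections.
        def __init__(self, key="last_response_recv"):
            self.key = key   # original policy; "last_request_sent" is the proposed one
            self.idle = []

        def release(self, conn):
            conn.last_response_recv = time.monotonic()   # a response just completed
            self.idle.append(conn)

        def acquire(self):
            # MRU: hand out the idle connection with the newest timestamp of interest.
            self.idle.sort(key=lambda c: getattr(c, self.key))
            conn = self.idle.pop()                       # newest is last after the sort
            conn.last_request_sent = time.monotonic()    # a request is about to go out
            return conn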
I coded the fix. We considered some options that had a bias away from the slow links, but we thought it was safer to avoid bias completely. It was fine in my tests, but I couldn't prove that biasing toward the faster link in a 2 link setup wouldn't set up an oscillation.
I wondered if that was it, but a 1/RTT loading behaviour seems safe enough to me -- after all, that's exactly what TCP does, and you're presumably not seeing any uncontrolled oscillations from that.
I guess it depends how much minimizing the size of your connection pool matters.
I am not sure that would work either. If I understand correctly, the root cause of delayed responses is because more requests were sent on that link due to hash collisions, which delays connections on that link. So, wouldn't "most recent request sent" correlate well with "most recent response received"?
EDIT: Also, this seems like a classic load balancing problem: simply picking the least loaded connection would have been sufficient. The response time on each connection could be computed either explicitly (a running average/standard deviation of RPC finish times) or implicitly by checking the queue backlog at any time (the queue backlog being a first-order statistic).
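A minimal sketch of the implicit version, assuming the pool can count requests in flight per connection (the "LeastLoadedPool" name and structure are hypothetical):

    class LeastLoadedPool:
        # Pick the connection with the smallest backlog of requests in flight.
        def __init__(self, conns):
            self.conns = list(conns)
            self.in_flight = [0] * len(self.conns)   # queue backlog per connection

        def acquire(self):
            i = min(range(len(self.conns)), key=self.in_flight.__getitem__)
            self.in_flight[i] += 1
            return i, self.conns[i]

        def release(self, i):
            self.in_flight[i] -= 1

The explicit version would replace the in-flight counter with a running average of observed RPC latencies per connection and pick the minimum of that instead.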
But the cache is oblivious to which links the TCP connections are going over (well, except via latency & throughput as the post showed). If I understand what you propose, they need to add more bookkeeping to track what the connection did, instead of just doing pool.insert(conn_fd, time.now()). It sounds like the fix was just swapping the side you evict from, which could have been as little as a one line change.
Without a bit more information I think I am just speculating and don't really know. Their solution doesn't really taste bad to me though.
Just guessing, but I'd say their MRU worked by simply pushing an incoming free connection onto the top of a LIFO stack, while the "most recent request sent" strategy would have required keeping track of request timing and inserting incoming free connections into the middle of the stack, complicating things.
With their solution they just had to convert the LIFO stack into a FIFO queue, keeping it simple.
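If the pool really is just a stack of idle connections, the change is about as small as it gets; a hypothetical sketch using collections.deque, where only the pop side differs:

    from collections import deque

    idle = deque()             # idle connections, newest appended on the right

    def release(conn):
        idle.append(conn)      # identical in both schemes

    def acquire_lifo():
        return idle.pop()      # MRU: reuse the most recently released connection

    def acquire_fifo():
        return idle.popleft()  # FIFO: cycle through all idle connections evenly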
well, FIFO stack still exists as a synonym for queue, but you're right, stack and queue would have made my point well enough :)
anyway, looking at newer answers it seems that topic was not of concern for their solution...
That's exactly the thinking which led me to the first published cryptographic side channel attack against hyperthreading: Intel's optimization manual had a comment about being careful with stack alignment to avoid poor cache performance, and I thought "what if slow could instead be maliciously slow?"
This is definitely one of the most powerful debugging techniques for hard problems in my toolkit, though I phrase it as "If I wanted to produce this problem on purpose, how would I go about doing it?", since that is a general debugging technique and their phrasing is specific to communicating actors.
This results in me exclaiming every 4-6 months or so that I wouldn't know how to create this particular bug on purpose if I wanted to, which I suppose doesn't make much sense as a complaint about a bug until you start thinking this way. Anyhow, it isn't perfect and you will sometimes be defeated by the sheer perversity of bugs and their behavior. (I also find these are the tiny ones, like OR instead of XOR or something equally simple and at times even one-character, that produce mind-blowing behavior off that one error. The architectural bugs tend to give way to this analysis much more readily.) But it's a useful tool much of the time.
This was a great investigation of what must surely have been a complicated, frustrating, and expensive problem. Awesome writeup.
I love reading about post-mortems like this, even if they're unlikely to happen at my startup, because the problem-solving techniques that get displayed tend to generalize to things of almost any size.
Many years ago I did programming on a telephony application which communicated over the old SS7 network. There, as a layer 4 protocol entity, you can send a "link selection" key in the protocol messages. It's just a 4 bit number (or 8 if it's ANSI SS7 instead of ITU, iirc) per message, and the switches map this number to a physical downstream link, often in a configurable fashion.
This is pretty neat, as the application can default to a round robin distribution and dynamically weight the link selection key based on detected congestion/overload to shift the load to other links -- since then I've often wished the TCP/IP world offered something similar.
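Roughly what that looks like at the application layer (the weights and thresholds below are made up): keep a weight per link selection key, sample keys in proportion to their weights, and cut a key's weight when its downstream link reports congestion.

    import random

    NUM_SLS = 16                  # 4-bit link selection key (ITU)
    weights = [1.0] * NUM_SLS     # relative share of traffic per key

    def pick_sls():
        # Weighted choice over the possible link selection keys.
        return random.choices(range(NUM_SLS), weights=weights, k=1)[0]

    def report_congestion(sls, congested):
        # Shift load away from keys whose downstream link looks congested.
        if congested:
            weights[sls] = max(0.05, weights[sls] * 0.5)
        else:
            weights[sls] = min(1.0, weights[sls] * 1.1)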
This keeps all the packets of a TCP stream on the same link, avoiding out-of-order delivery.
Not a network engineer, but I had thought that TCP handles ordering itself? Packets can travel completely different routes from origin to destination, so one can't expect them to arrive in order; the destination's TCP stack can deal with it.
Again, IANANE, but ISTM that those who first designed the network introduced an untested complication with their LIFO setup. Nearly anything would have worked, including not pooling at all. They just chose something weird for the hell of it. Later FB needed more performance out of the system, and this harmful complication was hidden from sight.
TCP reorders segments just fine but it treats disordering as a congestion signal and slows down. That's why link aggregation frequently uses the technique of pinning flows to a single link- to minimize disordering so as to avoid triggering TCP congestion control.
In Linux, you can change the sensitivity to disordering by writing to /proc/sys/net/ipv4/tcp_reordering.
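For anyone unfamiliar with the pinning trick: it's typically just a hash of the flow's 5-tuple modulo the number of member links, so every packet of a given stream takes the same member. A rough illustration, not how any particular switch implements it:

    import hashlib

    def pick_member_link(src_ip, dst_ip, proto, src_port, dst_port, num_links):
        # Hash the 5-tuple so all packets of one flow use the same member link.
        key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
        digest = hashlib.sha1(key).digest()
        return int.from_bytes(digest[:4], "big") % num_links

    # Example: every packet of this flow lands on the same member of a 4-link bundle.
    print(pick_member_link("10.0.0.1", "10.0.0.2", "tcp", 49152, 3306, 4))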
When the pool was originally coded, MySQL had one thread per open connection. That made the auto-sizing MRU pool a pretty big win, because it kept the connection counts as low as possible while using the warmest database threads. MySQL has matured since then, so this optimization is no longer important.
Out-of-order delivery causes TCP to shrink the congestion windows, which cuts throughput. Connection pools help here (in addition to their reduction in setup and teardown work), because they let the windows widen and stay open. We disable tcp_slow_start_after_idle to take advantage of this.
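For reference, that knob is an ordinary Linux sysctl; a small sketch of checking and disabling it (system-wide, needs root, equivalent to sysctl -w net.ipv4.tcp_slow_start_after_idle=0):

    from pathlib import Path

    KNOB = Path("/proc/sys/net/ipv4/tcp_slow_start_after_idle")

    print("current:", KNOB.read_text().strip())  # "1" means shrink cwnd after an idle period
    KNOB.write_text("0\n")                       # keep the congestion window open across idle gaps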
Yes, TCP will reorder if necessary, but that will introduce some delay and overhead at the host. The missing sequence number (the out-of-order frame) will cause queuing as TCP waits for it to arrive or requests a retransmit, so keeping streams in order in the network is more efficient.
" At a meta-level, the next time you are debugging emergent behavior, you might try thinking of the components as agents colluding via covert channels."
An insightful way of looking at a problem when the cause is not obvious.
Broken TCP window scaling? The connection should still work, with the sender backing off, right? If you mean busted PMTUD, yeah, that's awesome when the handshake and GET work, but you can't get data back.
We saw busted PMTUD on a customer's network trying to get to us via a misconfigured network link. It's a bad problem to debug. None of us knew what PMTUD was, or that TCP can require ICMP to work, until that day.
It took two years to figure this out? When I read the first couple of paragraphs my first thought was "the low-latency links are getting dropped from the eligibility pool somehow" and it turns out that was the problem.
I think my intuition there stems from having a lot of full-stack responsibility (so the idea that it's someone else's problem was never a luxury I could afford, since it was always my problem) and having cleaned up a number of other people's very large spaghetti-code disasters.
The other possibility is that I have no special intuition into the problem (this is far more likely) but did have a fresh set of eyes, while all the people who were working on the problem were so intimately familiar with it that they couldn't see the forest for the trees.
One of the reasons this took a long time to figure out is that it was a failure amplifier, so there was always a more typical network problem preceding it. Network failures in a data center cause lots of changes to the packets, because of retries, failover, and automatic load balancing, so there were a lot of trees to look at.
That makes a lot more sense. It would have been nice to include some of the troubleshooting process so people can learn from that too. Thanks for sharing!
"It took two years to figure this out? When I read the first couple of paragraphs my first thought was "the low-latency links are getting dropped from the eligibility pool somehow" and it turns out that was the problem."
Well, sure, it's obvious when you're reading a prepared text that is carefully leading you up to that conclusion and has removed all extraneous information. Almost all bugs are shallow once you already know the answer.
The stability of it is the giveaway. It's not a metastable failure state, it's a fully stable failure state. There is positive feedback: increased latency on a link begets further link usage, which causes more latency, and so on.
Had there been some kind of negative feedback (in the electrical engineering sense of the word) and this still happened somehow it might be a lot more difficult to track down. But introducing negative feedback is what they did to solve the problem, so perhaps it wouldn't have still happened.
Metastability comes in when you think about the problem that originally triggered congestion. A normal network overload caused by a bulk transfer, for example, can be fixed by canceling the transfer. Once you add the feedback loop, however, you need to both cancel the transfer and remove lots of other load. I use the term metastable because although it is fully stable (that's the problem!) it isn't the "ground state". Another way to say it is that the metastable states are only local attractors.
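The lock-in is easy to reproduce in a toy model (crude numbers, nothing like the real system): split a pool of connections across a fast and a slow link, send bursts of parallel requests from the top of an MRU stack, and push connections back in completion order. The slow link's connections finish last, land on top, and end up carrying almost all of the traffic.

    import random

    random.seed(0)
    LINKS = {"fast": 0.010, "slow": 0.050}        # nominal response time in seconds
    pool = [("fast", i) for i in range(8)] + [("slow", i) for i in range(8)]
    random.shuffle(pool)                          # start with no bias at all

    def run_round(burst):
        used, rest = pool[:burst], pool[burst:]
        # Completion order: the fast link finishes first, so slow-link connections
        # are released last and land on top of the MRU stack.
        done = sorted(used, key=lambda c: LINKS[c[0]] * random.uniform(0.9, 1.1))
        pool[:] = list(reversed(done)) + rest     # most recently finished on top

    slow_share = []
    for _ in range(1000):
        burst = random.randint(2, 6)
        slow_share.append(sum(1 for link, _ in pool[:burst] if link == "slow") / burst)
        run_round(burst)

    print(f"share of requests sent over the slow link: {sum(slow_share)/len(slow_share):.0%}")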
This is one of the best engineering write-ups I've ever read. It's also a principle I'll remember, since I work a lot with distributed systems and p2p networks: when debugging, imagine systems as adversaries and frame it as a security problem.
Is the server-side flow hash manipulation dependent on IP tuple manipulation? Is it only enabled for some hosts/devices? A bit disappointed; I was hoping for clever DSCP or MPLS tag manipulation. If they rely on new streams with specific src ports it's a lot less interesting, and less useful for long-lived connections.