The most literal conclusion to draw from this story is that MRU connection pools shouldn’t be used for connections that traverse aggregated links.
Not just connections which traverse aggregated links. If you're load-balancing between multiple database replicas -- or between several S3 endpoints -- this sort of MRU connection pool will cause metastable load imbalances on the targets.
What I don't understand is why Facebook didn't simply fix their MRU pool: Switching from "most recent response received" to "most recent request sent" (out of the links which don't have a request already in progress, of course) would have flipped the effect from preferring overloaded links to avoiding them.
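For concreteness, a minimal Python sketch of that distinction (the "Conn" and "MRUPool" names are made up for illustration, not Facebook's actual code) -- the only difference between the two policies is which timestamp the pool sorts its idle connections by:

    import time
    from dataclasses import dataclass

    @dataclass
    class Conn:
        fd: int
        last_request_sent: float = 0.0
        last_response_recv: float = 0.0

    class MRUPool:
        # Toy idle-connection pool; `key` picks which timestamp ranks connections.
        def __init__(self, key="last_response_recv"):
            self.key = key   # original policy; "last_request_sent" is the proposed one
            self.idle = []

        def release(self, conn):
            conn.last_response_recv = time.monotonic()   # a response just completed
            self.idle.append(conn)

        def acquire(self):
            # MRU: hand out the idle connection with the newest timestamp of interest.
            self.idle.sort(key=lambda c: getattr(c, self.key))
            conn = self.idle.pop()                       # newest is last after the sort
            conn.last_request_sent = time.monotonic()    # a request is about to go out
            return conn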
I coded the fix. We considered some options that had a bias away from the slow links, but we thought it was safer to avoid bias completely. It was fine in my tests, but I couldn't prove that biasing toward the faster link in a 2 link setup wouldn't set up an oscillation.
I wondered if that was it, but a 1/RTT loading behaviour seems safe enough to me -- after all, that's exactly what TCP does, and you're presumably not seeing any uncontrolled oscillations from that.
I guess it depends how much minimizing the size of your connection pool matters.
I am not sure that would work either. If I understand correctly, the root cause of delayed responses is because more requests were sent on that link due to hash collisions, which delays connections on that link. So, wouldn't "most recent request sent" correlate well with "most recent response received"?
EDIT: Also, this seems like a classic load balancing problem: simply picking the least loaded connection would have been sufficient. The response time on each connection could be computed either explicitly (a running average/standard deviation of RPC finish times) or implicitly by checking the queue backlog at any time (the queue backlog being a first-order statistic).
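A minimal sketch of the implicit version, assuming the pool can count requests in flight per connection (the "LeastLoadedPool" name and structure are hypothetical):

    class LeastLoadedPool:
        # Pick the connection with the smallest backlog of requests in flight.
        def __init__(self, conns):
            self.conns = list(conns)
            self.in_flight = [0] * len(self.conns)   # queue backlog per connection

        def acquire(self):
            i = min(range(len(self.conns)), key=self.in_flight.__getitem__)
            self.in_flight[i] += 1
            return i, self.conns[i]

        def release(self, i):
            self.in_flight[i] -= 1

The explicit version would replace the in-flight counter with a running average of observed RPC latencies per connection and pick the minimum of that instead.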
But the cache is oblivious to which links the TCP connections are going over (well, except via latency & throughput as the post showed). If I understand what you propose, they need to add more bookkeeping to track what the connection did, instead of just doing pool.insert(conn_fd, time.now()). It sounds like the fix was just swapping the side you evict from, which could have been as little as a one line change.
Without a bit more information I think I am just speculating and don't really know. Their solution doesn't really taste bad to me though.
Just guessing, but I'd say their MRU worked by simply pushing an incoming free connection onto the top of a LIFO stack, while the "most recent request sent" strategy would have required keeping track of request timing and inserting incoming free connections into the middle of the stack, complicating things.
With their solution they just had to convert the LIFO stack into a FIFO queue, keeping it simple.
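If the pool really is just a stack of idle connections, the change is about as small as it gets; a hypothetical sketch using collections.deque, where only the pop side differs:

    from collections import deque

    idle = deque()             # idle connections, newest appended on the right

    def release(conn):
        idle.append(conn)      # identical in both schemes

    def acquire_lifo():
        return idle.pop()      # MRU: reuse the most recently released connection

    def acquire_fifo():
        return idle.popleft()  # FIFO: cycle through all idle connections evenly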
well, FIFO stack still exists as a synonym for queue, but you're right, stack and queue would have made my point well enough :)
anyway, looking at newer answers it seems that topic was not of concern for their solution...
That's exactly the thinking which led me to the first published cryptographic side channel attack against hyperthreading: Intel's optimization manual had a comment about being careful with stack alignment to avoid poor cache performance, and I thought "what if slow could instead be maliciously slow?"
This is definitely one of the most powerful debugging techniques for hard problems in my toolkit, though I phrase it as "If I wanted to produce this problem on purpose, how would I go about doing it?", since that is a general debugging technique and their phrasing is specific to communicating actors.
This results in me exclaiming every 4-6 months or so that I wouldn't know how to create this particular bug on purpose if I wanted to, which I suppose doesn't make much sense as a complaint about a bug until you start thinking this way. Anyhow, it isn't perfect and you will sometimes be defeated by the sheer perversity of bugs and their behavior. (I also find these are the tiny ones, like OR instead of XOR or something equally simple and at times even one-character, that produce mind-blowing behavior off that one error. The architectural bugs tend to give way to this analysis much more readily.) But it's a useful tool much of the time.
This was a great investigation of what must surely have been a complicated, frustrating, and expensive problem. Awesome writeup.
I love reading about post-mortems like this, even if they're unlikely to happen at my startup, because the problem-solving techniques that get displayed tend to generalize to things of almost any size.
Many years ago I did programming on a telephony application which communicated over the old SS7 network. There, as a layer 4 protocol entity, you can send a "link selection" key in the protocol messages. It's just a 4 bit number (or 8 if it's ANSI SS7 instead of ITU, iirc) per message, and the switches map this number to a physical downstream link, often in a configurable fashion.
This is pretty neat, as the application can default to a round robin distribution and dynamically weight the link selection key based on detected congestion/overload to shift the load to other links -- since then I've often wished the TCP/IP world offered something similar.
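Roughly what that looks like at the application layer (the weights and thresholds below are made up): keep a weight per link selection key, sample keys in proportion to their weights, and cut a key's weight when its downstream link reports congestion.

    import random

    NUM_SLS = 16                  # 4-bit link selection key (ITU)
    weights = [1.0] * NUM_SLS     # relative share of traffic per key

    def pick_sls():
        # Weighted choice over the possible link selection keys.
        return random.choices(range(NUM_SLS), weights=weights, k=1)[0]

    def report_congestion(sls, congested):
        # Shift load away from keys whose downstream link looks congested.
        if congested:
            weights[sls] = max(0.05, weights[sls] * 0.5)
        else:
            weights[sls] = min(1.0, weights[sls] * 1.1)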
This keeps all the packets of a TCP stream on the same link, avoiding out-of-order delivery.
Not a network engineer, but I had thought that TCP handles ordering itself? Packets can travel completely different routes from origin to destination, so one can't expect them to arrive in order; the destination's TCP stack can deal with it.
Again, IANANE, but ISTM that those who first designed the network introduced an untested complication with their LIFO setup. Nearly anything would have worked, including not pooling at all. They just chose something weird for the hell of it. Later FB needed more performance out of the system, and this harmful complication was hidden from sight.
TCP reorders segments just fine but it treats disordering as a congestion signal and slows down. That's why link aggregation frequently uses the technique of pinning flows to a single link- to minimize disordering so as to avoid triggering TCP congestion control.
In Linux, you can change the sensitivity to disordering by writing to /proc/sys/net/ipv4/tcp_reordering.
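For anyone unfamiliar with the pinning trick: it's typically just a hash of the flow's 5-tuple modulo the number of member links, so every packet of a given stream takes the same member. A rough illustration, not how any particular switch implements it:

    import hashlib

    def pick_member_link(src_ip, dst_ip, proto, src_port, dst_port, num_links):
        # Hash the 5-tuple so all packets of one flow use the same member link.
        key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
        digest = hashlib.sha1(key).digest()
        return int.from_bytes(digest[:4], "big") % num_links

    # Example: every packet of this flow lands on the same member of a 4-link bundle.
    print(pick_member_link("10.0.0.1", "10.0.0.2", "tcp", 49152, 3306, 4))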
When the pool was originally coded, MySQL had one thread per open connection. That made the auto-sizing MRU pool a pretty big win, because it kept the connection counts as low as possible while using the warmest database threads. MySQL has matured since then, so this optimization is no longer important.
Out-of-order delivery causes TCP to shrink the congestion windows, which cuts throughput. Connection pools help here (in addition to their reduction in setup and teardown work), because they let the windows widen and stay open. We disable tcp_slow_start_after_idle to take advantage of this.
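For reference, that knob is an ordinary Linux sysctl; a small sketch of checking and disabling it (system-wide, needs root, equivalent to sysctl -w net.ipv4.tcp_slow_start_after_idle=0):

    from pathlib import Path

    KNOB = Path("/proc/sys/net/ipv4/tcp_slow_start_after_idle")

    print("current:", KNOB.read_text().strip())  # "1" means shrink cwnd after an idle period
    KNOB.write_text("0\n")                       # keep the congestion window open across idle gaps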
Yes, TCP will reorder if necessary, but that will introduce some delay and overhead at the host. The missing sequence number (the out-of-order frame) will cause queuing as TCP waits for it to arrive or requests a retransmit, so keeping streams in order in the network is more efficient.
" At a meta-level, the next time you are debugging emergent behavior, you might try thinking of the components as agents colluding via covert channels."
An insightful way of looking at a problem when the cause is not obvious.
Broken TCP window scaling? The connection should still work, with the sender backing off, right? If you mean busted PMTUD, yeah, that's awesome when the handshake and GET work, but you can't get data back.
We saw busted PMTUD on a customer's network trying to get to us via a misconfigured network link. It's a bad problem to debug. None of us knew what PMTUD was, or that TCP can require ICMP to work, until that day.
It took two years to figure this out? When I read the first couple of paragraphs my first thought was "the low-latency links are getting dropped from the eligibility pool somehow" and it turns out that was the problem.
I think my intuition there stems from having a lot of full-stack responsibility (so the idea that it's someone else's problem was never a luxury I could afford, since it was always my problem) and having cleaned up a number of other people's very large spaghetti-code disasters.
The other possibility is that I have no special intuition into the problem (this is far more likely) but did have a fresh set of eyes, while all the people who were working on the problem were so intimately familiar with it that they couldn't see the forest for the trees.
One of the reasons this took a long time to figure out is that it was a failure amplifier, so there was always a more typical network problem preceding it. Network failures in a data center cause lots of changes to the packets, because of retries, failover, and automatic load balancing, so there were a lot of trees to look at.
That makes a lot more sense. It would have been nice to include some of the troubleshooting process so people can learn from that too. Thanks for sharing!
"It took two years to figure this out? When I read the first couple of paragraphs my first thought was "the low-latency links are getting dropped from the eligibility pool somehow" and it turns out that was the problem."
Well, sure, it's obvious when you're reading a prepared text that is carefully leading you up to that conclusion and has removed all extraneous information. Almost all bugs are shallow once you already know the answer.
The stability of it is the giveaway. It's not a metastable failure state, it's a fully stable failure state. There is positive feedback: increased latency on a link begets further link usage, which causes more latency, and so on.
Had there been some kind of negative feedback (in the electrical engineering sense of the word) and this still happened somehow it might be a lot more difficult to track down. But introducing negative feedback is what they did to solve the problem, so perhaps it wouldn't have still happened.
Metastability comes in when you think about the problem that originally triggered congestion. A normal network overload caused by a bulk transfer, for example, can be fixed by canceling the transfer. Once you add the feedback loop, however, you need to both cancel the transfer and remove lots of other load. I use the term metastable because although it is fully stable (that's the problem!) it isn't the "ground state". Another way to say it is that the metastable states are only local attractors.
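The lock-in is easy to reproduce in a toy model (crude numbers, nothing like the real system): split a pool of connections across a fast and a slow link, send bursts of parallel requests from the top of an MRU stack, and push connections back in completion order. The slow link's connections finish last, land on top, and end up carrying almost all of the traffic.

    import random

    random.seed(0)
    LINKS = {"fast": 0.010, "slow": 0.050}        # nominal response time in seconds
    pool = [("fast", i) for i in range(8)] + [("slow", i) for i in range(8)]
    random.shuffle(pool)                          # start with no bias at all

    def run_round(burst):
        used, rest = pool[:burst], pool[burst:]
        # Completion order: the fast link finishes first, so slow-link connections
        # are released last and land on top of the MRU stack.
        done = sorted(used, key=lambda c: LINKS[c[0]] * random.uniform(0.9, 1.1))
        pool[:] = list(reversed(done)) + rest     # most recently finished on top

    slow_share = []
    for _ in range(1000):
        burst = random.randint(2, 6)
        slow_share.append(sum(1 for link, _ in pool[:burst] if link == "slow") / burst)
        run_round(burst)

    print(f"share of requests sent over the slow link: {sum(slow_share)/len(slow_share):.0%}")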
This is one of the best engineering write-ups I've ever read. It's also a principle I'll remember, since I work a lot with distributed systems and p2p networks: when debugging, imagine systems as adversaries and frame it as a security problem.
Is the server-side flow hash manipulation dependent on IP tuple manipulation? Is it only enabled for some hosts/devices? A bit disappointed; I was hoping for clever DSCP or MPLS tag manipulation. If they rely on new streams with specific src ports it's a lot less interesting, and less useful for long-lived connections.