Hacker News

The most literal conclusion to draw from this story is that MRU connection pools shouldn’t be used for connections that traverse aggregated links.

Not just connections which traverse aggregated links. If you're load-balancing between multiple database replicas -- or between several S3 endpoints -- this sort of MRU connection pool will cause metastable load imbalances on the targets.

What I don't understand is why Facebook didn't simply fix their MRU pool: Switching from "most recent response received" to "most recent request sent" (out of the links which don't have a request already in progress, of course) would have flipped the effect from preferring overloaded links to avoiding them.
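A minimal sketch of the distinction, with entirely hypothetical names (the real pool is not public): the only change is which timestamp orders the idle list. Keying on the time the last request was *sent* means connections on a congested link, which were dispatched early but answer late, sink toward the back instead of floating to the front.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Conn:
    fd: int
    last_request_sent_at: float  # hypothetical bookkeeping field

class MruBySendPool:
    """Toy MRU pool ordered by "most recent request sent" rather than
    "most recent response received"."""
    def __init__(self):
        self._idle = []

    def release(self, conn):
        self._idle.append(conn)

    def acquire(self):
        # Prefer the idle connection whose last request went out most
        # recently; slow links answered late but were sent to early,
        # so they are picked last instead of first.
        best = max(self._idle, key=lambda c: c.last_request_sent_at)
        self._idle.remove(best)
        return best
```

The original "most recent response received" policy is the same code with `release` stamping the current time instead, which is exactly what makes a slow link's just-arrived connection the next one reused.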



I coded the fix. We considered some options that had a bias away from the slow links, but we thought it was safer to avoid bias completely. It was fine in my tests, but I couldn't prove that biasing toward the faster link in a 2-link setup wouldn't set up an oscillation.


I wondered if that was it, but a 1/RTT loading behaviour seems safe enough to me -- after all, that's exactly what TCP does, and you're presumably not seeing any uncontrolled oscillations from that.

I guess it depends how much minimizing the size of your connection pool matters.


Great reasoning. Problem: bias isn't balanced, causing complex failure. Solution A: change the bias. Solution B: remove all bias.

One of these options makes things simpler. I'm always in favor of that.


I am not sure that would work either. If I understand correctly, the root cause of delayed responses is that more requests were sent on that link due to hash collisions, which delays connections on that link. So, wouldn't "most recent request sent" correlate well with "most recent response received"?

EDIT: Also, this seems like a classic load balancing problem: Simply picking the least loaded connection would have been sufficient. The response time on each connection could be computed either explicitly (a running average/standard deviation of RPC finish times) or implicitly by checking the queue backlog at any time (the queue backlog being a first-order statistic).
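The implicit version can be sketched in a few lines; this is a hypothetical least-outstanding-requests balancer, not anyone's actual code, using the in-flight count as the backlog statistic:

```python
class LeastLoadedPool:
    """Toy balancer: pick the connection with the fewest requests
    currently in flight."""
    def __init__(self, conns):
        self.in_flight = {c: 0 for c in conns}

    def acquire(self):
        # Smallest backlog wins; a congested link's connections hold
        # their requests longer, so they naturally get picked less.
        conn = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[conn] += 1
        return conn

    def release(self, conn):
        self.in_flight[conn] -= 1
```

Note this still biases away from slow links, which is exactly the oscillation concern the fix's author raises above.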


But the cache is oblivious to which links the TCP connections are going over (well, except via latency & throughput as the post showed). If I understand what you propose, they need to add more bookkeeping to track what the connection did, instead of just doing pool.insert(conn_fd, time.now()). It sounds like the fix was just swapping the side you evict from, which could have been as little as a one line change.
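A toy illustration of that "swap the side you evict from" flavor of fix, assuming a deque-backed idle list (hypothetical code, not Facebook's):

```python
from collections import deque

idle = deque()

def release(conn):
    idle.append(conn)        # returned connections go on the right

def acquire_mru():
    return idle.pop()        # LIFO: reuse the most recently returned (old policy)

def acquire_fifo():
    return idle.popleft()    # FIFO: cycle through every connection (the fix)
```

No per-connection bookkeeping is needed; the pool stays oblivious to which link a connection rides on.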

Without a bit more information I think I am just speculating and don't really know. Their solution doesn't really taste bad to me though.


Just guessing, but their MRU probably worked by pushing an incoming free connection onto the top of a LIFO stack, while the "most recent request sent" strategy would require tracking request timing and inserting incoming free connections into the middle of the stack, complicating things.

With their solution, they just had to convert the LIFO stack into a FIFO queue, keeping it simple.


stack implies LIFO and queue implies FIFO. Otherwise they wouldn't have different names. :)


well, "FIFO stack" still exists as a synonym for queue, but you're right, "stack" and "queue" would have made my point well enough :) anyway, looking at newer answers it seems that topic was not a concern for their solution...


That was also my first idea about solving it. No idea why they didn't take this route.


I want to know why they're violating the end-to-end principle in their switches.



