My three-word summary would be "Polling is good", though I don't see the word "polling" mentioned once.


Not really. It is about eliminating modes and variance from hyperscaled systems.

The article summarises itself:

> Both our Network Load Balancer configuration system, and our Route 53 health check system are actually doing many thousands of operations for every "tick" or "cycle" that they iterate. But those operations don't change because the health check statuses did, or because of customer configurations. That's the point. They're like coffee urns, which hold hundreds of cups of coffee at a time no matter how many customers are looking for a cup.

On health checks:

> Even when there are only a few health checks active, the health checkers send a set of results to the aggregators that is sized to the maximum. For example, if only 10 health checks are configured on a particular health checker, it's still constantly sending out a set of (for example) 10,000 results, if that's how many health checks it could ultimately support. The other 9,990 entries are dummies. However, this ensures that the network load, as well as the work the aggregators are doing, won't increase as customers configure more health checks. That's a significant source of variance ... gone.
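
Roughly, the padding trick looks something like this. This is only a sketch of the idea; the Result fields, the 10,000 cap, and buildResultSet are names I made up, not anything from the article:

    package main

    import "fmt"

    // Fixed capacity the checker is sized for, regardless of how many
    // checks are actually configured.
    const maxChecks = 10000

    type Result struct {
        ID      int
        Healthy bool
        Dummy   bool // true for padding entries
    }

    // buildResultSet always returns exactly maxChecks entries, padding with
    // dummies so the aggregators downstream see constant-size input.
    func buildResultSet(active map[int]bool) []Result {
        out := make([]Result, 0, maxChecks)
        for id, healthy := range active {
            out = append(out, Result{ID: id, Healthy: healthy})
        }
        for len(out) < maxChecks {
            out = append(out, Result{Dummy: true})
        }
        return out
    }

    func main() {
        active := map[int]bool{1: true, 2: false} // only 2 real checks configured
        fmt.Println(len(buildResultSet(active)))  // always prints 10000
    }

The payload and the aggregator's work are then bounded by the dummy entries whether 10 or 10,000 checks are configured.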

On updates:

> Every few seconds, the health check aggregators send a fixed-size table of health check statuses to the Route 53 DNS servers. When the DNS servers receive it, they store the table in memory, pretty much as-is. That’s a constant work pattern. Every few seconds, receive a table, store it in memory. Why does Route 53 push the data to the DNS servers... because there are more DNS servers than there are health check aggregators... check out Joe Magerramov’s article on putting the smaller service in control.

> Then, at query time... even if the first answer it tried is healthy and eligible, the server checks the other potential answers anyway. This approach ensures that even if a status changes, the DNS server is still performing the same work that it was before. There's no increase in scan or retrieval time.
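
And the query-time part presumably looks something like this: no early exit, so a status flip never changes how much work a query does. Again just a sketch with made-up names (answer, pickAnswer), not Route 53's actual code:

    package main

    import "fmt"

    type answer struct {
        IP      string
        Healthy bool
    }

    // pickAnswer keeps scanning every candidate even after it has found a
    // healthy one, so the per-query work is identical whether statuses
    // change or not.
    func pickAnswer(candidates []answer) (answer, bool) {
        var chosen answer
        found := false
        for _, a := range candidates {
            // Deliberately no early return here.
            if a.Healthy && !found {
                chosen = a
                found = true
            }
        }
        return chosen, found
    }

    func main() {
        got, ok := pickAnswer([]answer{
            {IP: "192.0.2.1", Healthy: false},
            {IP: "192.0.2.2", Healthy: true},
        })
        fmt.Println(got.IP, ok) // 192.0.2.2 true
    }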

On configuration:

> Rather than generate events, AWS Hyperplane integrates customer changes into a configuration file that's stored in Amazon S3. This happens right when the customer makes the change. Then, rather than respond to a workflow, AWS Hyperplane nodes fetch this configuration from Amazon S3 every few seconds. The AWS Hyperplane nodes then process and load this configuration file... Even if the configuration is completely identical to what it was the last time, the nodes process and load the latest copy anyway. Effectively, the system is always processing and loading the maximum number of configuration changes. Whether one load balancer changed or hundreds, it behaves the same.
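
So the node side of it is basically a dumb loop: fetch, process, load, repeat, with no diffing and no event handling. Sketch only; fetchConfig stands in for the real S3 GET and the 5-second interval is my assumption:

    package main

    import (
        "fmt"
        "time"
    )

    // fetchConfig stands in for an S3 GET of the full configuration object.
    func fetchConfig() []byte {
        return []byte(`{"load_balancers": []}`)
    }

    // applyConfig parses and loads the config unconditionally, even if it is
    // byte-for-byte identical to the previous copy.
    func applyConfig(cfg []byte) {
        fmt.Printf("loaded %d bytes of config\n", len(cfg))
    }

    func main() {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            // Same work every tick, whether one load balancer changed or hundreds.
            applyConfig(fetchConfig())
        }
    }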


Exactly this. It's about eliminating variable modes, and especially avoiding additional work or slower paths during a state change like a failure.

Another classic is debug logs. Turning on additional logging when your system is under stress is a great way to make the impact worse. Instead, always record/emit the telemetry you'll likely need and discard it later if it's unnecessary to retain.
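
e.g. something like a fixed-size ring buffer that always records at full verbosity and lets retention do the discarding (sketch, names are mine):

    package main

    import "fmt"

    // ringBuffer records every event at full verbosity all the time, so
    // nothing about logging changes when the system is under stress.
    // "Discarding" happens at retention time: old entries are overwritten.
    type ringBuffer struct {
        entries []string
        next    int
    }

    func newRingBuffer(n int) *ringBuffer {
        return &ringBuffer{entries: make([]string, n)}
    }

    func (r *ringBuffer) Record(e string) {
        r.entries[r.next] = e
        r.next = (r.next + 1) % len(r.entries)
    }

    func main() {
        buf := newRingBuffer(4)
        for i := 0; i < 10; i++ {
            buf.Record(fmt.Sprintf("event %d", i)) // constant cost per event
        }
        fmt.Println(buf.entries) // only the most recent 4 are retained
    }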


The idea is to make the worst-case scenario the only scenario. Very clever.


It’s an idea but I’m not sure I’m convinced. There’s no evidence provided that doing this actually improves availability / reduces cascading failures / improves time to recover. And AWS doesn’t really have the best track record in terms of uptime…


What do you mean, obviously if the status dashboard is green, nothing is wrong. /s

Who are you going to believe, me, or your own lying eyes?


> Even when there are only a few health checks active, the health checkers send a set of results to the aggregators that is sized to the maximum. For example, if only 10 health checks are configured on a particular health checker, it's still constantly sending out a set of (for example) 10,000 results, if that's how many health checks it could ultimately support. The other 9,990 entries are dummies. However, this ensures that the network load, as well as the work the aggregators are doing, won't increase as customers configure more health checks

Won't the 10 health checks be less work, i.e. fewer CPU cycles, than 10,000 checks? Sure, you can iterate through 10k entries instead of 10, but performing a health check, parsing the result, and so on will still scale with the number of health checks configured, won't it? How could that be a constant amount of work in this example?


If you assume some upper bound of checks per tick/checker (10k in the example), then they're performing that same, large amount of work at a regular interval. The author lightly addressed that when talking about O(1) vs O(C) work.

The problem being addressed is that if work increases over time, when shit hits the fan, and so on, then you're prone to running into cliffs or modes of operation where the system transitions from behaving perfectly to failing completely. If you don't push the limits regularly, you're unlikely to know what those limits might be. The solution here is to do extra work (you're definitely right, 10k checks is more than 10), but in such a way that you're constantly exercising that limit and there's no variance in the work the system does. Running in that configuration for even a short period gives you confidence that your bounds are reasonable, and it protects the rest of the system from an explosion of extra work caused by retries and whatnot.


It's honestly a bit contradictory. It budgets the system to always run at the worst possible case, and then it says:

> At the same time, there are cases where the constant work pattern doesn’t fit quite as well. If you’re running a large website that requires 100 web servers at peak, you could choose to always run 100 web servers. This certainly reduces a source of variance in the system, and is in the spirit of the constant work design pattern, but it’s also wasteful. For web servers, scaling elastically can be a better fit because the savings are large. It’s not unusual to require half as many web servers off peak time as during the peak. Because that scaling happens day in and day out, the overall system can still experience the dynamism regularly enough to shake out problems. The savings can be enjoyed by the customer and the planet.

Well, then why not just do a stress test on these things regularly rather than constantly running it at worst case? For example, once every few hours generate the 10k items but otherwise leave the system running normally. Same effect without running it always in "worst case" mode.


It's not really contradictory; it just says that this pattern has benefits and sometimes they're not worth the cost.

To your other point, that was my first thought as well. The main counterargument that comes to mind: if your systems (and the ones they interface with that you don't control) aren't all subject to the whole load at once (assuming you run more than one service and so care about the effects of their interactions), or if there's some nondeterminism or emergent behaviour at play (so you're not just looking at monotonic components combining), then that kind of periodic test still leaves the system bimodal, which is still undesirable. Perhaps it hits a nice point on the cost/benefit curve, though.



