Reliability, constant work, and a good cup of coffee (amazon.com)
122 points by luu on Dec 24, 2022 | 32 comments


I more or less buy into the author’s analogy. I have two work modes: in the first, I'm building things that have to work, so I avoid unnecessary complexity and services I don’t have experience with; the second mode is where I am experimenting and learning.

BTW, regarding the opening ("One of my favorite paintings is 'Nighthawks' by Edward Hopper. A few years ago, I was lucky enough to see it in person at the Art Institute of Chicago."): my wife and I saw the same piece, and that is what hooked me into reading the entire article. I wish I were better at writing introductory ’hooks.’


FWIW I thought the “lede” was far too buried and stopped reading before I got to it… so this style doesn’t work for everyone.


This reminds me a lot of the controller/control loop pattern used in Kubernetes with the additional constraint that the amount of work done in a loop is fixed.

https://kubernetes.io/docs/concepts/architecture/controller/
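
To make the analogy concrete, here's a minimal sketch in Go of a control loop whose work per tick is fixed (the names and states are made up for illustration, not anything from Kubernetes itself): it re-reads and reconciles the full desired state every tick, whether or not anything changed.

    package main

    import (
        "fmt"
        "time"
    )

    // state is a stand-in for whatever a real controller reads from its
    // API server (desired) and from the world (observed).
    type state map[string]string

    func desiredState() state  { return state{"replicas": "3"} }
    func observedState() state { return state{"replicas": "2"} }

    // reconcile always does the full comparison of desired vs. observed,
    // regardless of whether anything changed since the last tick.
    func reconcile(desired, observed state) {
        for key, want := range desired {
            if observed[key] != want {
                fmt.Printf("correcting %s: %q -> %q\n", key, observed[key], want)
                // apply the change here
            }
        }
    }

    func main() {
        ticker := time.NewTicker(2 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            // Constant-work variant of the control loop: the same full
            // reconciliation runs every tick, not only when an event fires.
            reconcile(desiredState(), observedState())
        }
    }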


My three-word summary would be "Polling is good", though I don't see the word "polling" mentioned once.


Not really. It is about eliminating modes and variance from hyperscaled systems.

The article summarises itself:

> Both our Network Load Balancer configuration system, and our Route 53 health check system are actually doing many thousands of operations for every "tick" or "cycle" that they iterate. But those operations don't change because the health check statuses did, or because of customer configurations. That's the point. They're like coffee urns, which hold hundreds of cups of coffee at a time no matter how many customers are looking for a cup.

On health checks:

> Even when there are only a few health checks active, the health checkers send a set of results to the aggregators that is sized to the maximum. For example, if only 10 health checks are configured on a particular health checker, it's still constantly sending out a set of (for example) 10,000 results, if that's how many health checks it could ultimately support. The other 9,990 entries are dummies. However, this ensures that the network load, as well as the work the aggregators are doing, won't increase as customers configure more health checks. That's a significant source of variance ... gone.
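
A rough sketch of that padding idea in Go (illustrative types and sizes, not Route 53's actual wire format): the sender always builds a report of the maximum size and fills the unused slots with dummies.

    package main

    import "fmt"

    const maxChecks = 10000 // capacity the checker is sized for (the 10k in the quote)

    type result struct {
        CheckID string
        Healthy bool
    }

    // buildReport always returns a slice of exactly maxChecks entries.
    // Real results go first; the remainder are dummies, so the bytes on
    // the wire and the work the aggregator does never depend on how many
    // checks are actually configured.
    func buildReport(real []result) []result {
        report := make([]result, maxChecks)
        copy(report, real)
        for i := len(real); i < maxChecks; i++ {
            report[i] = result{CheckID: "dummy", Healthy: true}
        }
        return report
    }

    func main() {
        configured := []result{{"check-1", true}, {"check-2", false}} // only 2 configured
        report := buildReport(configured)
        fmt.Println(len(report)) // always 10000, regardless of configuration
    }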

On updates:

> Every few seconds, the health check aggregators send a fixed-size table of health check statuses to the Route 53 DNS servers. When the DNS servers receive it, they store the table in memory, pretty much as-is. That’s a constant work pattern. Every few seconds, receive a table, store it in memory. Why does Route 53 push the data to the DNS servers... because there are more DNS servers than there are health check aggregators... check out Joe Magerramov’s article on putting the smaller service in control.

> Then, at query time... even if the first answer it tried is healthy and eligible, the server checks the other potential answers anyway. This approach ensures that even if a status changes, the DNS server is still performing the same work that it was before. There's no increase in scan or retrieval time.
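
Sketched as code (hypothetical types in Go, not the real DNS server): the selection loop deliberately avoids an early return, so the scan cost per query stays the same when health statuses flip.

    package main

    import "fmt"

    type answer struct {
        IP      string
        Healthy bool // looked up from the in-memory health status table
    }

    // pickAnswer scans every candidate even after it has found a healthy
    // one, so the per-query work is the same whether all answers are
    // healthy or most have just been marked down.
    func pickAnswer(candidates []answer) (best answer, found bool) {
        for _, a := range candidates {
            if a.Healthy && !found {
                best, found = a, true
            }
            // No early return: the scan always touches every entry.
        }
        return best, found
    }

    func main() {
        candidates := []answer{{"192.0.2.1", false}, {"192.0.2.2", true}, {"192.0.2.3", true}}
        if a, ok := pickAnswer(candidates); ok {
            fmt.Println("answering with", a.IP)
        }
    }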

On configuration:

> Rather than generate events, AWS Hyperplane integrates customer changes into a configuration file that's stored in Amazon S3. This happens right when the customer makes the change. Then, rather than respond to a workflow, AWS Hyperplane nodes fetch this configuration from Amazon S3 every few seconds. The AWS Hyperplane nodes then process and load this configuration file... Even if the configuration is completely identical to what it was the last time, the nodes process and load the latest copy anyway. Effectively, the system is always processing and loading the maximum number of configuration changes. Whether one load balancer changed or hundreds, it behaves the same.
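
Something like this, as a Go sketch; fetchConfig here is a hypothetical stand-in for the real S3 GET, which I'm omitting rather than guessing at SDK calls:

    package main

    import (
        "fmt"
        "time"
    )

    // fetchConfig stands in for "GET the whole configuration object from
    // S3"; the real system would use the S3 API, elided here.
    func fetchConfig() ([]byte, error) {
        return []byte(`{"loadBalancers": []}`), nil
    }

    // applyConfig parses and loads the full file every time, even if its
    // contents are byte-for-byte identical to the previous fetch.
    func applyConfig(cfg []byte) {
        fmt.Printf("loaded %d bytes of configuration\n", len(cfg))
    }

    func main() {
        ticker := time.NewTicker(5 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            cfg, err := fetchConfig()
            if err != nil {
                // Keep the loop's shape even on error: same cadence, no retry storm.
                fmt.Println("fetch failed, will try again next tick:", err)
                continue
            }
            applyConfig(cfg)
        }
    }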


Exactly this. It's about eliminating variable modes, and especially about avoiding additional work or slower paths during a state change like a failure.

Another classic is debug logs. Turning on additional logging when your system is under stress is a great way to increase the impact. Instead, always record/emit the telemetry you'll likely need, and discard it later if it's unnecessary to retain.
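
In code, that advice might look something like this sketch (hypothetical, not any particular logging library): capture everything up front, and make the keep/drop decision at export time.

    package main

    import (
        "fmt"
        "time"
    )

    type event struct {
        When    time.Time
        Level   string
        Message string
    }

    // record always captures the event, whatever its level; the decision
    // to keep or drop it happens later, so logging costs the same whether
    // the system is healthy or on fire.
    func record(buf *[]event, level, msg string) {
        *buf = append(*buf, event{time.Now(), level, msg})
    }

    // export is where the cheap filtering happens, e.g. dropping DEBUG
    // events after the fact instead of toggling them on under stress.
    func export(buf []event, keepDebug bool) {
        for _, e := range buf {
            if e.Level == "DEBUG" && !keepDebug {
                continue
            }
            fmt.Println(e.Level, e.Message)
        }
    }

    func main() {
        var buf []event
        record(&buf, "DEBUG", "cache miss for key abc")
        record(&buf, "ERROR", "upstream timed out")
        export(buf, false) // only ERROR survives retention, but both were recorded
    }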


The idea is to make the worst-case scenario the only scenario. Very clever.


It’s an idea but I’m not sure I’m convinced. There’s no evidence provided that doing this actually improves availability / reduces cascading failures / improves time to recover. And AWS doesn’t really have the best track record in terms of uptime…


What do you mean, obviously if the status dashboard is green, nothing is wrong. /s

Who are you going to believe, me, or your own lying eyes?


> Even when there are only a few health checks active, the health checkers send a set of results to the aggregators that is sized to the maximum. For example, if only 10 health checks are configured on a particular health checker, it's still constantly sending out a set of (for example) 10,000 results, if that's how many health checks it could ultimately support. The other 9,990 entries are dummies. However, this ensures that the network load, as well as the work the aggregators are doing, won't increase as customers configure more health checks

Won't the 10 health checks be less work, i.e. fewer CPU cycles, than 10,000 checks? Sure, you can iterate through 10k instead of 10, but performing a health check, parsing the result, etc. - won't that still scale with the number of health checks configured? How could that be a constant amount of work in this example?


If you assume some upper bound of checks per tick/checker (10k in the example), then they're performing that same, large amount of work at a regular interval. The author lightly addressed that when talking about O(1) vs O(C) work.

The problem being addressed is that if work increases over time, or when shit hits the fan, you're prone to running into cliffs or modes of operation, where the system transitions from behaving perfectly to failing completely. If you don't push the limits regularly, then you're unlikely to know what those limits might be. The solution here is to do extra work (you're definitely right, 10k checks is more than 10), but in such a way that you're constantly exercising that limit and have no variance in the work the system does. Running in that configuration for even a short period of time gives you confidence that your bounds are reasonable, and it protects the rest of the system from an explosion of extra work caused by retries and whatnot.


It's honestly a bit contradictory. It's budgeting the system to always run at the worst possible case and then it says:

> At the same time, there are cases where the constant work pattern doesn’t fit quite as well. If you’re running a large website that requires 100 web servers at peak, you could choose to always run 100 web servers. This certainly reduces a source of variance in the system, and is in the spirit of the constant work design pattern, but it’s also wasteful. For web servers, scaling elastically can be a better fit because the savings are large. It’s not unusual to require half as many web servers off peak time as during the peak. Because that scaling happens day in and day out, the overall system can still experience the dynamism regularly enough to shake out problems. The savings can be enjoyed by the customer and the planet.

Well, then why not just do a stress test on these things regularly rather than constantly running it at worst case? For example, once every few hours generate the 10k items but otherwise leave the system running normally. Same effect without running it always in "worst case" mode.


It's not really contradictory; it just says that this pattern has benefits and sometimes they're not worth the cost.

To your other point, that was my first thought as well. The main counterargument that comes to mind: if your systems (and the ones they interface with that you don't control) aren't all subject to the whole load at once (assuming you have more than one service running and thus care about the effects of interactions), or if there's some nondeterminism or emergent behaviour at play (so you're not just looking at monotonic components composing), then the bimodality of that system is still undesirable. Perhaps it hits a nice point on the cost/benefit curve, though.


Can someone explain why anti-fragility is the correct term to use here? It feels like this is a highly resilient and fault-correcting system, but the definition of anti-fragility is the ability to increase in capacity to thrive in the face of stressors -- how exactly is that the case here?


What a great read. The author truly understands what it takes to design reliable high-performance systems. It reminds me of how well designed hard real-time systems and high-performance games are implemented.


I confess I assumed those were percolators. Used to be a rather common way to make coffee, as I understand it.

Ironically, given the article's point, my understanding is that those fell out of favor for making bad coffee.


> Second, many coffee urns contain heating elements and thermostats, so as you take more coffee out of them, they actually perform a bit less work. There’s just less coffee left to keep warm.

I know this is just an analogy... but I don't think that's how heat transfer works. Heat loss here is primarily a function of the temperature difference across the urn walls and the thermal conductivity of the urn walls. I don't see how replacing coffee with air would slow down the heat loss.


I think he means it takes less energy to heat a smaller volume of coffee.


From cold, sure. But he's talking about an urn that started full of hot coffee, and then saying energy input required to maintain that temperature will decrease as the coffee volume decreases. What's the mechanism for that supposed to be?

The more I think about it, the more I think he's just wrong about the coffee thing. Contents of the urn have no effect on energy required to maintain its temperature.


I think OP is correct. Think about the extremes of the urn, full vs 1mL in it. It will require much less energy to heat the 1mL state than the full state.

Regarding the contents, the thermal conductivity of liquid (mostly water, for coffee) is higher than that of air, so it transfers heat out of the urn faster than an urn full of air would. It also has a higher heat capacity, which requires more heat energy per degree of temperature change.

Heat capacity is C = Q / ΔT, i.e. heat absorbed over temperature change; it increases with volume.
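
For reference, the two quantities this subthread is juggling, under the usual simplifying assumptions (constant specific heat, steady-state conduction through the urn wall via Fourier's law):

    % Energy needed to *change* the temperature of the contents:
    % heat capacity scales with the mass m of coffee (specific heat c_p).
    C = \frac{Q}{\Delta T} = m\,c_p

    % Power the heater must supply just to *hold* the temperature, i.e.
    % the steady-state loss through a wall of area A, thickness d, and
    % conductivity k; this depends on the temperature difference across
    % the wall, not on how full the urn is.
    P \approx \frac{k\,A\,(T_\text{coffee} - T_\text{ambient})}{d}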


Here I was expecting a pragmatic nerd approach to producing reliably good cups of coffee :(.


Check out James Hoffmann on youtube [1]!

[1] https://www.youtube.com/@jameshoffmann


Or you can check out Hames Joffman for an unhelpful summary:

[2] https://youtube.com/@hamesjoffmann


I only knew the painting from the meme with the badger. Thought it to be a random scene.


    s/a good cup/many cups/
As the article itself points out, unlike computing at scale, nothing about enormous warming urns of coffee ensures a good cup. Reliable and constant.


The best cup of coffee is the one I got right now.


Definitely not true. It's a mix of "hunger is the best spice" and "a bird in the hand..." But those don't compose, as it were. I've had many cups of coffee that I just poured out, because they were gross.


"Quantity has a quality of its own"


Stalin (who knew a thing or two about quantities taking on their own quality), in case anyone wonders. (though some have blamed Napoleon)


As long as you like your coffee brewed and served in the way _we_ decide, then we’ve got gallons of coffee for you.


Is the title a reference to the movie 'Stutz'?


The most offensive element in the article was the general dismissal of espresso as actual coffee, in favor of the brown sludge that fills most large vats at conference centers and, apparently, Amazon offices.

Sure, it makes for a good analogy, but by the same token if you ever find yourself in Italy the post-lunch rush is never delayed with long waits because pulling a shot is near-instantaneous.

In fact, espresso was a method devised earlier in the century to refuel factory workers quickly, hence it was drunk whilst standing, ready to get back to labor.



