At Scale, Rare Events Aren’t Rare (mvdirona.com)
218 points by mthurman on April 4, 2017 | 55 comments


One quote I've heard that I find helps non-experts:

You don't have to personally prepare for winning the lottery, but the lottery has to prepare for somebody winning.


As a pretty good rule of thumb, a system that fails 1/n of the time and has n opportunities to fail has a ~0.63 probability of failing at least once, for n greater than ~10.

Graph: http://www.meta-calculator.com/online/?panel-102-graph&data-...



Or as my first boss and mentor had a habit of saying, when you run a billion trials, one in a million events will happen about a thousand times.


Very nice rule of thumb, honestly I did not expect it to (sort of) converge to ~63%. Does anyone have some intuition for this?


If a system has probability 1/n to fail, then it has probability 1 - 1/n to not fail. The probability it will not fail after n trials is (1 - 1/n) ^ n. The limit of this quantity when n->+inf is 1/e.

If you want to know the probability it will fail, just take 1 - probability_success = 1 - 1/e.
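
A quick numerical check of that limit (a minimal Python sketch, not from the thread):

    import math

    # P(no failure in n trials) = (1 - 1/n)^n, which approaches 1/e as n grows,
    # so P(at least one failure) approaches 1 - 1/e ~= 0.632.
    for n in (10, 100, 10_000, 1_000_000):
        p_fail = 1 - (1 - 1 / n) ** n
        print(f"n = {n:>9}: P(at least one failure) = {p_fail:.6f}")

    print(f"limit:         1 - 1/e                 = {1 - math.exp(-1):.6f}")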


I'd say, hey, how do you calculate (1-h)^k? First take the natural log: ln((1-h)^k) = k ln(1-h) ≈ -kh. Then exponentiate back up: e^(-kh). (For small values of h, ln(1-h) ≈ -h by linear approximation.) (Edit: Wiped out looong comment.)
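
Numerically checking how good that approximation is (an illustrative sketch only):

    import math

    # Compare the exact (1 - h)^k against the approximation e^(-kh) for small h.
    for h, k in ((1e-2, 100), (1e-4, 10_000), (1e-6, 1_000_000)):
        exact = (1 - h) ** k
        approx = math.exp(-k * h)
        print(f"h = {h:g}, k = {k}: exact = {exact:.6f}, e^(-kh) = {approx:.6f}")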


I think by "intuition" the GP meant "for the non-mathematicians" :P


It's always amusing when someone asks for a layman/non-math/intuitive reason why something works out and HN responds with a 3-paragraph long proof that seems to always require university-level math. And it seems those comments almost invariably start with "Oh, you just..."


'pedrosorio gave a nice one upthread[0].

Ultimately, it's hard to give a math-free explanation for something that comes out straight from math. If you break down an explanation into small enough steps, they should be comprehensible for anyone even if they have to take some steps on faith.

--

[0] - https://news.ycombinator.com/item?id=14040434


He did, yes, I was just amused by the GP's answer!


Reminds me of my favorite "HN isn't the normal world" exchange:

https://news.ycombinator.com/item?id=35079


In a sibling thread on that page:

    cperciva 3548 days ago [-]

    That is my startup idea. I don't want to take this 
    thread even more off-topic (if that's even possible), 
    but please feel free to contact me at the address in 
    that first post to explain why you think it is a bad 
    idea.
 	
      dhouston 3548 days ago [-]

      we're in a similar space -- http://www.getdropbox.com 
      (and part of the yc summer 07 program) basically, 
      sync and backup done right (but for windows and os 
      x). i had the same frustrations as you with existing
      solutions. let me know if it's something you're 
      interested in, or if you want to chat about it 
      sometime.

      drew (at getdropbox.com)


10 machines with a 10% chance of failure are roughly equivalent to 100 machines with a 1% chance of failure.

I think it's confusingly worded: as n increases, the reliability of each node has to increase correspondingly to get the convergence. I'm not sure what real system this reflects, but I suppose it indicates at what point the problems of scale will bite (if you know your rough failure rate).
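
Plugging the two cases into the same formula (illustrative only, assuming independent failures):

    # P(at least one failure among n machines, each failing independently with probability p)
    for n, p in ((10, 0.10), (100, 0.01)):
        print(f"{n} machines at {p:.0%} each: P(at least one failure) = {1 - (1 - p) ** n:.3f}")
    # Both land near 1 - 1/e ~= 0.632, which is why they look "roughly equal".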


Honestly I don't remember, it's 1-(1/e) if that helps.

EDIT: http://math.stackexchange.com/questions/82034/prove-that-1-f...


Is this related to the fuel fraction of a rocket that can accelerate to its own exhaust velocity?


Coincidentally, a couple of months ago one of my professors told me about James Hamilton; apparently they met when he was studying at Waterloo. I started reading his blog.

This guy is brilliant, and his comments are often gems. The articles are good too, but think of them as conversation openers; the real deal is in the comments, imho. I recommend it. A somewhat funny article is about Zynga, the "prodigal kid" who left AWS for on-premises, only to come back later: http://perspectives.mvdirona.com/2015/05/the-return-to-the-c...

In the comments, he debunks (or takes a shot at debunking) the idea that cloud providers like AWS aren't a good fit for organizations with massive but stable workloads.

This guy is so cool. I only wish he had more time to write.


It reminds me of an old story about Microsoft Windows. Back in the early 2000s, compiling and building Windows from source took many hours on very specialized build hardware. Meanwhile, thousands of developers contributed to the full Windows stack. If any developer checked in a build break, the build would be delayed. Well, at that scale (thousands of developers), you can't get a clean Windows build even if each dev commits only one build break per year. Bad times...
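
A rough back-of-envelope version of that claim (the headcount and build time are assumptions for illustration, not actual Microsoft figures):

    devs = 4000                  # assumed number of developers on the full Windows stack
    breaks_per_dev_per_year = 1  # "one build break per year" from the story
    build_hours = 12             # assumed duration of a full build

    breaks_per_day = devs * breaks_per_dev_per_year / 365
    breaks_per_build = breaks_per_day * build_hours / 24
    print(f"~{breaks_per_day:.0f} build breaks per day")                            # ~11
    print(f"~{breaks_per_build:.0f} breaks land inside each {build_hours}h build")  # ~5
    # With several fresh breaks arriving during every multi-hour build,
    # a clean build of the whole tree essentially never completes.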


There was essentially instability and chaos in the big, dev-heavy divisions at MS when they all worked in one branch, but that led very rapidly to more sophisticated models with more points of validation between the average dev and the common code/builds that QA and everyone else shared and used.


The common pattern nowadays is to code review each commit, build each commit on Jenkins and run the tests, git bisect, etc. What was the procedure back then?


"don't break the build" was the procedure.


amen!


Do you mean before or after? Before, it was "do your best", which was never enough, not with everyone partying in the same branch. After, it was a matter of breaking things up into different feeder branches down to individual teams. Code that goes "up" into more mainline branches is required to meet a higher QC bar in order to get in, such as a full clean build of the entire set of sources (which for something like Windows, Office, or Visual Studio / .NET could take a very long time) and running the entire set of build verification tests with zero failures (also a long process).

Code would go up during an integration window, there would inevitably be some instability that would need to be cleaned up (test failures, maybe code breaks sometimes), which would get stabilized, and then code would flow back "down" to individual teams from those good builds. And code would flow up further into other branches that would be shared and used by more teams around the division/company.

Most dev groups these days use different systems because they tend not to have such monolithic projects, they can leverage automation better, and it's rare to have literally thousands of devs all working on the same software. When you can only get a few builds out per day, or maybe only one, then you have to become a lot more careful at keeping a separation between devs coding away at their desktops and the builds that everyone else depends on.


If the bar is set at breaking the build once a year, sounds like we're all average devs.


Speak for yourself, I'm an abysmal dev by that metric.


If I haven't broken a build in any given day you know I haven't written any code that day.


"One in a million is next Tuesday" from Larry Osterman is another great reflection on this from the software perspective. [1]

[1] https://blogs.msdn.microsoft.com/larryosterman/2004/03/30/on...


Love this article, but I think the headline actually makes the wrong point - this is a product management issue, not an "it may never happen" issue. That it takes someone like James to know two wildly different domains - both the business-level stakes (a $1M generator is worth potentially damaging if the alternative is a guaranteed $100M revenue loss) and the details of power engineering (overriding the switch only risks the generator, not a datacenter fire or loss of life) - is a shame.

Could the power engineering team have made this tradeoff more clear to the project managers doing the initial install? And yet, exposing a million little configuration options to the end-user isn't the right approach either.


It may be silly, but when something seems unlikely I stop and remind myself: if it's a "one in a million" chance, it happened to about 7,000 people today.


"one in a million" happens several times per second, in todays CPUs :)

Fun small experiment. Pick a random number between 1 and 15000000 (odds for winning the lottery in my country). Loop until you pick it again (i know it's pseudorandom and cylic; but the period seems big enough for the experiment).

Watch how freaking fast it happens (sub-second), and how many iterations it took (dozens of millions).
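
Roughly what that experiment looks like (a minimal sketch; 15,000,000 is the odds figure from the parent comment, and plain Python is slower per draw than compiled code, so expect seconds rather than sub-second):

    import random
    import time

    ODDS = 15_000_000                  # "1 in 15 million" lottery odds
    target = random.randint(1, ODDS)

    start = time.perf_counter()
    draws = 1
    while random.randint(1, ODDS) != target:
        draws += 1
    elapsed = time.perf_counter() - start

    print(f"hit the 1-in-{ODDS:,} number after {draws:,} draws in {elapsed:.1f}s")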


That's also why ECC memory is used... and the chance of corrupting a bit of memory is much lower than that.


Storage devices regularly go berserk in really novel and interesting ways when you have a large enough pool. On most projects I've worked on, I've known enough to fix the bugs, had a working theory of what the issues were and could fix them if I really needed to after higher priorities, or could somehow work around them. With storage devices, I'm frequently bewildered and stuck with the last category at best. There are times when I sit back and just think, wow, how amazing is it that computers work at all knowing the things that do go wrong.


> There are times when I sit back and just think, wow, how amazing is it that computers work at all knowing the things that do go wrong.

There is a special category of bugs named for that kind of feeling: they're called schrödinbugs. The idea is that once you've noticed that something couldn't possibly work, it promptly stops working.


The part about the switchgear vendor deciding to do something a certain way that the customer didn't want, because it could cause a rare failure, reminded me of something that happened to me. Way back I bought a 1500VA UPS to protect my home server; it was not an APC, but still a known brand. The decision was based on cost, as it was significantly less money.

One night I was near the server when the power went out, so I sat there waiting to see the auto shutdown. Soon enough the UPS told the server to shut down and it was well on its way to powering off. Just before it shut off, the power came back, and the UPS stopped beeping and went back to normal operation... while the server completed its shutdown.

And now I have a server that's off, and if I hadn't been around I would not have known what happened. When I got the UPS I did just one test, to make sure the server shut off and the UPS shut itself off without draining its batteries. This meant that I plugged it back in AFTER the UPS powered off. I never considered that the manufacturer of the UPS would botch the sequence where power is restored after the server has been told to shut down.

I contacted the manufacturer about this. I told them that after telling the server to shut down, there is only a brief window in which a power-restored signal could maybe abort the shutdown; once the UPS monitoring program is terminated during shutdown, there's no turning back. Nothing came of that.

So now I buy only APC gear. It does the proper thing: if AC power comes back after a shutdown command has been issued, the UPS continues the shutdown sequence, and when the UPS shuts off it sees that power is back, restarts itself, and the server comes back online.
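
For what it's worth, the behavior described reduces to a small state machine; a hypothetical sketch (the names and structure are mine, not any vendor's firmware):

    from enum import Enum, auto

    class UpsState(Enum):
        ON_MAINS = auto()
        ON_BATTERY = auto()
        SHUTDOWN_COMMITTED = auto()  # the host has already been told to shut down
        OFF = auto()

    def step(state: UpsState, mains_ok: bool, battery_low: bool) -> UpsState:
        if state is UpsState.ON_MAINS:
            return UpsState.ON_MAINS if mains_ok else UpsState.ON_BATTERY
        if state is UpsState.ON_BATTERY:
            if battery_low:
                return UpsState.SHUTDOWN_COMMITTED  # signal the host to shut down
            return UpsState.ON_MAINS if mains_ok else UpsState.ON_BATTERY
        if state is UpsState.SHUTDOWN_COMMITTED:
            # Key point: mains power returning does NOT abort the sequence; finish
            # the power-off, then restart so the host can boot cleanly on restore.
            return UpsState.OFF
        return UpsState.OFF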

Other manufacturers may do it correctly, and the one I dealt with might have clued in and fixed it, but I'm not willing to gamble anymore.


Though it's not all that rare, there's the question of dealing with death in large online social or identity networks.

With Google having some 3 billion Android / Chrome / Gmail profiles, and Facebook roughly as many users, standard actuarial statistics suggest that, even if allowing for multiple profiles per human, the number of newly dead accounts per day probably runs to the tens of thousands.

(Globally, deaths run about 120,000/day, so the figure's within reason.)
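
Back-of-envelope with round numbers (the population figure is an assumption; the death rate and account count are from the comment above):

    world_population = 7.5e9          # assumed, circa 2017
    global_deaths_per_day = 120_000   # "deaths run about 120,000/day"
    accounts = 3e9                    # "some 3 billion ... profiles"

    # Crude proportional estimate, ignoring age skew and multiple accounts per person:
    dead_accounts_per_day = global_deaths_per_day * accounts / world_population
    print(f"~{dead_accounts_per_day:,.0f} newly dead accounts per day")  # ~48,000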

Which means you probably want to think through your processes for such matters, as well as various related issues: mistakenly presuming a user has died, being falsely informed, how to handle data assets after death, and so on.

Scale matters.


Related: Wired profile of the author

https://www.wired.com/2013/02/james-hamilton-amazon/


Unfortunately most people don't realize this until they are at scale.


It's OK, since most people never get to "at scale". When you start getting these problems, it means you've reached the stars you were shooting for. Having those issues is a good thing.


Assuming you have the resources to cover these events; sometimes you hit scale, but not profitability.


Yes but in that case the scaling technical issues are not your main problem. Your main problem is your business model. Priorities, priorities...


Not always. It's possible to bury yourself under so much technical debt that you can't get out from under it. To extend the metaphor, you can end up spending all your revenue merely servicing the debt, unable to invest in paying it down. Your system can be so flawed that it hampers innovation and requires ongoing costly maintenance and support, which eats up all of your dev resources. The end result is that innovation is blocked off for you, and if you run into a bump that reduces your revenue significantly for a while, you can easily end up bankrupt.


Same answer as to the other comment.


It's amazing how many developers don't understand probability and think software isn't expected to survive hardware failures.



There was an article a while back that used the term "Walmart scale". If some weird customer interaction happens one time in a million transactions, it happens ten times a day at Walmart.


"You know, the most amazing thing happened to me tonight. I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight? Amazing!" -- Richard Feynman


Well, Murphy's law: any event with non-zero probability will occur at least once, given infinite time.


More like infinitely many times:

x * ∞ = ∞, for x > 0


In an infinite universe, nothing is rare, by that logic?

Rare is relative, so just because something happens a trillion times, it can still be nearly nonexistently rare, given that the data sample is trillions of trillions?

Seems like a silly thought...


I would have chosen a slightly different title, perhaps "at scale, improbable events are frequent" or something like that.

"At scale" of course just means much more frequent sampling; it's not some sort of alternate reality where good becomes evil, rare becomes not rare, etc.


Rare events are self-similar.


And yet we still haven't observed proton decay.


Partly because observing it requires very specific conditions, which actually makes the scale really tiny.


It's a one in a million shot. But it might just work.


There is a market opportunity here for serving customers who would rather risk broken generators but ensure constant power.



