There's a constant tension between speed of detection and false positive rates.
Traditional monitoring systems like Nagios and Icinga have settings where they only open events/alerts if a check failed three times in a row, because spurious failed checks are quite common.
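The mechanism itself is trivial (in Nagios that's roughly what max_check_attempts controls); here's a toy Python sketch of the consecutive-failure behaviour, not how either tool actually implements it:

    # Toy sketch of "only alert after N consecutive failed checks";
    # a single success resets the streak, so one-off flakes never page.

    class ConsecutiveFailureAlerter:
        def __init__(self, max_attempts: int = 3):
            self.max_attempts = max_attempts
            self.failures = 0

        def record(self, check_ok: bool) -> bool:
            """Feed one check result; return True if an alert should open."""
            if check_ok:
                self.failures = 0
                return False
            self.failures += 1
            # Fire exactly once, when the streak first reaches the threshold.
            return self.failures == self.max_attempts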
If you spam your operators with lots of alerts for monitoring checks that fix themselves, you stress them unnecessarily and create alert blindness, because the first reaction becomes "let's wait and see if it fixes itself".
I've never operated a service with as much exposure as CF's DNS service, but I'm not really surprised that it took 8 minutes to get a reliable detection.
I work on the SSO stack at a B2B company with about 200k monthly active users. One blind spot in our monitoring is when an error occurs on the client's identity provider because of a problem on our side. The service is unusable, yet we don't have any error logs to raise an alert from. We tried to set up an alert based on expected vs actual traffic, but we concluded that it would create more problems, for the reason you provided.
At Cloudflare’s scale on 1.1.1.1, I’d imagine you could do something comparatively simple like track ten-minute and ten-second rolling averages (I know, I know, I make that sound much easier and more practical than it actually would be), and if they differ by more than 50%, sound the alarm. (Maybe the exact numbers would need to be tweaked, e.g. 20 seconds or 80%, but it’s the idea.)
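Roughly this, as a toy Python sketch (the window sizes and the 50% threshold are just the made-up numbers above, and a real pipeline would first have to aggregate per-second counts across a huge number of edge nodes):

    from collections import deque

    SHORT_WINDOW = 10      # seconds
    LONG_WINDOW = 600      # seconds (10 minutes)
    DROP_THRESHOLD = 0.5   # alarm if short avg falls below 50% of long avg

    # Keep the last 10 minutes of per-second query counts; the short
    # window is just the tail of the same buffer.
    history = deque(maxlen=LONG_WINDOW)

    def observe(queries_this_second: float) -> bool:
        """Feed one per-second query count; return True if the alarm should fire."""
        history.append(queries_this_second)
        if len(history) < LONG_WINDOW:
            return False  # not enough history yet
        short_avg = sum(list(history)[-SHORT_WINDOW:]) / SHORT_WINDOW
        long_avg = sum(history) / LONG_WINDOW
        return short_avg < DROP_THRESHOLD * long_avg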
Were it a service much smaller than 1.1.1.1, taking longer than a minute to alarm probably wouldn’t surprise me, but this is 1.1.1.1: they’re dealing with vast amounts of probably fairly consistent traffic.
I work on something at a similar scale to 1.1.1.1; if we had this kind of setup our oncall would never be asleep (well, that is almost already the case, but alas). It's easy to say "just implement X monitor and you'd have caught this", but there's a real human cost, and you have to work extremely vigilantly at deleting monitors or you'll be absolutely swamped with endless false-positive pages. I don't think a 5 minute delay is unreasonable for a service at this scale.
This just seems kinda fundamental: the entire service was basically down, and it took 6+ minutes to notice? I’m just increasingly perplexed at how that could be. This isn’t an advanced monitor, this is perhaps the first and most important monitor I’d expect to implement (based on no closely relevant experience).
I don’t want to devolve this into an argument from authority, but there are a lot of trade-offs to monitoring systems, especially at that scale. Among other things, aggregation takes time at scale, and with enough metrics and numbers coming in, your variance is all over the place. A core fact about distributed systems at this scale is that something is always broken somewhere in the stack - the law of averages demands it - so if you’re going to do an all-fire-alarm alert any time part of the system isn’t working, you’ve got alarms going off 24/7. Actually detecting that an actual incident is actually happening on a machine of the size and complexity we’re talking about within 5 minutes is absolutely fantastic.
I'm sure some engineer at Cloudflare is evaluating something like this right now, replaying it against historical data to see how many false positives it would have generated in the past, if any.
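Something like replaying the candidate rule over saved per-second traffic and counting how often it fires outside known incident windows (purely illustrative; where the samples and the incident list come from is the hard part):

    # Purely illustrative backtest: replay a candidate alarm rule over
    # historical (timestamp, qps) samples and count how often it fires
    # inside vs. outside known incident windows.

    def backtest(samples, incident_timestamps, alarm_fn):
        true_positives, false_positives = 0, 0
        for ts, qps in samples:
            if alarm_fn(qps):
                if ts in incident_timestamps:
                    true_positives += 1
                else:
                    false_positives += 1
        return true_positives, false_positives

    # e.g. backtest(samples, incidents, observe) with the rolling-average
    # sketch above as the candidate rule.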
Thing is, it's probably still some engineering effort, and most orgs only really improve their monitoring after it turned out to be sub-optimal.
This is hardly the first 1.1.1.1 outage. It’s also just about the first external monitoring check I’d imagine you’d come up with. That’s why I’m surprised—more surprised the longer I think about it, actually; more than five minutes is a really long delay to notice such a fundamental breakage.
Is your external monitor working? How many checks failed, in what order? Across how many different regions or systems? Was it a transient failure? How many times do you retry, and at what cadence? Do you push your success or failure metrics? Do you pull? What if your metrics don’t make it back? How long do you wait before considering it a problem? What other checks do you run, and how long do those take? What kind of latency is acceptable for checks like that? How many false alarms are you willing to accept, and at what cadence?
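Even a toy answer to just a few of those questions already looks something like this (the region names, retry counts, and quorum are all made up):

    import time

    REGIONS = ["us-east", "eu-west", "ap-south"]   # hypothetical probe locations
    RETRIES = 3
    RETRY_DELAY = 5.0                              # seconds between attempts
    QUORUM = 2                                     # regions that must fail

    def region_fails(probe, region):
        """probe(region) returns True on success; retry before declaring it down."""
        for attempt in range(RETRIES):
            if probe(region):
                return False
            if attempt < RETRIES - 1:
                time.sleep(RETRY_DELAY)
        return True

    def is_outage(probe):
        failing = sum(region_fails(probe, r) for r in REGIONS)
        return failing >= QUORUM

Even this toy version spends tens of seconds retrying before it’s allowed to fire, and that’s before any aggregation, paging, or human reaction time.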