
I'm not surprised.

Let's say you've got a metric aggregation service, and that service crashes.

What does that result in? Metrics get delayed until your orchestration system redeploys that service elsewhere, which looks like a 100% drop in metrics.

Most orchestration systems wait a bit before redeploying in this case, assuming it could be a temporary outage of the node (like a network blip of some sort).

Sooo, if you alert after just a minute, you end up with people getting woken up at 2am for nothing.

What happens if you keep waking up people at 2am for something that auto-resolves in 5 minutes? People quit, or eventually adjust the alert to 5 minutes.

I know you can often differentiate between no data and real drops, but I think the overall point, that "if you page people constantly, people will quit," is the important one. If people keep getting paged by alarms that are tuned too tight, the alarms can and should be loosened... and that's one way you end up at 5 minutes.
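For what it's worth, the no-data-vs-real-drop distinction can be made pretty mechanical. A rough sketch, assuming a hypothetical fetch_request_rate() that returns None when the pipeline delivered nothing at all, and hypothetical page()/open_ticket() helpers:

    # Sketch only: route "metrics missing" and "traffic actually dropped" to
    # different alerts, so a crashed aggregator doesn't page like an outage.
    EXPECTED_MIN_RPS = 1000  # assumed baseline; tune per service

    def evaluate(fetch_request_rate, page, open_ticket):
        rate = fetch_request_rate()              # None => nothing arrived at all
        if rate is None:
            open_ticket("metrics pipeline gap")  # pipeline problem, lower urgency
        elif rate < 0.1 * EXPECTED_MIN_RPS:
            page("traffic down >90%")            # looks like a real drop, page now

Whether the ticket path should escalate after several minutes of continued silence is the part that's genuinely a judgment call.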





The real issue in your hypothetical scenario is that a single bad metrics instance can bring the entire thing down. You could deploy multiple geographically distributed metrics aggregation services which establish the “canonical state” through a Raft/Paxos quorum. Then, as long as a majority of metric aggregator instances are up, the system will continue to work.

When you are building systems like 1.1.1.1, an alert rollup of five minutes is not acceptable, as it will hide legitimate downtime that lasts anywhere up to five minutes.

You need to design systems which do not rely on orchestration to remediate short transient errors.
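To be clear, real consensus is more involved than this, but a toy majority-quorum read (made-up names, not actual Raft/Paxos) shows the shape of the idea:

    # Toy quorum read, not real Raft/Paxos: only trust the metric when a
    # majority of independently deployed aggregator instances answered.
    def quorum_value(readings):
        """readings: one value per aggregator instance, None if unreachable."""
        live = sorted(r for r in readings if r is not None)
        if len(live) <= len(readings) // 2:
            return None                # no quorum: report "unknown", never "zero"
        return live[len(live) // 2]    # median of the surviving instances

    # quorum_value([1200, None, 1180]) -> a plausible value
    # quorum_value([None, None, 990]) -> None, treated as a data gap, not a drop

A None here feeds the "metrics missing" path rather than the "traffic dropped" path, which is exactly the distinction a single aggregator instance can't give you.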

Disclosure: I work on a core SRE team for a company with over 500 million users.


It's not wrong for smaller companies. But there's an argument that a big, system-critical company/provider like Cloudflare should be able to afford its own always-on team with a night shift.

Please don't. It doesn't make sense, doesn't help, doesn't improve anything, and is just a waste of money, time, power, and people.

Now, without the dramatics: I've seen multiple big companies get rid of their NOC and replace it with on-call duties spread across multiple focused teams. Instead of 12 people sitting 24/7 in groups of 4, doing some basic analysis and first steps before calling others, you page the correct people within 3-5 minutes, with an exact and specific alert.

Incident resolution times went down dramatically (2-10x, depending on the company), people no longer have to sit overnight sleeping most of the time, and nobody takes pointless actions like a blind service restart that only slow down incident resolution.

And I don't like that some platforms hire 1,500 people for a job that could be done with 50-100, but in terms of incident response: if you already have teams with separated responsibilities, then a NOC is "legacy".


24/7 on-call is basically mandatory at any major network, which Cloudflare is. Your contractual relationships with other networks will require it.

I'm not convinced that the SWE crowd of HN, particularly the crowd showing up to every thread about AI 'agents', really knows what it takes to run a global network or what a NOC is. I know saying this on here runs the risk of Vint Cerf or someone like that showing up in my replies, but this is seriously getting out of hand now. Every HN thread that isn't about fawning over AI companies is devolving into armchair redditor analysis of topics people know nothing about. This has gotten way worse since the pre-ChatGPT days.

Lol preach

(Have worked as SRE at large global platform)

Over the last few years I've mostly just tuned out such responses and tried not to engage with them. The whole uninformed "Well, if it were me, I would simply not do that" style of comment has been pervasive on this site for longer than AI has, though, IMO.


> Every HN thread that isn't about fawning over AI companies is devolving into armchair redditor analysis of topics people know nothing about.

It took me a very long time to realize that^. I've worked with two NOCs at two huge companies, and I know they still exist as teams at those companies. I'm not an SWE, though. And I'm not certain I'd qualify either company as truly "global" except in the loosest sense - as in, one has "American" in the name of the primary subsidiary.

^ I've even regularly used the phrase "the comments were people incorrecting each other about <x>", so I knew subconsciously that HN is just a different subset of general internet comments. The issue is that this site appears to be moderated, and the group of people who self-select into commenting here seem like they would be above average at understanding and backing up claims. The "incorrecting" label comes from n-gate, which hasn't been updated since the early '20s, last I checked.


The question is, which is better: 24/7 shift work (so that someone is always at work to respond, with disrupted sleep schedules at regular planned intervals) or 24/7 on-call (with monitoring and alerting that results in random intermittent disruptions to sleep, sometimes for false positives)?

Not even a night shift, just normal working hours in another part of the world.

There are a few big steps/jumps as the size of a company goes up.

Step 1: You start out with the founders being on call 24x7x365, or people among the first 10 or 20 hires "carry the pager" on weekends and evenings, and your entire company is doing unpaid rostered on-call.

Step 2: You steal all the underwear.

Step 3: You have follow-the-sun office-hours support staff teams distributed around the globe with sufficient coverage for vacations and unexpected illness or resignations.


I confess myself bemused by your Step 2.

I'm like, come on! It's a South Park reference? Surely everybody here gets that???

<google google google>

"Original air date: December 16, 1998"

Oh, right. Half of you weren't even born... Now I feel ooooooold.


I think it is reasonable if the alarm trigger time is, say, 5-10% of the time required to fix most problems.

Instead of downvoting me, I'd like to know why this is not reasonable?

It's not rocket science. You do a two-stage thing: why not check whether the aggregation service has crashed before firing the alarm, if it's within the first 5 minutes? How many types of false positives can there be? You just need to eliminate the most common ones, and you gradually end up with fewer of them.

Before you fire a quick alarm, check that the node is up, check that the service is up, etc.
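Something like this, with every check hypothetical, just to show the ordering:

    # Two-stage sketch: rule out the boring explanations before treating a
    # sudden drop in the metric as a real outage. All inputs are hypothetical.
    def should_page_quickly(drop_detected, node_is_up, aggregator_is_up):
        if not drop_detected:
            return False
        if not node_is_up:
            return False   # infra problem: route to the node owners instead
        if not aggregator_is_up:
            return False   # metrics problem: open a lower-priority ticket
        return True        # the drop looks real: page immediately

Every false positive you identify becomes another early return.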


> How many types of false positives can there be?

Operating at the scale of Cloudflare? A lot.

* traffic appears to be down 90% but we're only getting metrics from the regions of the world that are asleep because of some pipeline error

* traffic appears to be down 90% but someone put in a firewall rule causing the metrics to be dropped

* traffic appears to be down 90% but actually the counter rolled over and Prometheus handled it wrong

* traffic appears to be down 90% but the timing of the new release just caused polling to show weird numbers

* traffic appears to be down 90% but actually there was a metrics reporting spike and there was pipeline lag

* traffic appears to be down 90% but it turns out that the team that handles transit links forgot to put the right ACLs around SNMP, so we're just not collecting metrics for 90% of our traffic

* I keep getting alerts for traffic down 90%... thousands and thousands of them, but it turns out that really it's just that this rarely used alert had some bitrot and doesn't use the aggregate metrics but the per-system ones

* traffic is actually down 90% because there's an internet routing issue (not the DNS team's problem)

* traffic is actually down 90% at one datacenter because of a fiber cut somewhere

* traffic is actually down 90% because, in the normal usage pattern, trough traffic volume is 10% of peak traffic volume

* traffic is down 90% from 10s ago, but 10s ago there was an unusual spike in traffic.

And then you get into all sorts of additional issues caused by the scale and distributed nature of a metrics system that monitors a huge global network of datacenters.





