Discord is entirely down right now, both the website and the app itself. Amusingly, a lot of the sites that normally track outages are also down, which made me think it was my internet at first. Downdetector, monitortheinternet, etc.
Lots of other big sites that are down: Patreon, npmjs, DigitalOcean, Coinbase, Zendesk, Medium, GitLab (502), Fiverr, Upwork, Udemy
Edit: 15 min later, looks like things are starting to come back up
I host a dedicated server there (running https://www.circuitlab.com/) and when I traceroute/ping news.ycombinator.com, it's two hops (and 0.175 ms) away :)
This was a BGP/routing issue and has already been documented. Please don't spread misinformation and hysteria, especially on technical issues like this
It was a misconfiguration we applied to a router in Atlanta during routine maintenance. That caused bad routes on our private backbone. As a result, traffic from any locations connected to the backbone got routed to Atlanta. It resulted in about 50% of traffic to our network not resolving for about 20 minutes. Locations not connected to our backbone were not impacted. It was a human error. It was not an attack. It was not a failure or bug in the router. We're adding mitigations to our backbone network as we speak to ensure that a mistake like this can't have broad impacts in the future. We'll have a blog post with a full explanation up in the next hour or so — being written now by our CTO.
Don't think so. They used to use Cloudflare but stopped. To my knowledge, it's a single server without a database (using the filesystem as a database).
>Holy crap I am thinking either there is some magic or everything we are doing in the modern web are wrong.
Spin up an apache installation and see how many requests you can serve per second if you're just serving static files off of an SSD. It's a lot.
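If you want to see the number for yourself, even a crude single-threaded client makes the point. A rough sketch, assuming you already have some static server running locally (Apache, nginx, or `python -m http.server`) at the hypothetical URL below:

    # Crude requests-per-second measurement against a local static server.
    # The URL and request count are placeholders; a real benchmark tool like
    # ApacheBench or wrk will push the server much harder than this one client.
    import time
    import urllib.request

    URL = "http://localhost:8000/index.html"   # hypothetical local static file
    N = 5000

    start = time.time()
    for _ in range(N):
        with urllib.request.urlopen(URL) as resp:
            resp.read()
    elapsed = time.time() - start
    print(f"{N} requests in {elapsed:.2f}s -> {N / elapsed:.0f} req/s (one client)")

Even this naive loop will typically report hundreds or more requests per second for a small file, and concurrent clients only push the server's total throughput higher.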
edit: I see that there are already a bunch of other comments to this effect. I think your comment is really going to bring out the old timers, haha. From my perspective, the "modern web" is absolutely insane.
> *From my perspective, the "modern web" is absolutely insane.*
Agreed.
I was brought up as a computer systems engineer... So, not a scientist, but I always worked with the basic premise of keeping it simple. I've worked on projects where we built all the newfangled clustering and master/slave (sorry to the PC crowd, but that's what it was called) stuff but never once needed it in practice. Our stuff could easily handle saturated gigabit networks with the 2-core CPU only running at 40%. We had CPU to spare and could always add more network cards before we needed to split the server. It was less maintenance, for sure.
It also had self-healing, so that some packets could be dropped if the client config allowed it and the server decided it wanted to (but it only ever did on the odd dodgy client connection).
That said, I was always impressed by the map-reduce Google used for search results (yes, I know they've moved on), which showed how massive systems can be fast too. It seemed that the rest of the world wanted to become like Google, and the complexity grew for the standard software shop, when it didn't need to imho.
I jumped ship at that point and went embedded, which was a whole lot more fun for me.
There is only so much I can do - I'm not American and that's a change I can't make on my own, aside from doing my best to be an ally when possible.
However, I can open a few PRs and use some of my time to make that change. It's a minor inconvenience to me, and if it makes even one black person feel heard and supported then yeah, I'm gonna do it.
>I think your comment is really going to bring out the old timers, haha.
That is great! :D
>It's a lot.
Well yes, but HN isn't really static, though. It's fairly dynamic, with a huge number of users and comments. But still, I think I need to rethink a lot of assumptions in terms of speed, scale and complexity.
Huge numbers of users don't really mean that much. Bandwidth is the main cost, but that's kept low by having a simple design.
Serving the same content several times in a row requires very few resources - remember, reads far outnumber writes, so even dynamic comment pages will be served many times in between changes. 5.5 million page views a day is only 64 views a second, which isn't that hard to serve.
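If anyone wants to check that arithmetic, it's just the number of seconds in a day:

    # Back-of-the-envelope: daily page views to average views per second.
    views_per_day = 5_500_000
    seconds_per_day = 24 * 60 * 60                 # 86,400
    print(round(views_per_day / seconds_per_day))  # ~64 views/second on average

Peak traffic will be some multiple of that average, but the order of magnitude is the point.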
As for the writes, as long as significant serialization is avoided, it is a non-issue.
(The vast majority of websites could easily be designed to be as efficient.)
There is some caching somewhere as well, which probably provides a bit more of a boost.
I've been at my work laptop (not logged in) and found something I wanted to reply to, so I pulled out my phone and did so. For a good 10 seconds afterwards, I could refresh my phone and see my comment, but refresh the laptop and not see it.
> From my perspective, the "modern web" is absolutely insane.
You know, it should be even better than it was in the past, because a lot of heavy lifting is now done on the client. If we properly optimized our stuff, we could potentially request tiny pieces of information from servers, as opposed to rendering the whole thing.
Kinda like native apps can do (if the backend protocols are not too bloated).
> Holy crap I am thinking either there is some magic or everything we are doing in the modern web are wrong.
It doesn't need to be crazy.
A static site on DigitalOcean's $5 / month plan using nginx will happily serve that type of traffic.
The author of https://gorails.com hosts his entire Rails video platform app on a $20 / month DO server. The average CPU load is 2% and half the memory on the server is free.
The idea that you need some globally redundant Kubernetes cluster with auto fail-over capabilities seems to be popular but in practice it's totally not necessary in so many cases. This outage is also an unfortunate reminder that you can have the fanciest infrastructure set up ever and you're still going down due to DNS.
> The idea that you need some globally redundant Kubernetes cluster with auto fail-over capabilities seems to be popular but in practice it's totally not necessary in so many cases
True, but this is why it shouldn't be bashed either. When you need it, you need it (cue very complex enterprise applications with SLA requirements).
> True, but this is why it shouldn't be bashed either. When you need it, you need it (cue very complex enterprise applications with SLA requirements).
To support this, look at how many people criticize Kubernetes as being too focused on what huge companies need instead of what their small company needs. Kubernetes still has its place, but some people's expectations may be misplaced.
For a side project, or anything low-traffic with low reliability requirements, a simple VPS or shared hosting suffices. Wordpress and PHP are still massively popular despite React and Node.js existing. Someone who runs a site off of shared hosting with Wordpress can have a very different vision of what their business/side project/etc. will accomplish compared to someone who writes a custom application with a "modern" stack.
We were serving around that traffic quite happily off a single dual Pentium III running IIS/SQL Server/ASP in 2002. The amount of information presented has not grown either.
That little box hosted the main corporate web sites of some top-tier brands too, and was pumping out 30-40 requests a second at peak. There was no CDN.
You were not serving that traffic, you were just serving your core functionality - no tracking, no analytics, no ads, no A/B testing, no dark mode, no social login, no responsiveness. Are most of those shitty? Sure; just let me know when you figure out how to pry a penny from your users for all your hard work.
That means an average of about 63 pages per second. Let's say that the total number of queries is tenfold that, take a worst-case scenario and round up to 1,000 queries per second, and then multiply by ten to get 10k queries per second, because why not.
I don't know what the server's specs are, but I'm sure it must be quite beefy and have quite a few cores, so let's say that it runs about 10 billion instructions per second. That means a budget of about one million instructions per page load in this pessimistic estimate.
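Spelled out with those numbers (the 10 billion instructions per second is the guess above, not a measured spec):

    # Pessimistic instruction budget per page, using the estimates above.
    avg_pages_per_second = 5_500_000 / 86_400     # ~63.7 on average
    worst_case_qps = 10_000                       # rounded up, then multiplied by ten
    instructions_per_second = 10_000_000_000      # guessed ~10 billion instr/s
    print(f"average load: {avg_pages_per_second:.1f} pages/s")
    print(f"budget: {instructions_per_second / worst_case_qps:,.0f} instructions per page")

That prints a budget of 1,000,000 instructions per page load.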
The original PlayStation's CPU ran at 33MHz and most games ran at 30fps, so about 1 million cycles per fully rendered frame. The CPU was also MIPS and had 4KiB of cache, so it did a lot less with a single cycle than a modern server would. Meanwhile, the HN server has the same instruction budget to generate some HTML (most of which can be cached) and send it to the client.
A middle of the line modern desktop CPU can nowadays emulate the entire PlayStation console on a single core in real time, CPU, GPU and everything else, without even breaking a sweat.
>Holy crap I am thinking either there is some magic or everything we are doing in the modern web are wrong.
That's only ~60 QPS, assume it is peaky and hits something more like 1000 QPS in peak minutes, but also assume most of the hits are the front page which contains so little information it would fit literally in the registers of a modern x86-64 CPU.
Even a heavyweight and badly written web server can hit 100 QPS per core, and cores are a dime a dozen these days, and storage devices that can hit a million ops per second don't cost anything anymore, either.
Unlike Reddit, logged in and logged out users largely see the same thing. I wouldn't imagine there is much logic involved in serving personalized pages, when they don't care who you are.
The username is embedded in the page, so you can't do full caching unfortunately. But the whole content could be easily cached and concatenated to header/footer.
That could be split as a setting in the user-specific header with the visibility part handled client-side. A bit more work, but it's not impossible if it's worth it.
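A minimal sketch of that split, caching the rendered body once and concatenating a tiny per-user header at request time. The data store and rendering here are made-up stand-ins, not how HN actually does it:

    # Cache the expensive rendered body; only the small header is per-user.
    from functools import lru_cache

    ITEMS = {1: ["first comment", "second comment"]}   # fake comment store

    @lru_cache(maxsize=4096)
    def cached_body(item_id, version):
        # Expensive part: rendered once per content version, reused for everyone.
        return "".join(f"<p>{c}</p>" for c in ITEMS[item_id])

    def page(item_id, version, username=None):
        header = f"<div class='topbar'>{username or 'login'}</div>"  # per-user bit
        return header + cached_body(item_id, version)

    print(page(1, 1, "alice"))
    print(page(1, 1))   # same cached body, different header

Bumping `version` whenever the thread changes is one simple way to invalidate the cached body without tracking keys by hand.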
It's amazing how little CPU power it takes to run a website when you're not trying to run every analytics engine in the world and a page load only asks for 5 files (3 of which will likely already be cached by the client) that total less than 1 MB.
It is certainly doable; PoF famously ran a lot of page views off a single IIS server for a long time.
HN is written in a Lisp variant and most of the stack is built in-house; it is not difficult to imagine efficiency improvements when many abstraction layers have been removed from your stack.
I don't remember PoF being famous for that, but they got a lot of bang for the buck on their serving costs.
What I do remember is that it was a social data-collection experiment for a couple of social scientists, who never originally expected that many people would actually find ways to find each other and hook up using it.
I miss their old findings reports about how weird humans are and what they lie about. Now, it's just plain boring with no insights released to the public.
For all my sibling comments, there is also context to be aware of. 5.5m page views daily can come in many shapes and sizes. Yes, modern web dev is a mess, but the situation is very different from site to site. This should be taken as a nice anecdote, not as a benchmark.
With DO these days, they don't run me out of bandwidth, but my instance falls over (depending on what I am doing, which ain't much). With AWS, they auto-scale and I get a $5,000 bill at the end of the month. I prefer the former.
Yeah, the bandwidth overage was a grand, give or take. It was a valuable lesson in a number of ways, and why for any personal things I wouldn't touch AWS with a shitty stick.
That's why we have these bloated, over-engineered multi-node applications: people just underestimate how fast modern computing is. Serving ~2^6 requests/sec is trivial.
1M queries per day is ~10 queries per second. It's a useful conversion rate to keep in mind when you see anyone brag about millions of requests per day.
I don't get this comment, what does page serving performance have to do with "the modern web"? It's not as if serving up a js payload would make it more difficult to host millions of requests on a single machine, html and js are both just text.
Makes sense based on what I've read about Arc, which HN is written in.
I've been working on something where the DB is also part of the application layer. The performance you can get on one machine is insane, since you spend minimal time on marshalling structures and moving things around.
Only if you use their "infinitely scaling" services, e.g. S3. If the attacker is hammering you with expensive queries and your database is on one EC2 server, you're still going to go down.
My iPhone actually popped up a message saying that my wifi didn't appear to have internet, which was strange and obviously false as I was actively using the internet on it and the laptop next to it, but now it makes sense that it must have been pinging something backed by cloudflare!
Discord attempted to route me to: everydayconsumers.com/displaydirect2?tp1=b49ed5eb-cc44-427d-8d30-b279c92b00bb&kw=attorney&tg1=12570&tg2=216899.marlborotech.com_47.36.66.228&tg3=fbK3As-awso
It looks like you mistyped it and landed on a domain with spammy redirects. They have all kinds of weird URLs and there's not always any connection to anything you did other than go to the wrong domain.
We're dealing with a deeper-level problem here. Since a lot of the internet relies on Cloudflare DNS at some point or another, even many backup solutions fail.
Since so much of DNS is centralised in so few services, such outages hit the core infrastructure of the internet.
A sudden disruption on a large number of services for everybody at once doesn't look like a DNS problem to me, with all the caching in DNS. It would fail progressively and inconsistently.
DNS absolutely was an issue. I changed DNS manually from Cloudflare's 1.0.0.1 (which is what DHCP was giving me) to 8.8.8.8 (Google) and most things I'm trying to reach work. There may be other failures going on as well, but 1.0.0.1 was completely unreachable.
No, I changed the setting back and forth while it was down to confirm that the issue was that I could not reach 1.0.0.1. All the entries I tried from my host file were responsive (which is how I ruled out an issue with my local equipment initially and confirmed that it wasn't a complete failure upstream -- I could still reach my servers). Changing 1.0.0.1 to 8.8.8.8 allowed me to reach websites like Google and HN, and changing back to default DNS settings (which reset to 1.0.0.1, confirmed in Wireshark) resulted in the immediate return of failures. 1.0.0.1 was not responsive to anything else I tried.
Again, it may not have been the only issue -- and there are a number of possible reasons why 1.0.0.1 wasn't reachable -- but it certainly was an issue.
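For anyone who wants to reproduce that sort of check without touching system DNS settings, you can query specific resolvers directly. A sketch using the third-party dnspython package (2.x API; the hostname is just an example):

    # Ask specific resolvers directly, bypassing the system-configured DNS.
    import dns.resolver

    def check(server, name="news.ycombinator.com"):
        r = dns.resolver.Resolver(configure=False)  # ignore the OS resolver config
        r.nameservers = [server]
        r.lifetime = 3.0                            # give up after 3 seconds
        try:
            answer = r.resolve(name, "A")
            print(server, "->", [a.address for a in answer])
        except Exception as e:
            print(server, "-> FAILED:", type(e).__name__)

    check("1.0.0.1")   # Cloudflare
    check("8.8.8.8")   # Google

During the outage the first check would have timed out or returned SERVFAIL while the second kept answering, which is exactly the comparison described above.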
> I don't use cloudflare DNS but google DNS and got the same problems as everyone else
Cloudflare is also the authoritative DNS server for many services. If Cloudflare is down, then for those services Google's DNS has nowhere to get the authoritative answers from.
I can live without creepy instant messengers, but it's shocking just how much everything else relies on one central system. And furthermore, why is it always Cloudflare?
Idk about the UI; the redesign, with the forced collapses and extra clicks everywhere, is annoying when handling a multitude of subdomains plus their Let's Encrypt TXT entries. I'm there mostly for the freebies.
I did a pickup order a few weeks ago when, of all things, T-Mobile SMS went down for 3+ hours. I couldn't go in the restaurant (covid) and I couldn't text them the parking spot # I was sitting at in a packed parking lot. I got a flood of about 50 texts a few hours later. Sat there for about an hour waiting for a $9 sandwich. I have no idea if they didn't get my order until late, or if they finally realized it was me, or what. About 45 minutes in I decided to just give up on the day and take a nap; woke up to a door knock.
Kudos to the people at Discord. Just a few minutes after I got disconnected they had already tweeted about the issue. A few minutes later they had a message in their desktop app confirming it's an issue with Cloudflare. All while Cloudflare's status page said there were 'minor outages'.
As a percentage of total traffic, a 'minor' outage for Cloudflare probably equates to a significant outage for a non-trivial amount of the internet.
It will also be especially noticeable to end-users, because sites using Cloudflare are typically high-traffic sites, and so a 'minor' issue that affects only a handful of sites is still going to be noticed by millions of people.
I wonder if they are all using Cloudflare's free DNS stuff or if they're paying for business accounts?
My stuff is on Netlify (for the next week or so) and the rest is on a VPS bought from a local business who isn't reselling cloud resources. I'm kinda glad I moved all my stuff from cloudflare.
Crazily, my local name resolution started failing, because I have these name servers: 192.168.0.99, 1.1.1.1 and 8.8.8.8. The first does the local resolution, but macOS wasn't consulting it because 1.1.1.1 was failing?? Crazy. When I removed 1.1.1.1 from the list, everything started working.
Thought something like this was going on. At first I thought it was my router and restarted everything - to no avail. Glad to see confirmation that it wasn't an issue on my end.
Freenode's IRC servers were down which was unexpected for me. I was expecting old-school communication networks to not have a dependency on Cloudflare.
It really defies the original vision of the internet to have so many services depend on a single company. Almost every news site I was reading dropped off at once. I thought for a second that I lost internet in my own house.
Yes, it's really odd that core backbone providers can go down and everything works like it's supposed to. Even trans-Pacific cables can be cut and things will usually work with only increased latency. But there is not much redundancy for many companies at this layer; having redundant DNS providers is, I'm sure, possible but not something we think about very often, and of course many of the sites that are down are depending on the proxy and DoS mitigation services.
On my home network I use Google as a backup DNS provider so the whole internet didn't go dark for me, but I don't have a backup DNS host for my company's DNS records.
Redundant DNS is possible, but challenging when you're making use of features like geo DNS that don't lend themselves to easy replication via zone transfer.
I imagine most people would never expect something like this to happen, so having a fallback option for when Cloudflare has a huge interruption of service like this just isn't something they think about.
All the major cloud infrastructure providers have had outages of varying severity at one point or another...it's something you'd want to take into account for, say, a system that remote controls life-critical devices, but likely isn't worth the engineering time and added complexity for a productivity or social app with a small userbase. Working on many of the latter over the years I've generally said "well if {major cloud provider} is down, the internet is going to be all messed up for a bit anyway, so we'll accept the risk of being down when they're down, and reassess whether that keeps making sense as we grow."
>"well if {major cloud provider} is down, the internet is going to be all messed up for a bit anyway, so we'll accept the risk of being down when they're down, and reassess whether that keeps making sense as we grow."
This is a very common pattern and falls into the 'nobody got fired for buying cisco/microsoft/intel' trap.
I have two issues with it:
1) It entrenches the largest provider.
You would not extend the same leniency on outages to the third-best cloud provider; this means that people will just keep pushing the monopoly forward, even if the uptime or service is actually better on another provider.
2) You create a tight coupling of monocultures;
Simply put: You slowly erode the internet. Your site becomes an application in a distributed mainframe operated by a tiny minority of tech companies universally based in the US.
Why is this a problem? I could give moral answers here, but I think pragmatic ones are more convincing.
Giving ownership of the internet to the few gives them the ability to set the rules.
If you're on Amazon's AWS, what's to say they don't inspect your e-commerce systems and incorporate your business logic into their amazon.com shopping experience? They already do this to their marketplace and create competing products[0].
If you're doing really well, why not just drop a few packets here and there? I mean, they won't... you're paying, right?
Hell, if you do super well they can just change the rules and make it so your services get expensive in the exact way you use them, or even legislate you off the platform entirely.
Probably won't happen, but it's a lot of trust, you have to admit, and people shit on Apple for having that kind of power, and Apple is not even competing in the same market as most people on this site.
If you're on Google Cloud (which, I'm a fan of, btw) and you feed an ML model, well, you paid for it, but why shouldn't they also have a copy... after all, it's green to do so!
Their bigdata platform? Google loves data! Feed the beast.
The trust you have to give AWS or other cloud providers is not that different than the trust you give any number of vendors (email service, phone service, etc.). You have a contract with them that says they won’t do those things, and if they ever get caught doing them all the on-premise enterprise they’re spending a lot of effort on getting on board will dry up instantly.
Amazon can copy your business model just fine without looking at your servers. Most of what’s on your servers is probably irrelevant from their perspective.
Agreed, but the real problem is DDoS and nobody seems to know how to globally solve it. Fighting DDoS is expensive, so you see consolidation. It's well and good to live in a tiny farming town but when raiders start attacking every week, those castle walls and guards start to look really appealing.
It's nice that Cloudflare provides their services for free but scrubbing has existed for a long time. With your own address space and an appropriate budget it's not difficult to have Cloudflare/Akamai/AWS announce your IP space with a higher weight than a direct path to your infrastructure. That will give you a little bit more fault tolerance for incidents like these.
That's what we get for externalizing costs. It's not hard to track down sources, but network operators usually let it be, hence the incentives are probably counter-productive.
Agreed, but I think people really underestimated the forces at work that would cause so much consolidation into a couple internet giants.
The original idea was that with the barrier to entry being so low, anyone and everyone could set up their own websites, mail servers, etc.
But with it being so easy to compare and contrast service (i.e. the market being so open), it means that the competitive forces naturally consolidate to a winner-take-all model. If when starting out Cloudflare was just 5% better than the competition, it could have easily taken the vast majority of the mindshare on the internet. Couple that with the fact that there are huge advantages with scale to a business like Cloudflare's, and it's not hard to see how so much of the internet has become dependent on it.
DNS is far less of a single point of failure and more decentralized than Cloudflare. Nameservers can be and are operated redundantly via simple, resolver-side round-robin scheduling, and the TLD servers should have longer TTLs that allow plenty of caching. The root zone even has anycast thanks to using UDP. Take a moment to look at DoH and laugh.
You can also register your domain on multiple TLDs.
The "decentralized internet" folks always talk a lot about fighting corporate control. I think they should spend more time talking about resiliency and blast-radius reduction.
> Unlike previous DNS replacement proposals, D3NS is reverse compatible with DNS and allows for incremental implementation within the current system.
And the worst is, if you try to raise concerns about Cloudflare now, it gets brushed off as "CF already proxies half the internet; if it goes down, our stuff will be a minor concern".
I don't understand why the big companies don't always have at least two CDN providers, so they can failover to another one if something like this happens.
I know a lot of big companies do, but I am always surprised when you see ones that don't.
This isn't true... you can certainly do redundant dns with automatic failover between providers. Just set up NS records pointing to different providers.
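As a sketch of what that delegation looks like once it's in place (dnspython again; example.com is a placeholder for your own domain):

    # List a domain's delegated nameservers; with redundant providers you'd
    # expect to see hosts from more than one vendor in this set.
    import dns.resolver

    answer = dns.resolver.resolve("example.com", "NS")
    print(sorted(ns.target.to_text() for ns in answer))
    # e.g. ['ns1.provider-a.net.', 'ns1.provider-b.net.', ...] rather than
    # several nameservers all under a single provider's domain.

The catch, as mentioned elsewhere in the thread, is keeping the zones at both providers in sync, which is easy with plain records and harder with proprietary geo/failover features.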
My CRM was nonfunctional. That's some critical infrastructure for me. And then I'm wondering, is it me or is it my CRM? Turns out it's door #3 - Cloudflare.
The point of the status page is so you can point to it for your five nines SLA and go "look? we were only down for one hour". As soon as the money relies on the metric, the metric will reflect the money.
Despite their update, I like how they're saying only their recursive DNS had "degraded performance", while authoritative is "operational". The entire reason everything blew up was because their authoritative nameservers weren't responding.
I mean, I guess you're entitled to look at it that way, but I don't think it's dishonest of them to distinguish between "nothing is working" and "some regions aren't working".
Ahh, I remember when AWS went down (I think it was 2 years ago now?), or at least a data center in us-east. A majority of the internet went down, and the status page went down as well. Man, good times.
Status pages are a marketing channel, not a channel for developers, most of the time. An update most likely has to go through some layers before someone changes the status page.
I don't think it's just Cloudflare; I just had a fun 10 minutes seeing servers start flipping on my Server monitoring service[1]. This has only happened once or twice per year, and is usually due to weird global DNS issues.
(To give an update, I'm seeing from my monitoring systems (about 15 points around the globe) sporadic outages for Microsoft, Apple, Reddit, Bing, Node.js, Twitter, Yahoo, and YouTube. And my own servers (not behind CF at all) are also flipping up and down. It started around 21:14 UTC.)
Eh, what? There are many good reasons to have low TTL DNS...this exact outage being one of them. Update your records to go direct to your servers, and not through Cloudflare and bam you’re back up. Doesn’t work if your TTL is 86400
Doesn't help, as Cloudflare wants you to use their name servers, so you can't flip any records if the DNS itself is in trouble, like it is now.
And changing DNS servers often takes many hours (or days, if .net is involved apparently)
It was interesting that we saw our domains affected from the USA but from Mexico everything looked OK.
The crazier thing is that I tried to log in to our Cloudflare account and it never sent me the 2FA code... I still haven't been able to log in (Enterprise account).
We were down (downforeveryoneorjustme.com) completely, but back up now (as of a few minutes ago). Our domain wasn't even resolving; we use Cloudflare for frontend and DNS.
We had a surge of people checking if Discord was down on our site, then I noticed everything went down shortly after. Discord is still the top check right now.
I can't ever remember hitting these kinds of traffic numbers before.
Funny, I tried to use your site because the website I was trying to access stopped working. But your site was also down so I figured it was just my internet being cranky :/
Thanks! Yep, we have a lot of things on the todo. We want to add more user-focused / location-based outage information since our site is still too reliant on simple HTTP checks to report downtime. This is especially a problem with a Discord outage, for example, where the frontend website is not down, but there might be problems with the API, apps, or other components.
And I'd like to be able to have our site communicate outages like this Cloudflare one, where more than one site might be affected by a larger provider. Automating that is difficult.
This is still a side project, though, so I mostly work on it when I get the urge :)
That would be pretty interesting. Being able to drill down into individual pieces of a stack would be very informative, especially from a parent source. I bet a lot of services would even have that information readily available in their documentation for APIs.
How comfortable are you with open source? If you were willing to release your stack on Gitlab/Github, it might be worth your while.
Something’s wonky, because it’s not just Cloudflare. One of my personal sites is down that uses nothing but a VPS, and I noticed my Unifi AP disconnect from its controller a little bit ago. Fiber cut? Routing issues?
Huh. My Ubiquiti was reporting WAN link down during this outage. I'm using ATT fiber. I'm wondering if "link down" doesn't mean what I think it means. Now that I check, it says "WAN iface [eth2] transition to state [inactive]". I'm wondering if that means link down or if it's doing service checking.
I actually have a WAN2 configured but not plugged in and it was set to "Load Balancing: Failover Only" ... I wonder if all of my 'connection issues' were software assuming my network link is down and switching interfaces to an unavailable one.
To reply to myself: if you have a second interface configured for failover, it actually tests against ping.ubnt.com. I bet every single time my ATT fiber has "gone out" for a minute or two at a time, it's been bogus.
root@USG-PRO-4:~# show load-balance watchdog
Group wan_failover
  eth2
  status: Running
  pings: 2
  fails: 0
  run fails: 0/3
  route drops: 0
  ping gateway: ping.ubnt.com - REACHABLE

  eth3
  status: Waiting on recovery (0/3)
  failover-only mode
  pings: 1
  fails: 1
  run fails: 3/3
  route drops: 1
  ping gateway: ping.ubnt.com - DOWN
  last route drop : Fri Jul 17 17:32:58 2020
We can't keep going on like this. The vulnerability of centralised internet infrastructure is a huge problem for everyone. Somebody, somewhere, really ought to sort it all out
10-20 minute router misconfigurations and subsequent fixes are sometimes a fact of life. big network infrastructure is complicated, and sometimes the best laid route tables of mice and men do go abloop and die.
Outages happen no matter what the infrastructure is. There's no solution, they're just something you need to recognize and handle, which Cloudflare seemingly did relatively quickly here.
I feel like for a lot of sites CF & CDNs are the only way to survive Reddit/HN/etc - do you disagree?
I definitely agree in concept with you, but then I think back to how frequently script kiddies took down sites ~10 years ago, or w/e. I feel like what has changed is the massive CDNs in front of so many sites.
So while I do want a better solution, I'm not sure what it looks like. Thoughts?
Does it really matter? If you're small, who cares if you go down for half an hour? What, you'll make $0.02 this hour instead of $0.05? If you're big, you can afford your own infrastructure. Stick a few servers in a few colos around the world and you'll have better uptime than CF and friends anyway.
> I feel like for a lot of sites CF & CDNs are the only way to survive Reddit/HN/etc - do you disagree?
Reddit/HN/etc will send all users to the same URL. Almost all of those users will come without any pre-existing cookies. Serving the same content to all those users should not be impossible for most sites without CF or a CDN.
a) complexity: trick your servers into doing something hard
b) volumetric: overwhelm your servers with a lot of traffic
c) volumetric part two: overwhelm your servers with a lot of requests, so you respond with a lot of traffic
A and C are things you can work on yourself --- try to limit the amount of work your server does in response to requests, and/or make resource-consuming responses require resource-consuming requests; and monitor and fix hotspots as they're found.
B is tricky; there are two ways to handle a volumetric attack: either have enough bandwidth to drop the packets on your end, or convince the other end to drop the packets (usually called null routing). Null routes work great, but usually drop all packets to a particular destination IP, which means you need to move your service to another IP if you want it to stay online; that's hard to do if your IP needs to stay fixed for a meaningful time (TTL for glue records at TLDs is usually at least a day), and IP space is limited, so if your attackers are quick at moving attacks, you could run out of IPs to use. Some attacks are going above 1 Tbps though, so that's a lot of bandwidth if you need to accept and drop; and of course, the more bandwidth people get so they can weather attacks, the more bandwidth that can be used to attack others if it's not well secured.
I'm not very familiar with DDoS protection strategies. Can you please elaborate on what is meant in (c) by "make resource consuming responses require resource consuming requests"?
Making people log in before doing a search is a common example for forums. Search is hard; unauthenticated search will bring low-end forums down, so they make you create an account and log in.
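As a toy illustration, the gate can be as simple as "no account, no expensive query", plus a cheap per-account budget. The threshold and in-memory bookkeeping here are invented for the example:

    # Gate an expensive endpoint behind login plus a per-account rate budget.
    import time

    RATE_LIMIT = 5      # expensive searches per minute per account (made up)
    _recent = {}        # username -> timestamps of recent searches

    def search_allowed(user):
        if user is None:
            return False                      # anonymous: no expensive search
        now = time.time()
        window = [t for t in _recent.get(user, []) if now - t < 60]
        if len(window) >= RATE_LIMIT:
            return False                      # over budget this minute
        window.append(now)
        _recent[user] = window
        return True

    print(search_allowed(None))      # False
    print(search_allowed("alice"))   # True

The exact numbers don't matter; the point is that the attacker now has to pay something (create accounts, get through signup, spread requests out) before they can make your server do the expensive work.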
Despite them actually mixing it into their entropy pools, the lava lamps are still entirely for show. The noise of the camera sensor itself is going to contribute orders of magnitude more entropy than the slow movement of the lamps. It's not completely a fake stunt, but it's certainly headline-optimized.
> Appears that the router in Atlanta announced bad routes (effectively a route leak). Only impacted our backbone. Not all of our PoPs are connected to our backbone, so some would not have seen an issue. Appears to have impacted about 50% of our traffic for a bit over 20 min.
I personally find sites like https://outage.report or https://downdetector.com which tally up the number, regions and history of people saying it isn't working for them more conclusive.
That page still shows "Cloudflare System Status: All Systems Operational" for me, but it's definitely down for me. Along with 1.1.1.1, which is... bad.
Same. Even then, a bunch of sites are down. Maybe only ones behind Cloudflare? So far I've been trying to hit the various down-detector sites and none of them will load. Google, Reddit, and Hacker News are all fine.
Same. Had to go through the entire troubleshooting process --- is it my internet connection? my DNS resolver? firewall? ISP starting to filter DoT queries somehow?
Only last in my mental list was the possibility that Cloudflare would be down.
Hope they publish a detailed post-mortem. It's always fun to read (but certainly very painful for those directly involved in writing it).
Monday Morning RCA: "We pushed out some routine code updates, but this really weird thing happened causing a resource utilization spike on our DNS systems. Because of this other really weird thing, this affected all of our global infrastructure simultaneously. Here's a deep engineering dive into this one weird thing that brought everything down."
Cloudflare's DNS (1.1.1.1) is failing to respond to most/all queries, which I'm observing as the root cause of a bunch of connection issues (name lookup failure).
Interestingly the same domains don't show up on google's (8.8.8.8) DNS at all.
Lol, talk about timing. I'm currently working on a TLS library and was pulling my hair out trying to figure out why tests against CF sites suddenly failed. Can't even ask my cohorts on Discord because they are behind CF, too!
"StackOverflow devs have the most difficult job in the world. After all, when StackOverflow is down, they can't exactly look for help on StackOverflow".
No kidding! We had literally deployed a major page redesign and started watching our analytics drop off on its way to zero. My heart is still racing. I wouldn't normally be happy about a Cloudflare outage, but in this case it's better than Google deciding to remove us from their index.
Reminder for firefox users: Firefox uses DNS over HTTPS and the default is cloudflare. If you're having DNS issues, you need to disable it until cloudflare is back up.
Good DNS practice (at least when I did system admin 10 years ago) was ALWAYS having a secondary at some other location/network. Why do we just put some info in Cloudflare and call it good these days?
It's hard to use Cloudflare as a reverse proxy without using them as your delegated name servers (maybe you can use CNAMEs on paid plans?), and fancy dynamic nameservers make it hard to run secondary servers with zone transfers.
What luck, I chose today to install a new piece of network gear. I thought I had managed to totally FUBAR my network. DNS was failing, "ping 1.1" (my current go-to "Am I connected to the internet?" test, as it requires the fewest keystrokes and hits the Cloudflare DNS 1.0.0.1) failed, and I just assumed it was my fault. Backed out my changes, and discovered that, in fact, the internet was down.
Yeah, I was waiting for it to fix itself, then tried my cell phone and realized that was down too. I assumed it was an issue with regional/backbone routing or something. Especially because status pages, which I wouldn't expect to be hosted on AWS (because of the need for status pages to stay up when AWS goes down), seemed to also be down. Didn't realize it could be Cloudflare...
Ditto, except vice versa. My machine is set to the router, which uses Cloudflare. Other machines use whatever is default for macOS (I try not to touch those). Once I realized they were working and I could access the internal network from outside, I started diagnosing DNS. Came here via 8.8.8.8.
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;discordapp.com. IN A
;; ANSWER SECTION:
discordapp.com. 140 IN A 162.159.135.233
discordapp.com. 140 IN A 162.159.129.233
discordapp.com. 140 IN A 162.159.130.233
discordapp.com. 140 IN A 162.159.134.233
discordapp.com. 140 IN A 162.159.133.233
I noticed a lot of packet loss to 1.1.1.1, not an outright "outage", maybe they were rolling a deployment?
Edit: Looks like a deployment to me (looking at the logs I could see cascading traces, so it took down one DC and the other started responding - increased latency - and then down, etc..), gonna be an interesting post-mortem!
"Update - This afternoon we saw an outage across some parts of our network. It was not as a result of an attack. It appears a router on our global backbone announced bad routes and caused some portions of the network to not be available. We believe we have addressed the root cause and are monitoring systems for stability now.
Jul 17, 22:09 UTC" - https://www.cloudflarestatus.com/
What I found particularly interesting is that my MacBook Pro (work laptop) didn't start up properly anymore and I wasn't able to start applications... sorry, but wtf. Now I hate Apple and their shitty, overpriced products even more.
What was interesting and scary is that our monitoring system didn't notify us. Our email was down because we use Cloudflare for DNS, and our monitoring provider's SMS gateway was down, so we didn't get SMS messages.
This is likely your computer's DNS resolver (if you're using 1.1.1.1, you're down); I'd switch to 8.8.8.8 temporarily. We've had PagerDuty alerts coming in since the start (a whole bunch of DNS errors from Pingdom), and when I click on the Slack link, PagerDuty works for me.
Interestingly this seemed to only affect resolver service. I use Cloudflare pretty extensively on all my sites, but only in DNS mode (no CDN / proxy). The hosts continued to resolve fine during the outage (following root DNS resolution chain, no recursive resolver involved). I imagine their CDN internally uses their resolver service which explains the outages, and some unrelated 3rd parties who don't use CF on their domain at all still created a hard dependency on CF by using their recursive DNS server.
It's unfortunate that both the primary and secondary cloudflare DNS is down. I just switched my secondary to google.
This allows my internet to "work" during this time, but adds about 1s latency to resolutions. Presumably that's the time it takes my internal DNS resolver to try the secondary.
Consider running your own full resolver like Unbound. Then you don't have to rely on a DNS provider like Google or Cloudflare. It's really nice not having the whole internet go down when Google or Cloudflare DNS is down.
I don't use Cloudflare, but I do notice Cloudflare services being down.
Right now, I can't get to my own website (hosted on DigitalOcean, not through Cloudflare), but Oh Dear claims it's up. So I suspect that the problem is closer to me than it is to DigitalOcean (or Cloudflare).
Aside from this one issue, is switching to 1.1.1.1 a good idea in your guys' experience? I just realized I have the DNS from my ISP, which is probably how they inject bullshit 404 pages full of ads. What is the fastest/best public DNS in your guys' experience?
I've been using https://nextdns.io/ - it works fast and, most importantly, blocks a bunch of ads (user-configurable), so it makes browsing on mobile much nicer.
For us, it's cloudflare. Our ISP is connected to KCIX, and cloudflare apparently has 1.1.1.1 servers in Kansas City. No other free DNS provider is as quick, for us. 18ms or so RTT, as opposed to the WISP's internal latency of ~10ms. Central Kansas.
Not for me, `dig @1.1.1.1 google.com` is returning SERVFAIL still. Their anycast config may be broken in some way (ie. the backends for some regions are down, but still advertising routes)
I thought my issue was with Comcast, then I realized I'm using CF's DNS entries for my home network. I removed those 1.1.1.1 entries and some sites are working.
Dang, I'm pretty disappointed in CF. I've never experienced a DNS outage with this much of an effect.
This was especially bad because I use Cloudflare public DNS exclusively at my house and it went down as well. I didn’t even think to check DNS, I just assumed it was AT&T being shiest AT&T.
I should probably run a blend of 1.1.1.1 with 8.8.8.8 instead.
Last I read about 7 million hosts are behind Cloudflare. Maybe around 3% of the web, but who knows if that counts for critical assets etc rather than pages served.
Shameful that so much of our decentralised web is so centralised and breakable in one place.
Just got a whole bunch of alerts that my services are down. Tried logging into Digital Ocean (who it seems uses Cloudflare) to get it fixed. Could not access their dashboard to reroute things.
I'm surprised so many people still use them. They took my business down (along with half the internet) a few years ago, and I learned that they were too large a point of failure.
Vercel also appears to be dropping out and coming back intermittently over the last 30 minutes or so. I'm not aware that they use Cloudflare, although they do mention using AWS.
NextDNS got taken out by this; I'd been really happy with it up until now. And unfortunately "DNS service went down" has a wide enough blast radius at home now that it's a real pain.
Most of my devices went belly up, and I was trying to figure out what it might be (I run NextDNS on my router). I switched over to cell, noticed Discord was down too, and started thinking about NextDNS. I toggled DNS to Google and noticed it immediately worked.
This is great. I already have bad enough internet (rural area with 3-to-6-digit latency, averaging 4 digits, and barely a few kilobytes of speed), and having both Google smearing their reCAPTCHAs everywhere (which are not really friendly toward low-speed internet / non-Chrome users) and Cloudflare proxying half the internet but lately not doing a great job at keeping consistent uptime does not help much.
At least I am glad HN exists; it is the only thing that loads everywhere.
I was trying to play video games but couldn't connect. Amazing how connected the web is now - one big hub goes down and brings the whole house of cards down with it.
Did anyone else see their ATT internet go down? The DNS issues started and then the Pace 5268AC rebooted. I don't use cloudflare for dns. Does ATT's backend?
On the contrary, ATT actually squats on the CloudFlare DNS IP address. IIRC that modem is one of the affected ones where it uses 1.1.1.0/24 internally. You shouldn't even be able to use CloudFlare DNS normally.
Yesterday CloudFlare took down some of our products because they (not us) misconfigured some DNS thing. Kind of funny to see it happen again a day later.
Actually, the site is running but not accessible due to the issue. Glad you got the heads up; after a while I had to pause monitoring to prevent side effects.
How can I have Cloudflare plus something else as a DNS failover? We are afraid to set a long TTL and then have our IP change for some reason. What do you guys recommend?
Given that the US is basically in a non-shooting war with China, I wonder if this is something technical or part of some kind of attack.
Something that I’d keep in mind.
There are enough ways for bits of the Internet to go kablooey on their own that “it’s an attack!” is a pretty big jump to a conclusion. If this turns out to be something other than Cloudflare tripping over a weird bug, my first guess would be that someone fat-fingered a BGP table yet again.