Cloudflare was down (cloudflarestatus.com)
926 points by dewey on July 17, 2020 | 476 comments



Discord is entirely down right now, both the website and the app itself. Amusingly, a lot of the sites that normally track outages are also down, which made me think it was my internet at first. Downdetector, monitortheinternet, etc.

Lots of other big sites that are down: Patreon, npmjs, DigitalOcean, Coinbase, Zendesk, Medium, GitLab (502), Fiverr, Upwork, Udemy

Edit: 15 min later, looks like things are starting to come back up


Hacker News is an excellent status page for those cases.


Out of curiosity, does HN use any CDN or other form of DDoS protection? dang?


Their DNS record points to only one IP (209.216.230.240), which belongs to M5 Computer Security
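
For anyone curious, a quick check along these lines (assuming dig and whois are installed) shows the single A record and who the address block is registered to:

    dig +short news.ycombinator.com A
    whois 209.216.230.240 | grep -iE 'orgname|netname'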


It's a hosting company in San Diego: https://www.m5hosting.com/

I host a dedicated server there (running https://www.circuitlab.com/) and when I traceroute/ping news.ycombinator.com, it's two hops (and 0.175 ms) away :)


ówò


[flagged]


This was a BGP/routing issue and has already been documented. Please don't spread misinformation and hysteria, especially on technical issues like this

https://twitter.com/eastdakota/status/1284253034596331520


[flagged]


It was a misconfiguration we applied to a router in Atlanta during routine maintenance. That caused bad routes on our private backbone. As a result, traffic from any locations connected to the backbone got routed to Atlanta. It resulted in about 50% of traffic to our network to not resolve for about 20 minutes. Locations not connected to our backbone were not impacted. It was a human error. It was not an attack. It was not a failure or bug in the router. We're adding mitigations to our backbone network as we speak to ensure that a mistake like this can't have broad impacts in the future. We'll have a blog post with a full explanation up in the next hour or so — being written now by our CTO.


You should really read up on BGP. It really is that flimsy.


It's not a failure, it's a misconfiguration. Being resilient to fat fingers is a totally different issue from being resilient to fires.


I wouldn't count on Keemstar as a reliable source of cyber-attack coverage


It was no such thing. A Cloudflare router advertised some bad routes.


Keemstar? Not reliable source. The dude has no idea what he is talking about.


Don't think so. They used to use Cloudflare but stopped. To my knowledge, it's a single server without a database (using the filesystem as a database).


So HN is serving 5.5M page views daily (excluding API access) on a single server, without a CDN and without a database?

Holy crap, I am thinking either there is some magic or everything we are doing in the modern web is wrong.

Edit: The number is from Dang [1]

>These days around 5.5M page views daily and something like 5M unique readers a month, depending on how you try to count them.

[1] https://news.ycombinator.com/item?id=23808787


>Holy crap, I am thinking either there is some magic or everything we are doing in the modern web is wrong.

Spin up an Apache installation and see how many requests you can serve per second if you're just serving static files off an SSD. It's a lot.
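
A minimal sketch of that experiment, assuming apache2-utils (for ab) and a static index.html already being served locally; even a modest VPS will typically report thousands of requests per second for a small static file:

    # 10,000 requests, 100 concurrent, against a local static file
    ab -n 10000 -c 100 http://127.0.0.1/index.html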

edit: I see that there are already a bunch of other comments to this effect. I think your comment is really going to bring out the old timers, haha. From my perspective, the "modern web" is absolutely insane.


> From my perspective, the "modern web" is absolutely insane.

Agreed.

I was brought up as a computer systems engineer... So, not a scientist, but I always worked with the basic premise of keeping it simple. I've worked on projects where we built all the fangled clustering and master/slave (sorry to the PC crowd, but that's what it was called) stuff but never once needed it in practice. Our stuff could easily handle saturated gigabit networks with the 2-core CPU only running at 40%. We had CPU to spare and could always add more network cards before we needed to split the server. It was less maintenance, for sure. It also had self healing so that some packets could be dropped if the client config allowed it, if the server decided it wanted to (but it only ever did on the odd dodgy client connection).

That said, I was always impressed by Google's map-reduce for search results (yes, I know they've moved on), which showed how massive systems can be fast too. It seemed that the rest of the world wanted to become like Google, and the complexity grew for the standard software shop, when it didn't need to imho.

I jumped ship at that point and went embedded, which was a whole lot more fun for me.

Sincerely, old timer


[flagged]


How about we spend our energy fixing systemic/institutional racism first, because language will follow quite naturally.

The other way around surely doesn't work, and is just symbolic gestures without actual change.


There is only so much I can do - I'm not American, and that's a change I can't make on my own, aside from doing my best to be an ally when possible.

However, I can open a few PRs and use some of my time to make that change. It's a minor inconvenience to me, and if it makes even one black person feel heard and supported then yeah, I'm gonna do it.


>I think your comment is really going to bring out the old timers, haha.

That is great! :D

>It's a lot.

Well yes, but HN isn't really static though. It's fairly dynamic, with a huge number of users and comments. But still, I think I need to rethink a lot of assumptions in terms of speed, scale and complexity.


Huge numbers of users don't really mean that much. Bandwidth is the main cost, but that's kept low by having a simple design.

Serving the same content several times in a row requires very few resources - remember, reads far outnumber writes, so even dynamic comment pages will be served many times in between changes. 5.5 million page views a day is only 64 views a second, which isn't that hard to serve.

As for the writes, as long as significant serialization is avoided, it is a non-issue.

(The vast majority of websites could easily be designed to be as efficient.)


There is some caching somewhere as well, probably provides a bit more boost.

I've been at my work laptop (not logged in) and found something I wanted to reply to, so I pulled out my phone and did so. For a good 10 seconds afterwards, I could refresh my phone and see my comment, but refresh the laptop and not see it.


> From my perspective, the "modern web" is absolutely insane.

You know, it should be even better than it was in the past, because a lot of heavy lifting is now done on the client. If we properly optimized our stuff, we could potentially request tiny pieces of information from servers, as opposed to rendering the whole thing.

Kinda like native apps can do (if the backend protocols are not too bloated).


> Holy crap, I am thinking either there is some magic or everything we are doing in the modern web is wrong.

It doesn't need to be crazy.

A static site on DigitalOcean's $5 / month plan using nginx will happily serve that type of traffic.

The author of https://gorails.com hosts his entire Rails video platform app on a $20 / month DO server. The average CPU load is 2% and half the memory on the server is free.

The idea that you need some globally redundant Kubernetes cluster with auto fail-over capabilities seems to be popular but in practice it's totally not necessary in so many cases. This outage is also an unfortunate reminder that you can have the fanciest infrastructure set up ever and you're still going down due to DNS.


> The idea that you need some globally redundant Kubernetes cluster with auto fail-over capabilities seems to be popular but in practice it's totally not necessary in so many cases

True, but this is why it shouldn't be bashed either. When you need it, you need it (cue very complex enterprise applications with SLA requirements).


> True, but this is why it shouldn't be bashed either. When you need it, you need it (cue very complex enterprise applications with SLA requirements).

To support this, look at how many people criticize Kubernetes as being too focused on what huge companies need instead of what their small company needs. Kubernetes still has its place, but some people's expectations may be misplaced.

For a side project, or anything low traffic with low reliability requirements, a simple VPS or shared hosting suffices. Wordpress and PHP are still massively popular despite React and Node.js existing. Someone who runs a site off of shared hosting with Wordpress can have a very different vision about what their business/side project/etc will accomplish compared to someone who writes a custom application with a "modern" stack.


Modern web is a completely broken mess.

We were serving around that traffic off a single dual pentium 3 in 2002 quite happily off IIS/SQL Server/ASP. The amount of information presented has not grown either.

That little box had some top tier brand main corporate web sites on it too and was pumping out 30-40 requests a second peak. There was no CDN.


You were not serving that traffic, you were just serving your core functionality - no tracking, no analytics, no ads, no a/b, no dark mode, no social login, no responsiveness. Are most of those shitty? Sure, just let me know when you figure out how to pry a penny from your users for all your hard work.


Oh no, not the dark mode! The sacrifices we have to make for performance I guess...


Easy. We built something that was worth money without all that.

Not a one trick marketoid pony.


Dark mode and responsive webdesign are both good for the user and efficient for the server and user's device.


That means an average of about 63 pages per second. Let's say the total number of queries is tenfold that, take a worst-case scenario and round up to 1,000 queries per second, then multiply by ten again to get 10k queries per second, because why not.

I don't know what the server's specs are, but I'm sure it must be quite beefy and have quite a few cores, so let's say that it runs about 10 billion instructions per second. That means a budget of about one million instructions per page load in this pessimistic estimate.

The original PlayStation's CPU ran at 33MHz and most games ran at 30fps, so about 1 million cycles per fully rendered frame. The CPU was also MIPS and had 4KiB of cache, so it did a lot less with a single cycle than a modern server would. Meanwhile the HN server has the same instruction budget to generate some HTML (most of which can be cached) and send it to the client.

A middle of the line modern desktop CPU can nowadays emulate the entire PlayStation console on a single core in real time, CPU, GPU and everything else, without even breaking a sweat.

>Holy crap, I am thinking either there is some magic or everything we are doing in the modern web is wrong.

Magic, clearly.


That's only ~60 QPS, assume it is peaky and hits something more like 1000 QPS in peak minutes, but also assume most of the hits are the front page which contains so little information it would fit literally in the registers of a modern x86-64 CPU.

Even a heavyweight and badly written web server can hit 100 QPS per core, and cores are a dime a dozen these days, and storage devices that can hit a million ops per second don't cost anything anymore, either.


In-memory databases? That's amateurish. Time to make a service that runs out of ymm registers.


Not sure where your 5.5M number came from, but that's only 64 requests per second.

90 to 99% of those are logged-out users, so fully cacheable.

Only a handful of dynamic requests each second remain.


Unlike Reddit, logged in and logged out users largely see the same thing. I wouldn't imagine there is much logic involved in serving personalized pages, when they don't care who you are.


The username is embedded in the page, so you can't do full caching unfortunately. But the whole content could be easily cached and concatenated to header/footer.


You also can't do that every time because of hidden, flagged, showdead, etc.


That could be split as a setting in the user-specific header with the visibility part handled client-side. A bit more work, but it's not impossible if it's worth it.


Oh! I have never looked an HN frames, etc., but I assumed the header was separate. Thank you!


It's amazing how little CPU power it takes to run a website when you're not trying to run every analytics engine in the world and a page load only asks for 5 files (3 of which will likely already be cached by the client) that total less than 1 MB.


It is certainly doable; PoF famously served a lot of page views off a single IIS server for a long time.

HN is written in a Lisp variant and most of the stack is built in-house; it is not difficult to imagine efficiency improvements when many abstraction layers have been removed from your stack.


I don't remember PoF being famous for that, but they got a lot of bang for the buck on their serving costs.

What I do remember, is that it was a social data collection experiment for a couple of social scientists, that never originally expected that many people would actually find ways to find each other and hook up using it.

I miss their old findings reports about how weird humans are and what they lie about. Now, it's just plain boring with no insights released to the public.


POF was also run by a single dude until he sold it for some 600 million dollars!


There are a couple of old posts about it on the hpc blog.

Nick Craver from SO also once mentioned that they could run SO off a single server; while it wasn't fun, it was doable and had happened at some point.


For all my sibling comments, there is also context to be aware of. 5.5M page views daily can come in many shapes and sizes. Yes, modern web dev is a mess, but the situation is very different from site to site. This should be taken as a nice anecdote, not as a benchmark.


You can serve a lot of flat files from a properly configured server in 2020. It's just that most people don't bother trying.


You don’t need a CDN if what you’re serving up in this case is mostly all text.

Just need good stable code and server side caching.


Back in 2000 a joke project of mine got slashdotted. Ran outta bandwidth before anything else.


With DO, these days, they don't run me out of bandwidth, but my instance falls over (depending on what I am doing - which ain't much). With AWS, they auto-scale and I get a $5000 bill at the end of the month. I prefer the former.


Yeah, the bandwidth overage was a grand, give or take. It was a valuable lesson in a number of ways, and why for any personal things I wouldn't touch AWS with a shitty stick.


AWS only auto scales if you configure it to...


Yeah, but I'm stupid and didn't understand all the switches.


S3 will scale as much as is needed by itself


A lot of modern web technology is inefficient for the sake of being ergonomic. Here's what Hacker News looks like: https://github.com/arclanguage/anarki/blob/master/apps/news/...


From an old-school lisper perspective, the code seems perfectly ergonomic to me.

It's ergonomic in a very lispy way but perfectly reasonably so from the POV of that aesthetic.


Well, it's not magic. So, the other one.


I had this same reaction. Definitely feels like most of what we’re doing with “the modern” web is probably wrong.


That's why we have these bloated, over-engineered multi-node applications: people just underestimate how fast modern computing is. Serving ~2^6 requests/sec is trivial.

It's easily served by a simple server.


1M queries per day is ~10 queries per second. It's a useful conversion rate to keep in mind when you see anyone brag about millions of requests per day.
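
A back-of-the-envelope version of that conversion, as plain shell arithmetic:

    # requests per day divided by seconds per day
    echo $((1000000 / 86400))   # ~11 requests/second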


That number is not very big.

I used to host a Wordpress site that had 5M pageviews a month on a $10 (and later $20) DigitalOcean instance.

That's Wordpress on a shared VPS. I imagine it could be a lot higher with a dedicated server and self-written software.


You have to remember there's a lot of seconds in a day, that's only 60 qps.


HN could probably be served by Python running on a fancy laptop.


Everything we're doing is wrong.


Flat-file DBs and mountable DB file systems are the future.


Man, if this is true, these guys have steel balls.


We've been saying this for a while now...


I don't get this comment, what does page serving performance have to do with "the modern web"? It's not as if serving up a js payload would make it more difficult to host millions of requests on a single machine, html and js are both just text.


Makes sense based on what I've read about Arc, which HN is written in.

I've been working on something where the DB is also part of the application layer. The performance you can get on one machine is insane, since you spend minimal time on marshalling structures and moving things around.


"They used to use Cloudflare but stopped."

They are still using Cloudflare. Unlike CF, M5 does not require SNI.

   curl --resolve news.ycombinator.com:443:104.20.43.44 https://news.ycombinator.com


It's hosted on AWS, you don't need DDoS protection, just a big wallet.


That's not even slightly true.

Their IP belongs to AS21581, which is registered to a company called 'M5 Computer Hosting' out of the west-coast USA.

m5hosting.com

The last hop is Santa Barbara.

Definitely does not fall in the AWS ranges.


Hacker News is self-hosted: https://news.ycombinator.com/item?id=22767439. Let me see if I can find a better link where the specs are discussed.


Only if you use their "infinitely scaling" services, e.g. S3. If the attacker is hammering you with expensive queries and your database is on one EC2 server, you're still going to go down.


The nameserver is AWS, but the IPv4 address points to "M5 Computer Security".


My iPhone actually popped up a message saying that my wifi didn't appear to have internet, which was strange and obviously false as I was actively using the internet on it and the laptop next to it, but now it makes sense that it must have been pinging something backed by cloudflare!


Discord attempted to route me to: everydayconsumers.com/displaydirect2?tp1=b49ed5eb-cc44-427d-8d30-b279c92b00bb&kw=attorney&tg1=12570&tg2=216899.marlborotech.com_47.36.66.228&tg3=fbK3As-awso

(Visit at your own risk.)

Hack?


I'd be looking at your browser extensions or malware (if you use the Discord app).


Sure you didn't misspell discord?


I've never even heard of the site before. Nor have I searched for "attorney" any time recently.


We operate that site and are using Cloudflare to prevent DDOS attacks. Probably some sort of hash collision...


Crazy stuff!


It looks like you mistyped it and landed on a domain with spammy redirects. They have all kinds of weird URLs and there's not always any connection to anything you did other than go to the wrong domain.


I didn't mis-type. I press 'd' and Firefox fills-in the site for me.

If I type 'e' I get 'en.wikipedia.org'.

I was redirected.


How can this be reproduced?


Same here!

I even checked to see if an AWS region was down once I realised it wasn't on my side (I thought it might have been my ISP's DNS servers or something).

The next move was to check Hacker News - thankfully it's not also hosted on Cloudflare, ha!


I noticed discord being down, so I went to check downforeveryoneorjustme, also down. So I figured I'd check NANOG mailing list, also down :P


Yep, we were down completely. We are quite dependent on Cloudflare (frontend + dns).



And that is why you host your status page on separate infra.


We're dealing with a deeper-level problem here. Since a lot of the internet relies on Cloudflare DNS in one part or another, even many backup solutions fail. With so much of DNS centralised in so few services, such outages hit the core infrastructure of the internet.


A sudden disruption on a large number of services for everybody at once doesn't look like a DNS problem to me, with all the caching in DNS. It would fail progressively and inconsistently.


DNS absolutely was an issue. I changed DNS manually from Cloudflare's 1.0.0.1 (which is what DHCP was giving me) to 8.8.8.8 (Google) and most things I'm trying to reach work. There may be other failures going on as well, but 1.0.0.1 was completely unreachable.
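
An easy way to see the difference at the time would have been to query both resolvers directly, assuming dig is installed (example.com is just a placeholder domain):

    dig @1.0.0.1 example.com +short   # Cloudflare resolver (was unreachable)
    dig @8.8.8.8 example.com +short   # Google resolver (still answering)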


I don't use Cloudflare DNS but Google DNS, and got the same problems as everyone else.

The problem seems to have been resolved now, you might have made the change when they fixed it.


No, I changed the setting back and forth while it was down to confirm that the issue was that I could not reach 1.0.0.1. All the entries I tried from my host file were responsive (which is how I ruled out an issue with my local equipment initially and confirmed that it wasn't a complete failure upstream -- I could still reach my servers). Changing 1.0.0.1 to 8.8.8.8 allowed me to reach websites like Google and HN, and changing back to default DNS settings (which reset to 1.0.0.1, confirmed in Wireshark) resulted in the immediate return of failures. 1.0.0.1 was not responsive to anything else I tried.

Again, it may not have been the only issue -- and there are a number of possible reasons why 1.0.0.1 wasn't reachable -- but it certainly was an issue.


> I don't use Cloudflare DNS but Google DNS, and got the same problems as everyone else

Cloudflare is also the authoritative DNS server for many services. If Cloudflare is down, then for those services Google's DNS has nowhere to get the authoritative answers from.


Except the services mentioned in the original post have a TTL of 1h. Unlikely they would all go down at the same time.


status.discord.com 5 minutes


Given that TTLs are usually very short now, if your DNS server is configured correctly then caching shouldn't make much of a difference.


I looked at the examples above; Patreon, DigitalOcean, Coinbase and GitLab (haven't checked the others) all have a TTL of 1h.
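
For reference, the TTL is the second column of dig's answer section, so it's easy to check (assuming dig):

    dig +noall +answer patreon.com A
    # the number after the name is the TTL in seconds (3600 = the 1h mentioned above)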


pr0 tip: don't set your DNS TTL to 5 min (status.discord.com does)


Yeah, but then your status page provider switches THEIR provider, and suddenly you are on the same infra again.


It's hosted by statuspage.io, and their own status page was also down (metastatuspage.com). It is now back up, but their page shows the outage.


Status turtles all the way down, it seems!


Works for me.


I get a DNS servfail when resolving that DNS record, and many others:

    Server:    8.8.8.8
    Address:   8.8.8.8#53
    
    ** server can't find status.discord.com: SERVFAIL
So it's either not just cloudflare, or all those sites use cloudflare to host their DNS.


The latter.

    $ dig +short discord.com ns
    sima.ns.cloudflare.com.
    gabe.ns.cloudflare.com.


8.8.8.8 is Google and is down in addition to Cloudflare. 8.8.4.4 was still up.


8.8.8.8 is a caching resolver and it wasn't down. Do you understand how caching resolvers work?


Same. It has a big banner saying CloudFlare is down.


Works for me too... Australia


I can live without creepy instant messengers, but it's shocking just how much everything else relies on one central system. And furthermore, why is it always Cloudflare?


Cloudflare is free and has a nice UI. I manage ~40 domains from ~six domain registrars through it, the consistency is great. The caching is a bonus.


idk about the UI; the redesign with the forced collapses and extra clicks everywhere is annoying when handling a multitude of subdomains plus their Let's Encrypt TXT entries. I'm there mostly for the freebies.


Discord confirmed Cloudflare is also the reason they're down: https://twitter.com/discord/status/1284237737638461453


Doordash too. My first order. On my wife's birthday.

Ah, well. This too shall pass.


I did a pickup order a few weeks ago when, of all things, T-Mobile SMS went down for 3+ hours. I couldn't go in the restaurant (covid) and I couldn't text them the parking # I was sitting at in a packed parking lot. I got a flood of about 50 texts a few hours later. Sat there for about an hour waiting for a $9 sandwich. I have no idea if they didn't get my order until late, or if they finally realized it was me or what. About 45 minutes in I decided to just give up on the day and take a nap, and woke up to a door knock.


Kudos to the people at Discord. Just a few minutes after I got disconnected they already tweeted about the issue. Some minutes later and they have a message in their desktop app confirming it's an issue with Cloudflare. All while Cloudflare's statuspage says there are 'minor outages'.


Every company rushes to report an outage when they can blame another vendor. Well, that might be hyperbolic, but it sure is a lot easier!


As a percentage of total traffic, a 'minor' outage for Cloudflare probably equates to a significant outage for a non-trivial amount of the internet.

It will also be especially noticeable to end-users, because sites using Cloudflare are typically high-traffic sites, and so a 'minor' issue that affects only a handful of sites is still going to be noticed by millions of people.


"Your nines are not my nines" [0]

[0] https://rachelbythebay.com/w/2019/07/15/giant/


I wonder if they are all using Cloudflare's free DNS stuff or if they're paying for business accounts?

My stuff is on Netlify (for the next week or so) and the rest is on a VPS bought from a local business who isn't reselling cloud resources. I'm kinda glad I moved all my stuff from cloudflare.


I think it's going to be everyone. Some of my free sites are dead, but also huge enterprise Cloudflare users (Discord/Patreon/4chan) are also dead.


> Amusingly, a lot of the sites that normally track outages are also down, which made me think it was my internet at first.

That is why if you have this question, you should go to google.com

My guess is that there are more resources invested in making sure google.com stays up than for any other site on the internet.


Depending on what part we're talking about, it varies. But yeah, just a few.


Crazily, my local name resolution started failing, because I have these name servers: 192.168.0.99, 1.1.1.1 and 8.8.8.8. The first does the local resolution, but macOS wasn't consulting it because 1.1.1.1 was failing?? Crazy. When I removed 1.1.1.1 from the list, everything started working.


DNS over HTTPS might bypass your local nameservers.


What was failing was ssh to a local host. I can't imagine that Brew's ssh uses DNS over HTTPS.


Yup


Thought something like this was going on. At first I thought it was my router and restarted everything - to no avail. Glad to see confirmation that it wasn't an issue on my end.


Discord works for me, but https://redbubble.com/ prints "Service unavailable".


Freenode's IRC servers were down which was unexpected for me. I was expecting old-school communication networks to not have a dependency on Cloudflare.


I've had no connection interruptions to the three IRC networks I'm connected to. Freenode, EFnet and Hackint.

I loathe Discord, and I can barely contain myself with schadenfreude at this news.


I also had no issues connecting to a few different networks, including Freenode and EFnet.


Ironically, downdetector.com is also down.


Who watches the watchmen?


Would IRC be down?


Same for the German downtime trackers.


Same here. I checked whether Hacker News still works and saw this.


It really defies the original vision of the internet to have so many services depend on a single company. Almost every news site I was reading dropped off at once. I thought for a second that I lost internet in my own house.


Yes, it's really odd that core backbone providers can go down and everything works like it's supposed to. Even trans-Pacific cables can be cut and things will usually work with only increased latency. But there is not much redundancy for many companies at this layer; having redundant DNS providers is, I'm sure, possible, but not something we think about very often, and of course many of the sites that are down are depending on the proxy and DoS mitigation services.

On my home network I use Google as a backup DNS provider so the whole internet didn't go dark for me, but I don't have a backup DNS host for my company's DNS records.


Redundant DNS is possible, but challenging when you're making use of features like geo DNS that don't lend themselves to easy replication via zone transfer.
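
For a plain zone, the standard replication path is an AXFR from the primary; a quick way to check whether a secondary could pull the zone (hypothetical name server and domain) is:

    # request a full zone transfer from the primary (must be allowed by its ACL)
    dig @ns1.example.com example.com AXFR

Geo/latency-based records live outside ordinary zone data, which is exactly why they don't replicate this way.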


I imagine most people would never expect something like this to happen, so having a fallback option when Cloudflare has a huge interruption of service like this is just unthinkable.


All the major cloud infrastructure providers have had outages of varying severity at one point or another...it's something you'd want to take into account for, say, a system that remote controls life-critical devices, but likely isn't worth the engineering time and added complexity for a productivity or social app with a small userbase. Working on many of the latter over the years I've generally said "well if {major cloud provider} is down, the internet is going to be all messed up for a bit anyway, so we'll accept the risk of being down when they're down, and reassess whether that keeps making sense as we grow."


>"well if {major cloud provider} is down, the internet is going to be all messed up for a bit anyway, so we'll accept the risk of being down when they're down, and reassess whether that keeps making sense as we grow."

This is a very common pattern and falls into the 'nobody got fired for buying cisco/microsoft/intel' trap.

I have two issues with it;

1) It entrenches the largest provider.

You would not extend the same leniency for outages to the third-best cloud provider; this means that people will just keep pushing the monopoly forward, even if the uptime or service is actually better on another provider.

2) You create a tight coupling of monocultures;

Simply put: You slowly erode the internet. Your site becomes an application in a distributed mainframe operated by a tiny minority of tech companies universally based in the US.

Why is this a problem? I could give moral answers here, but I think pragmatic ones are more convincing.

Giving ownership of the internet to the few gives them the ability to set the rules.

If you're on Amazon's AWS, what's to say they don't inspect your e-commerce systems and incorporate your business logic into their amazon.com shopping experience. They do this to their marketplace and create competing products already[0].

If you're doing really well, why not just drop a few packets here and there? I mean, they won't... you're paying, right?

Hell, if you do super well they can just change the rules and make it so your services get expensive in the exact way you use them, or even legislate you off the platform entirely.

Probably won't happen, but it's a lot of trust, you have to admit, and people shit on Apple for having that kind of power, and Apple is not even competing in the same market as most people on this site.

If you're on google cloud (which, I'm a fan of btw) and you feed an ML model, well, you paid for it but why shouldn't they also have a copy.. after all, it's green to do so! Their bigdata platform? Google loves data! Feed the beast.

[0]: https://fortune.com/2016/04/20/amazon-copies-merchants/


The trust you have to give AWS or other cloud providers is not that different than the trust you give any number of vendors (email service, phone service, etc.). You have a contract with them that says they won’t do those things, and if they ever get caught doing them all the on-premise enterprise they’re spending a lot of effort on getting on board will dry up instantly.

Amazon can copy your business model just fine without looking at your servers. Most of what’s on your servers is probably irrelevant from their perspective.


Agreed, but the real problem is DDoS and nobody seems to know how to globally solve it. Fighting DDoS is expensive, so you see consolidation. It's well and good to live in a tiny farming town but when raiders start attacking every week, those castle walls and guards start to look really appealing.


It's nice that Cloudflare provides their services for free but scrubbing has existed for a long time. With your own address space and an appropriate budget it's not difficult to have Cloudflare/Akamai/AWS announce your IP space with a higher weight than a direct path to your infrastructure. That will give you a little bit more fault tolerance for incidents like these.


That's what we get for externalizing costs. It's not hard to track down sources, but network operators usually let it be, hence the incentives are probably counter-productive.


Agreed, but I think people really underestimated the forces at work that would cause so much consolidation into a couple internet giants.

The original idea was that with the barrier to entry being so low, anyone and everyone could set up their own websites, mail servers, etc.

But with it being so easy to compare and contrast service (i.e. the market being so open), it means that the competitive forces naturally consolidate to a winner-take-all model. If when starting out Cloudflare was just 5% better than the competition, it could have easily taken the vast majority of the mindshare on the internet. Couple that with the fact that there are huge advantages with scale to a business like Cloudflare's, and it's not hard to see how so much of the internet has become dependent on it.


Same here. Rebooted the router and modem thinking it was me, but my phone was still on wifi, then I realized it was probably my Cloudflare DNS.


This! I got all sorts of alerts from pingdom and my laptop refused to get online. Pure Panic!


Yup, reinforces the thought that you should never have both DNS servers with the same service.


Pihole is your friend.


Yeah, Pihole made it super easy to cut over to Quad-9 once I figured out what the problem was.


Looks like I got another weekend project.


I have a pihole. It didn't help.


I consider DNS and the way how top level domains are handled to be one of the weakest parts of our current Internet design.

We REALLY need a truly decentralized, distributed DNS system that is not owned by private entities.


DNS is far less of a single point of failure and more decentralized than Cloudflare. Nameservers can be and are operated redundantly via simple, resolver-side round-robin scheduling, and the TLD servers should have longer TTLs that allow plenty of caching. The root zone even has anycast, thanks to using UDP. Take a moment to look at DoH and laugh.

You can also also register your domain on multiple TLDs.


DNS worked just fine throughout this. You're barking up the wrong tree.


https://handshake.org is pretty interesting.


The "decentralized internet" folks always talk a lot about fighting corporate control. I think they should spend more time talking about resiliency and blast-radius reduction.


I just recently ran across this. I wonder how much performance would be degraded.

https://ieeexplore.ieee.org/document/7530014/authors#authors

> Unlike previous DNS replacement proposals, D3NS is reverse compatible with DNS and allows for incremental implementation within the current system.


DNS is decentralized, it's just not when everyone goes with one big service.


It might be decentralized, but how do you actually get a .com domain name without going through some kind of corporate gatekeeping or paying a fee?


I'm down for passing around a GPG signed hosts2.txt file. Let's get started.


And the worst is that if you try to raise concerns about Cloudflare now, it gets brushed off as "CF already proxies half the internet; if it goes down, our stuff will be a minor concern".


That's true, but what's a free or low cost alternative for DDoS protection for a small webapp?


I don't understand why the big companies don't always have at least two CDN providers, so they can failover to another one if something like this happens.

I know a lot of big companies do, but I am always surprised when you see ones that don't.


The DNS itself is not as easy to duplicate across multiple providers; with CF DNS down, having a backup CDN wouldn't have helped.


This isn't true... you can certainly do redundant DNS with automatic failover between providers. Just set up NS records pointing to different providers.
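
A rough way to verify such a setup (example.com is a placeholder): the parent .com servers should return name servers from both providers, and resolvers will retry across them if one set stops answering.

    # ask a .com TLD server which name servers the domain is delegated to
    dig @a.gtld-servers.net example.com NS +short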


That's not easy, you need to set up replication.


Seems worth doing


My CRM was nonfunctional. That’s some critical infrastructure for me. And then I’m wondering, is it me or is it my CRM. Turns out it’s door #3 - cloudflare


Same here. I'm working at an auto parts store looking through ASE parts sites, and it was like, well, close up the store, the catalogs are missing right now.


"All systems operational"

What's the point of a status page if it doesn't reflect the real status...

It's either the status page goes down with everything else or the status page is wrong. Great.

EDIT: Looks like it's accurate now, 20 minutes later.


The point of the status page is so you can point to it for your five nines SLA and go "look? we were only down for one hour". As soon as the money relies on the metric, the metric will reflect the money.


Goodhart's Law[1] in action.

[1]https://en.wikipedia.org/wiki/Goodhart%27s_law


I've noticed everyone posts links in footnotes, is that just Hacker News etiquette?


It's old-school plain-text email etiquette. The original Markdown even used a variant of that syntax for its links definition, known as "reference links". https://daringfireball.net/projects/markdown/syntax#link


Yeah, I think so, since there isn't a way to make HTML links with short link text.


That's a good one, did not know that. Interesting read.


Despite their update, I like how they're saying only their recursive DNS had "degraded performance", while authoritative is "operational". The entire reason everything blew up was because their authoritative nameservers weren't responding.


IBM Cloud status is pretty much always green... although we have issues pretty much every week.


They’re still using Lotus Notes for the tracking.


There are status page providers that actually monitor services and automatically update. Cloudflare just doesn't use them.


Let's start a betting pool. How many upvotes do you think OP will get before the status page acknowledges a problem? I say it's going to be 600.


You lost. ;) 476 points, status page says it's down now.


As one would expect, it says "degraded performance" instead of "down" lol


Tested with tor and it's right. Some exit nodes aren't affected.


Hm, maybe it's just the SRE in me talking, but if major chunks of the internet being entirely inaccessible doesn't count as an "outage", what does?


I mean, I guess you're entitled to look at it that way, but I don't think it's dishonest of them to distinguish between "nothing is working" and "some regions aren't working".


There is no problem from the Asia side (well, it was in the wee hours, but my monitoring doesn't see any problem here from Singapore).


This post is getting something like 30 upvotes a minute…might want to up that a bit ;)


And it looks like they started "investigating" at around 450!


Ahh, I remember when AWS went down (I think it was 2 years ago now?), or at least a data center in us-east. The majority of the internet went down and the status page went down as well. Man, good times.


Status pages are a marketing channel not a channel for developers most of the time. It most likely has to go through some layers before someone updates the status page.


This is an Atlassian Statuspage status page, so it's not hosted by Cloudflare.


this-is-fine.gif


I don't think it's just Cloudflare; I just had a fun 10 minutes seeing servers start flipping on my Server monitoring service[1]. This has only happened once or twice per year, and is usually due to weird global DNS issues.

[1] https://servercheck.in/

(To give an update, I'm seeing from my monitoring systems (about 15 points around the globe) sporadic outages for Microsoft, Apple, Reddit, Bing, Node.js, Twitter, Yahoo, and YouTube. And my own servers (not behind CF at all) are also flipping up and down. It started around 21:14 UTC.)


A DNS issue wouldn't cripple all of the internet at once, with all the caching.


Most sites set the absolute minimum TTL for every record, for no reason. There’s a lot less caching than you’re thinking.


Eh, what? There are many good reasons to have low-TTL DNS... this exact outage being one of them. Update your records to go direct to your servers and not through Cloudflare, and bam, you're back up. Doesn't work if your TTL is 86400.


Doesn't help, as Cloudflare wants you to use their name servers, so you can't flip any records if the DNS itself is in trouble, like it is now.

And changing DNS servers often takes many hours (or days, if .net is involved apparently)


No I see some services failing that have a TTL of 1h.


It was interesting that we saw our domains affected from the USA but from Mexico everything looked OK.

The crazier thing is that I tried to log in to our Cloudflare account, and it never sent me the 2FA code... I still haven't been able to log in (Enterprise account).


This was a problem with our backbone network; wasn't caused by an attack. The effect was regional and not global. Naturally, we'll write it all up.


Was it a problem with a provider you use?


Looks like a problem with one of our large routers in Atlanta.


I am in Nashville, and lost 1.1.1.1 and 1.0.0.1 completely. Also one of my sites went down according to statuscake, but not all of them.


Looking forward to the writeup- were these IGP routes?


We were down (downforeveryoneorjustme.com) completely, but back up now (as of a few minutes ago). Our domain wasn't even resolving; we use Cloudflare for frontend and DNS.

We had a surge of people checking if Discord was down on our site, then I noticed everything went down shortly after. Discord is still the top check right now.

I can't ever remember hitting these kind of traffic numbers before.


Interesting data you get in the face of adversity, providing your host resolves!


Funny, I tried to use your site because the website I was trying to access stopped working. But your site was also down so I figured it was just my internet being cranky :/


I enjoy your service. Have you ever thought about expanding your offerings? I would love to see a recreation of "Internet Pulse"


Thanks! Yep, we have a lot of things on the todo. We want to add more user-focused / location-based outage information since our site is still too reliant on simple HTTP checks to report downtime. This is especially a problem with a Discord outage, for example, where the frontend website is not down, but there might be problems with the API, apps, or other components.

And I'd like to be able to have our site communicate outages like this Cloudflare one, where more than one site might be affected by a larger provider. Automating that is difficult.

This is still a side project, though, so I mostly work on it when I get the urge :)


That would be pretty interesting. Being able to drill down into individual pieces of a stack would be very informative, especially from a parent source. I bet a lot of services would even have that information readily available in their documentation for APIs.

How comfortable are you with open source? If you were willing to release your stack on Gitlab/Github, it might be worth your while.


Something’s wonky, because it’s not just Cloudflare. One of my personal sites is down that uses nothing but a VPS, and I noticed my Unifi AP disconnect from its controller a little bit ago. Fiber cut? Routing issues?


If that VPS is on DO they're down too cause of CF. Or if you set the resolver on your VPS to 1.1.1.1 that's also down.


Why are digital ocean VPSes down due to a Cloudflare outage? Hoping for a clarifying post mortem...


My Digital Ocean load balancer went down. I think there's probably some internal routing? Would be interested to understand more.


DigitalOcean's VPSes weren't down, but there do seem to be routing issues, as TransIP can't reach DigitalOcean AMS3 (but it's all coming back now).

Maybe the problem was somewhere on the AMS-IX


DO is still up as my machines are still up and accessible.


Huh. My Ubiquiti was reporting WAN link down during this outage. I'm using ATT fiber. I'm wondering if "link down" doesn't mean what I think it means. Now that I check, it says "WAN iface [eth2] transition to state [inactive]". I'm wondering if that means link down or if it's doing service checking.

I actually have a WAN2 configured but not plugged in and it was set to "Load Balancing: Failover Only" ... I wonder if all of my 'connection issues' were software assuming my network link is down and switching interfaces to an unavailable one.


to reply to myself, if you have a second interface configured for failover, it actually tests against ping.ubnt.com. I bet every single time my ATT fiber has "gone out" for a minute or two at a time, it's been bogus.

  root@USG-PRO-4:~# show load-balance watchdog
  Group wan_failover
  eth2
  status: Running
  pings: 2
  fails: 0
  run fails: 0/3
  route drops: 0
  ping gateway: ping.ubnt.com - REACHABLE

  eth3
  status: Waiting on recovery (0/3)
  failover-only mode
  pings: 1
  fails: 1
  run fails: 3/3
  route drops: 1
  ping gateway: ping.ubnt.com - DOWN
  last route drop   : Fri Jul 17 17:32:58 2020


We can't keep going on like this. The vulnerability of centralised internet infrastructure is a huge problem for everyone. Somebody, somewhere, really ought to sort it all out


> Somebody, somewhere, really ought to sort it all out

That could be the slogan for 2020


10-20 minute router misconfigurations and subsequent fixes are sometimes a fact of life. big network infrastructure is complicated, and sometimes the best laid route tables of mice and men do go abloop and die.

Outages happen no matter what the infrastructure is. There's no solution, they're just something you need to recognize and handle, which Cloudflare seemingly did relatively quickly here.


Yes, but other providers are not a single point of failure for a significant percentage of the internet.

Level 3 or Telia going offline is perfectly survivable for any customer who has multiple upstreams.


Remember the big Dyn outage? Or when AWS-US-East was severely disrupted by a hurricane?

It may perhaps be an exaggeration to say that there are not other providers that are similarly critical for a significant percentage of the internet.


Think Back to the 80s. Imagine you’re watching the super bowl. Imagine it goes off for 15 minutes.


If it impacts enough wallets, things might change. I'm not holding my breath though.


Why not you? Just don't use CF. The more people stay away from CF, the better.


I feel like for a lot of sites CF & CDNs are the only way to survive Reddit/HN/etc - do you disagree?

I definitely agree with you in concept, but then I think back to how frequently script kiddies took down sites ~10 years ago, or whatever. I feel like what has changed is the massive CDNs in front of so many sites.

So while I do want a better solution, I'm not sure what it looks like. Thoughts?


Does it really matter? If you're small, who cares if you go down for half an hour? What, you'll make $0.02 this hour instead of $0.05? If you're big, you can afford your own infrastructure. Stick a few servers in a few colos around the world and you'll have better uptime than CF and friends anyway.


You're talking thousands of dollars for colo servers vs. free for CF


> I feel like for a lot of sites CF & CDNs are the only way to survive Reddit/HN/etc - do you disagree?

Reddit/HN/etc will send all users to the same URL. Almost all of those users will come without any pre-existing cookies. Serving the same content to all those users should not be impossible for most sites without CF or a CDN.


Some kind of decentralized CDN in theory.


CDNs are decentralized by nature.


Peer-to-peer, something like a blockchain where no entity controls all the nodes.


Sounds like a problem for... us!


If only there was some website full of computerphiles...


Be the change you want to see in the world :) There are no somebodys somewheres.

One question is how to do DDoS protection without somebody like Cloudflare. Some new protocol for edge caching, perhaps?


DDoS has a few components:

a) complexity: trick your servers into doing something hard

b) volumetric: overwelm your servers with a lot of traffic

c) volumetric part two: overwelm your servers with a lot of requests, so you respond with a lot of traffic

A and C are things you can work on yourself --- try to limit the amount of work your server does in response to requests, and/or make resource-consuming responses require resource-consuming requests; and monitor and fix hotspots as they're found.

B is tricky; there are two ways to deal with a volumetric attack: either have enough bandwidth to drop the packets on your end, or convince the other end to drop the packets (usually called null routing). Null routes work great, but usually drop all packets to a particular destination IP, which means you need to move your service to another IP if you want it to stay online; that's hard to do if your IP needs to stay fixed for a meaningful time (TTL for glue records at TLDs is usually at least a day); and IP space is limited, so if your attackers are quick at moving attacks, you could run out of IPs to use. Some attacks are going above 1 Tbps though, so that's a lot of bandwidth if you need to accept and drop; and of course, the more bandwidth people get so they can weather attacks, the more bandwidth that can be used to attack others if it's not well secured.
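
The blackhole primitive itself is tiny; a Linux-flavoured sketch (hypothetical victim address) looks like this, and upstream providers do essentially the same thing network-wide when you trigger a remote blackhole:

    # silently drop everything destined for the attacked address
    ip route add blackhole 203.0.113.10/32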


I'm not very familiar with DDoS protection strategies. Can you please elaborate on what is meant in (c) by "make resource consuming responses require resource consuming requests"?


Make people login before doing a search is a common example for forums. Search is hard, unauthenticated search will bring low end forums down, so they make you create an account and login.

That sort of thing.


B) is the only one that really needs an external solution, since the traffic is breaking things two or three levels above you.


Or just stop using the internet. The majority of tech problems stem from people using tech. Don't rely on it alone and you don't have problems.


Yells 'peterwwillis on the internet ;)


Yesterday I noticed most of their lava lamps (which generate random bits) are out. Perhaps these are a critical component.

https://photos.app.goo.gl/g6eR8V2PSY3EVjCLA


I'm sure you were joking but they actually are: https://blog.cloudflare.com/lavarand-in-production-the-nitty...


Despite them actually mixing it into their entropy pools, the lava lamps are still entirely for show. The noise of the camera sensor itself is going to contribute orders of magnitude more entropy than the slow movement of the lamps. It's not completely a fake stunt, but it's certainly headline-optimized.


Someone needs to get these lava lamps plugged back in ASAP!


Good guy cloudflare giving programmers an early weekend.


But overtime for their own programmers :)


A tweet from Cloudflare's CEO:

> Appears that the router in Atlanta announced bad routes (effectively a route leak). Only impacted our backbone. Not all of our PoPs are connected to our backbone, so some would not have seen an issue. Appears to have impacted about 50% of our traffic for a bit over 20 min.

https://twitter.com/eastdakota/status/1284259895475236865


I am probably the only one who cares but even https://downforeveryoneorjustme.com/ is down


I personally find sites like https://outage.report or https://downdetector.com which tally up the number, regions and history of people saying it isn't working for them more conclusive.


Now we need isdownforeveryoneorjustmedown.com


Yep, we are very dependent on Cloudflare :(


It's up for me.


First place I went to also. Then HN.


Weird, it works for me (US)


That page still shows "Cloudflare System Status: All Systems Operational" for me, but it's definitely down for me. Along with 1.1.1.1, which is... bad.


For me it says "Minor System Outage" for about 0.1s and then shows "All Systems Operational".


Everything works fine for me (Canada), am I missing something or is it over already?


Also in Canada. Shit's fucked, yo.


Still down here.


Having DNS issues. Had to switch to Google's DNS (8.8.8.8/8.8.4.4) as 1.1.1.1/etc were not resolving anything.


Same. Even then, a bunch of sites are down. Maybe only ones behind Cloudflare? So far I've been trying to hit the various down detector sites and none of them will load. Google, Reddit, Hackernews are all fine.


From what little I've been able to gather, anything using Cloudflare's DNS is down.


Same. Had to go through the entire troubleshooting process --- is it my internet connection? my DNS resolver? firewall? ISP starting to filter DoT queries somehow?

Only last in my mental list was the possibility that Cloudflare would be down.

Hope they publish a detailed post-mortem. It's always fun to read (but certainly very painful for those directly involved in writing it).


1.1.1.1 is dropping ICMP pings while 8.8.8.8 is not but 8.8.8.8 is still returning DNS errors.


That only worked until all the records expired from Google's cache.


Monday Morning RCA: "We pushed out some routine code updates, but this really weird thing happened causing a resource utilization spike on our DNS systems. Because of this other really weird thing, this affected all of our global infrastructure simultaneously. Here's a deep engineering dive into this one weird thing that brought everything down."


Cloudflare's DNS (1.1.1.1) is failing to respond to most/all queries, which I'm observing as the root cause of a bunch of connection issues (name lookup failure).

Interestingly the same domains don't show up on google's (8.8.8.8) DNS at all.


8.8.8.8 is a caching resolver, it still needs to talk to CF's nameservers for authoritative records.
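
You can watch that dependency directly with a trace, which follows the delegation from the root servers down to the zone's Cloudflare-hosted name servers:

    # walks root -> .com -> the domain's authoritative (Cloudflare) name servers
    dig +trace discord.com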


Lol, talk about timing. I'm currently working on a TLS library and was pulling my hair out trying to figure out why tests against CF sites suddenly failed. Can't even ask my cohorts on Discord because they are behind CF, too!


"StackOverflow devs have the most difficult job in the world. After all, when StackOverflow is down, they can't exactly look for help on StackOverflow".


No kidding! We had literally deployed a major page redesign and started watching our analytics drop off on its way to zero. My heart is still racing. I wouldn't normally be happy for a Cloudflare outage, but in this case it's better than Google deciding to remove us from their index.


That's unfortunate timing dude. Good for you, it's probably not your mistake :D


I can't change my NS records to point to a different DNS provider because my registrar, Namecheap, also uses Cloudflare. Didn't expect that.


Yep, Cloudflare, DigitalOcean and 1.1.1.1 down for me. I thought it was my internet and was so confused for a bit there.


Friendly reminder (and notes to myself):

Don't use Namecheap and Cloudflare at the same time.

Namecheap is using Cloudflare, so if Cloudflare is down, you can't change DNS settings on Namecheap either!


Reminder for Firefox users: Firefox uses DNS over HTTPS, and the default is Cloudflare. If you're having DNS issues, you need to disable it until Cloudflare is back up.


Oopsie Daisy half the internet goes down


2020 is all in on everything bad that can happen.


Eggs and baskets etc etc


Good DNS practice (at least when I did system admin 10 years ago) was ALWAYS having a secondary at some other location/network. Why do we just put some info in Cloudflare and call it good these days?


It's hard to use Cloudflare as a reverse proxy without using them as your delegated name servers (maybe you can use CNAMEs on paid plans?), and fancy dynamic nameservers make it hard to run secondary servers with zone transfers.


This is definitely the answer I needed.


Because we don't do tech work anymore. We put in a credit card and click "Buy Now." Also, nobody got fired for using Cloudflare.


Features, convenience, pricing.


What luck, I chose today to install a new piece of network gear. I thought I had managed to totally FUBAR my network. DNS was failing, "ping 1.1" (my current goto test "Am I connected to the internet?" as it requires the fewest keystrokes and hits the Cloudflare DNS 1.0.0.1) failed and I just assumed it was my fault. Backed out my changes, and discover in fact, the internet was down.
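
The shorthand works because of classic inet_aton parsing, where the last number fills all the remaining bytes, so "1.1" expands to 1.0.0.1:

    ping -c 1 1.1    # pings 1.0.0.1 (Cloudflare's secondary resolver)
    ping -c 1 8.8    # pings 8.0.0.8, not 8.8.8.8 - the trick only fits some addresses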


Ping 1.1... thanks for that!


I was trying to fix my router the last 15 minutes :)


Yeah, I was waiting for it to fix itself, then tried my cell phone and realized that was down too. I assumed it was an issue with regional/backbone routing or something. Especially because status pages, which I wouldn't expect to be hosted on AWS (because of the need for status pages to stay up when AWS goes down), seemed to also be down. Didn't realize it could be Cloudflare...


Same here. Only figured it out because just one of the computers uses Cloudflare dns and the others were fine...


Ditto, except vice versa. My machine is set to the router, which uses Cloudflare. Other machines use whatever is default for macOS (I try not to touch those). Once I realized they were working and I could access the internal network from outside, I started diagnosing DNS. Came here via 8.8.8.8.


Our TPU management page is also down: https://www.tensorfork.com/tpus

Seems cloudflare took out a good chunk of the internet temporarily.

Doesn’t HN use Cloudflare? Why did it survive? (I haven’t looked for about a year, but I seem to remember HN being proxied behind CF at one point.)



How do you deal with DDoS attacks?


It doesn't look like it does. HN's IP space belongs to "M5 Computer Security", and their DNS nameservers are on "awsdns". Nothing there to suggest CF.


Is there a status page for HN?


Yeah, it's whether news.ycombinator.com loads :P


EDIT: Yes: https://twitter.com/HNStatus

HN is so reliable that it’s almost never needed one. I’m extremely curious how HN survived this; I'm almost positive they used Cloudflare at one point.

I think the official status page is @hnstatus on Twitter, or something like that.


They did use Cloudflare, but also haven’t for some time.


HN is reliable until AWS Route53 goes down :).


HN allegedly still runs on one machine running a single-threaded Lisp webserver.



Parts of the site that are behind CF, like the API, are down.


Is it me, or has this been happening way too frequently for them lately?


To be fair the last major outage they had was 1-2 years ago. That said, when that happened they had two outages in about a month.


Honestly at their scale once a decade would be too frequent. Too many eggs in this particular basket.


Once a decade doesn't seem realistic. At some point you get diminishing returns chasing as many nines as possible.


Appears to be working for me now

    ; <<>> DiG 9.10.6 <<>> discordapp.com
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 8092
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 1232
    ;; QUESTION SECTION:
    ;discordapp.com.                IN      A

    ;; ANSWER SECTION:
    discordapp.com.         140     IN      A       162.159.135.233
    discordapp.com.         140     IN      A       162.159.129.233
    discordapp.com.         140     IN      A       162.159.130.233
    discordapp.com.         140     IN      A       162.159.134.233
    discordapp.com.         140     IN      A       162.159.133.233

    ;; Query time: 69 msec
    ;; SERVER: 1.1.1.1#53(1.1.1.1)
    ;; WHEN: Fri Jul 17 14:37:40 PDT 2020
    ;; MSG SIZE  rcvd: 137


1.1.1.1 is not resolving anything for me at this time.


For the people reporting this is down: which country are you in? All the reported sites work flawlessly from Norway (Europe) :-)


Brazil here, basically everything down.

I noticed a lot of packet loss to 1.1.1.1, not an outright "outage", maybe they were rolling a deployment?

Edit: Looks like a rolling deployment to me (in the logs I could see cascading traces: it took down one DC, then another started responding with increased latency, then that one went down too, etc.). Gonna be an interesting post-mortem!


East Coast USA here. Any cloudflare site is unreachable, and 1.1.1.1 is giving me massive latency and packet loss :)


US here, everything reported that I've checked is down. My Cloudflare sites are down as well.


I'm in Holland but everything came back up just now, so maybe you picked the exact right moment to check?


I put my bet on some peering fuckup causing outages, since people are seeing packet loss etc.


"Update - This afternoon we saw an outage across some parts of our network. It was not as a result of an attack. It appears a router on our global backbone announced bad routes and caused some portions of the network to not be available. We believe we have addressed the root cause and are monitoring systems for stability now. Jul 17, 22:09 UTC" - https://www.cloudflarestatus.com/


Germany, everything down.


Interesting ... Germany here, too, but I didn't see any down sites


Ireland here, and it's all down.


Estonia, everything working fine


> * This afternoon we saw an outage across some parts of our network. It was not as a result of an attack. It appears a router on our global backbone announced bad routes and caused some portions of the network to not be available. We believe we have addressed the root cause and are monitoring systems for stability now.* Jul 17, 22:09 UTC


A lot of unusual internal traffic seems to be around Thailand (you might have to select "UNUSUAL"):

https://www.digitalattackmap.com/#anim=1&color=0&country=ALL...


There have been pretty large numbers from Thailand this month, according to that tool.


Update - This afternoon we saw an outage across some parts of our network. It was not as a result of an attack. It appears a router on our global backbone announced bad routes and caused some portions of the network to not be available. We believe we have addressed the root cause and are monitoring systems for stability now.


Their status page says that everything is operational. So much for a status page when half of the internet breaks down.


What I found particularly interesting is that my MacBook Pro (work laptop) wouldn't start up properly anymore and I wasn't able to launch applications ... sorry, but wtf. Now I hate Apple and their shitty, overpriced products even more.


What was interesting and scary is that our monitoring system didn't notify us. Our email was down because we use Cloudflare for DNS, and our monitoring provider's SMS gateway was down too, so we didn't get SMS messages.


Another useless status site.

DNS is completely broken.

"All systems operational" in nice soothing green.

No, not so much.


FYI, PagerDuty is not loading!

Time to go back to the drawing board, for a lot of us, to re-assess points of failure.

Edit: many websites are failing to resolve in DNS, but the services they provide continue to function fine behind the curtain.


This is likely your computer's DNS resolver (if you're using 1.1.1.1, you're down); I'd switch to 8.8.8.8 temporarily. We've had PagerDuty alerts coming in since the start (a whole bunch of DNS errors from Pingdom), and when I click the Slack link, PagerDuty works for me.


Interestingly, this seemed to only affect the resolver service. I use Cloudflare pretty extensively on all my sites, but only in DNS mode (no CDN / proxy). The hosts continued to resolve fine during the outage (following the root DNS resolution chain, no recursive resolver involved). I imagine their CDN internally uses their resolver service, which explains the outages, and some unrelated third parties who don't use CF on their domain at all still created a hard dependency on CF by using their recursive DNS server.
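
A rough way to see that split during an event like this (example.com and ada.ns.cloudflare.com are placeholders; substitute your own zone and the nameservers Cloudflare assigned to it):

    # via Cloudflare's recursive resolver: was timing out / SERVFAILing
    dig @1.1.1.1 example.com A +short

    # straight at the zone's authoritative Cloudflare nameserver: per the above, kept answering
    dig @ada.ns.cloudflare.com example.com A +short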


Those poor Firefox users who enabled DoH


Happy Friday, everyone!


Ironically https://downforeveryoneorjustme.com/ is also down. HA


I'm having real problems with DNS; is this Cloudflare too? They say "All Systems Operational", so maybe not?

Half the damn internet is not currently resolving.


Yes, same here. Changed DNS to Level 3; all better now.


More like the Internet is down.


When you depend on a single company for much of the internet, such things happen :(


Sites that are down: Discord, Linode, Patreon, npmjs, DigitalOcean, Coinbase, Zendesk, Medium, GitLab (502), Fiverr, Upwork, Udemy and many more, including 1.1.1.1 DNS. Ref: https://twitter.com/nixcraft/status/1284239374809395200?s=19


Seeing a lot of people mentioning DO, but it has been up for me without any issues (small VPS in SF-2)


My Pagerduty's been blowing up so I tried to go to their dashboard to pause the notifications for now and pagerduty.com is down XD


340ms average latency to 1.1.1.1 and 47% packet loss. Many sites are down. But I guess that's the problem with CDNs.


It's unfortunate that both the primary and secondary Cloudflare DNS servers are down. I just switched my secondary to Google.

This allows my internet to "work" during this time, but adds about 1s latency to resolutions. Presumably that's the time it takes my internal DNS resolver to try the secondary.
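
That fallback delay is presumably the stub resolver's per-server timeout before it moves on to the next nameserver entry; on a glibc box it can apparently be tightened in /etc/resolv.conf (values illustrative):

    nameserver 1.1.1.1
    nameserver 8.8.8.8
    options timeout:1 attempts:1 rotate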


Consider running your own full resolver like Unbound. Then you don't have to rely on a DNS provider like Google or Cloudflare. It's really nice not having the whole internet go down when Google or Cloudflare DNS is down.
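
Something like this minimal unbound.conf (just a sketch) gives you a full resolver that walks the delegation from the roots itself, with no forwarder:

    server:
        interface: 127.0.0.1
        access-control: 127.0.0.0/8 allow
        prefetch: yes         # refresh popular names before their TTL expires
        cache-min-ttl: 60     # optional floor on very short TTLs

Worth noting it only removes the dependency on third-party recursive resolvers; when a zone's authoritative nameservers themselves stop answering, running your own resolver doesn't save you either.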


DNS seems to be dead; if you have stuff in your cache and the site isn't low-TTL, things still kinda work.


Yeah this is absolutely killing us right now.


Looks like Digital Ocean is reporting an issue with their upstream provider:

https://status.digitalocean.com/incidents/6wtmldty17g1

As big as this is, any chance a major hub/backbone went down?


I don't use Cloudflare, but I do notice Cloudflare services being down.

Right now, I can't get to my own website (hosted on DigitalOcean, not through Cloudflare), but Oh Dear claims it's up. So I suspect that the problem is closer to me than it is to DigitalOcean (or Cloudflare).


DO uses Cloudflare for their DNS... both for digitalocean.com and for their DNS service.


Good to know! That makes perfect sense based on what I saw during the outage. I had no idea.


DO might use 1.1.1.1 (or even Argo) for routing between some of their PoPs.


Could be. Things are back now, but I was very surprised that a Cloudflare outage makes it impossible for me to get to my Kubernetes API server.

Hidden dependency revealed.


Aside from this one issue, is switching to 1.1.1.1 a good idea in your experience? I just realized I have the DNS servers from my ISP, which is probably how they inject bullshit 404 pages full of ads. What is the fastest/best public DNS in your experience?


I've been pretty happy with 1.1.1.1 (before now). Might be worth using something like 8.8.8.8 as a backup (Google).


You could also use 9.9.9.9 as your backup, if you're avoiding Google (https://www.quad9.net)


I've been using https://nextdns.io/ - works fast and, most importantly, blocks a bunch of ads (user configurable), so it makes browsing on mobile much nicer.

Anecdotally, the internet seems faster.


For us, it's cloudflare. Our ISP is connected to KCIX, and cloudflare apparently has 1.1.1.1 servers in Kansas City. No other free DNS provider is as quick, for us. 18ms or so RTT, as opposed to the WISP's internal latency of ~10ms. Central Kansas.


Looks like it's back. No longer getting issues with 1.1.1.1 and domains are being resolved!


Not for me, `dig @1.1.1.1 google.com` is still returning SERVFAIL. Their anycast config may be broken in some way (i.e. the backends for some regions are down but still advertising routes).
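
One way to see which anycast instance is actually answering you (useful when some PoPs are healthy and others aren't) is the server-id CHAOS query, which Cloudflare's resolver supports as far as I know; output illustrative:

    $ dig @1.1.1.1 id.server CH TXT +short
    "ATL"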


Resolved as of a minute ago, still having an issue now?


I thought my issue was with Comcast, then I realized I'm using CF's DNS entries for my home network. I removed those 1.1.1.1 entries and some sites are working.

Dang, I'm pretty disappointed in CF. I've never experienced a DNS outage with this much impact.


From what I can see externally this looks like DNS.

I wonder if that includes the roots that Cloudflare operate.


It’s worth mentioning here that 1.1.1.1 is also affected by this outage which initially made me think my internet was gone completely.

Changing back to an alternative (such as 8.8.8.8 from google) restored my access to the areas of the internet not using Cloudflare.


This was especially bad because I use Cloudflare public DNS exclusively at my house, and it went down as well. I didn’t even think to check DNS; I just assumed it was AT&T being sheisty AT&T.

I should probably run a blend of 1.1.1.1 with 8.8.8.8 instead.


Last I read about 7 million hosts are behind Cloudflare. Maybe around 3% of the web, but who knows if that counts for critical assets etc rather than pages served.

Shameful that so much of our decentralised web is so centralised and breakable in one place.


Just got a whole bunch of alerts that my services are down. Tried logging into Digital Ocean (who it seems uses Cloudflare) to get it fixed. Could not access their dashboard to reroute things.


RIP someone's weekend


Guess that explains Discord vanishing from the net a few minutes ago.


I'm surprised so many people still use them. They took my business down (along with half the internet) a few years ago, and I learned that they were too large a point of failure.


Ah Shit, Here We Go Again


Vercel also appears to be dropping out and coming back in intermittently over the last 30 minutes or so. Not aware they're using Cloudflare, although they do mention using AWS.


NextDNS got taken out by this; I'd been really happy with it up until now. And unfortunately "the DNS service went down" has a wide enough blast radius at home now that it's a real pain.


How did you verify that? I determined the issue was with Cloudflare's DNS by toggling on NextDNS, which worked and continues to.


Most of my devices went belly up and I was trying to figure out what it might be (I run NextDNS on my router). I switched over to cell and noticed Discord was down too, so I started suspecting NextDNS. I toggled DNS to Google and it immediately worked.


This is great. I already have bad enough internet (rural area, with latency of three to six digits and a four-digit average, barely a few kilobytes per second of speed), and it doesn't help to have Google smearing its reCAPTCHAs everywhere (not really friendly toward low-speed internet / non-Chrome users) and Cloudflare proxying half the internet while lately not doing a great job of keeping consistent uptime.

At least I'm glad HN exists; it's the only thing that loads everywhere.


I was trying to play video games but couldn't connect. Amazing how connected the web is now - one big hub goes down and brings the whole house of cards down with it.


Did anyone else see their AT&T internet go down? The DNS issues started and then the Pace 5268AC rebooted. I don't use Cloudflare for DNS. Does AT&T's backend?


On the contrary, AT&T actually squats on the Cloudflare DNS IP address. IIRC that modem is one of the affected ones that uses 1.1.1.0/24 internally. You shouldn't even be able to use Cloudflare DNS normally.


Thank you for the response and information. Unhappy coincidence.


It shows "Minor system outage" when I load the page, but it switches to "All systems operational" immediately. Same behaviour on several attempts.


The status page linked shows "All Systems Operational" for me. Tested in private browsing and on my mobile.

Looks like DNS issues, their nameservers aren't reachable.


A lot of people are saying AWS. I'm having intermittent network connectivity issues intra-AZ, so perhaps they lost a data center or a route flapped.


What's all this about ”building a better internet”? Wide-reaching general service outages that are invisible to your status page are really not great.


I wonder if they'll cover why their status page is a steaming pile of garbage in the post-mortem?


Yesterday Cloudflare took down some of our products because they (not us) misconfigured some DNS thing. Kind of funny to see it happen again a day later.


This is hitting my production environments as well :-(


Twitch.tv channels are like 50/50 right now. Some are ok, some aren't.

Basically all Riot Games (League, Valorant, TFT) are down, dunno about LoR.


That would explain why Patreon is down. I was going to post a little frog I took a picture of on Lens. Went down just as I opened the app.


Hopefully with this outage Cloudflare will finally provide non-Enterprise plans a CNAME record, allowing us to quickly bypass Cloudflare.


Thought I was going crazy for a second.

This affects so many things it's scary, and Cloudflare status page has still not updated. HN got there first.


Interesting. They had an outage in the midst of a negotiation I was a part of. Are they less stable than Akamai and the others?


My modem also disconnected with signal problems, which was interesting. I'm not sure Cloudflare could have caused that?


maybe your modem uses 1.1.1.1 dns?


I was paying a bill in a bar in London about an hour ago and they couldn't process any payments. Seems likely related.


Must be regional or some other factor involved. Various sites others are reporting as offline load for me as does 1.1.1.1.


Oh, did hacker season just start? *grabs popcorn* So was this an accident, or is it connected to the Twitter hack?


Will we take this as a much needed lesson about putting all of the internet's eggs into one basket? Probably not.


WebGazer.io managed to shoot me an email about my site being down. This in spite of their site being down too.


Hi @heliodor, WebGazer founder Gokhan here :)

Actually the site is running but not accessible due to the issue. Glad you got the heads-up; after a while I had to pause monitoring to prevent side effects.


Yup. 1.1.1.1 stopped responding as well.


DNS resolution at 1.1.1.1 seems to have gone down and come back up for me in the course of 10-15 minutes.


Also breaking a bunch of deploys because npm and yarn are heavily dependent on Cloudflare, it would seem.


Remind me to check and see that I have 8.8.8.8 and 1.1.1.1 on my networks, not just one or the other..


It's back now (at least for me).


The uptime tool I use (StatusCake) is itself down... Was wondering why I didn't get an alert.


Seems to be a localized issue. Cloudflare is up here in my country, but down for many people in the US.


People talk about a single point of failure then go on to depend on a single point of failure.


The status page is now showing degraded performance for the Cloudflare API and Recursive DNS.


Cloudflare status site is also partially down. Some resources are not loading properly.


Of course, I read this after I spent an hour debugging some strange DNS issues.


Thank you for the early Friday.


I noticed Udemy was down when I wanted to go to the next video I was watching.


And now it's back up.


I can only see the dashboard down, all my sites with Cloudflare are up.


Looks like it's resolved -- we're coming back up at Repl.it.


Cloudflare DNS is down too


Is this US specific? Everything seems to work fine here in Europe.


Centralising on a single host suddenly not a good idea any more?


Seems like they managed to break half the internet for everyone.


How can I have Cloudflare plus something else as a DNS failover? We're afraid to set a long TTL and then have our IP change for some reason. What do you guys recommend?
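
Is something like publishing NS records from two providers at the registrar, and keeping the zone data in sync yourself, even workable? Resolvers should retry the other provider's servers if one set stops answering. Sketch of what I mean, hostnames illustrative:

    ; NS set registered for the domain
    example.com.   IN  NS  ada.ns.cloudflare.com.
    example.com.   IN  NS  bob.ns.cloudflare.com.
    example.com.   IN  NS  ns1.other-provider.example.
    example.com.   IN  NS  ns2.other-provider.example.

My understanding is that Cloudflare's lower plans expect to be the sole authoritative provider and don't do standard zone transfers, so syncing would mean driving both providers' APIs, and the proxy/CDN features only apply when Cloudflare's own nameservers answer.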


DNS is also (partially) down with my ISP (xs4all.nl) it seems


Ironic, isitdownrightnow.com is down.

All my DigitalOcean instances are down.


Yep, the DoorDash app is affected currently, my burritos!


Finally we see how much we depend on this single company.


Looks like it's starting to come back in the SE US.


Boats. People. Stop putting them in one single boat.


Back online for me.


Dns issues for sure


Well, sheit. This is all around the world. Press F.


Seems to work again for me in germany (Frankfurt)


League of Legends down too, not sure if related.


Most pages mentioned here seem functional again.


So THAT is why the Internet was acting weirdly!


DNS seems to be resolving for me in the UK now


Back online just now for me in Midwestern US.


2:36 PM PST - status.discord.com is back up.


NPM is down too.


Seems like it's starting to come back.


Was wondering why my DNS wasn’t working...


digitalocean.com DNS (on Cloudflare) is now resolving again. Looks like several things are coming back now.


I was having trouble with overleaf.com


How does Cloudflare compare to Akamai?


Ironically StatusCake is down as well.


who deployed on a friday afternoon?


Cedexis gets another lease on life


Jesus. Does anyone know anything?


Having issues with gitlab myself


Seems like it’s back right?


Can confirm for taskade.com


Friday


1.1.1.1 is back for me now


League of Legends, Valorant and Discord all down. I took today off to play games...


Felt like a BGP issue.


Looks like CF is up!


Some POPs are fine.


Five nines uptime?


Back online here.


the internet was built

to withstand a nuclear war

brought down by cloud flares


lmao it even took down my local stack


hn algolia search broken


yes hn.algolia.com is powered by Cloudflare /o\


more like Cloudflared


itch.io down

isitdownrightnow.com down


yes


Given that the US is basically in a non-shooting war with China, I wonder if this is something technical or part of some kind of attack. Something that I’d keep in mind.


There are enough ways for bits of the Internet to go kablooey on their own that “it’s an attack!” is a pretty big jump to a conclusion. If this turns out to be something other than Cloudflare tripping over a weird bug, my first guess would be that someone fat-fingered a BGP table yet again.


Update: Yep, BGP issue, though I was thinking it would be something on the public Internet rather than CF’s backbone.


Your username is funny.



