Discord is entirely down right now, both the website and the app itself. Amusingly, a lot of the sites that normally track outages are also down, which made me think it was my internet at first. Downdetector, monitortheinternet, etc.
Lots of other big sites that are down: Patreon, npmjs, DigitalOcean, Coinbase, Zendesk, Medium, GitLab (502), Fiverr, Upwork, Udemy
Edit: 15 min later, looks like things are starting to come back up
I host a dedicated server there (running https://www.circuitlab.com/) and when I traceroute/ping news.ycombinator.com, it's two hops (and 0.175 ms) away :)
This was a BGP/routing issue and has already been documented. Please don't spread misinformation and hysteria, especially on technical issues like this
It was a misconfiguration we applied to a router in Atlanta during routine maintenance. That caused bad routes on our private backbone. As a result, traffic from any locations connected to the backbone got routed to Atlanta. It resulted in about 50% of traffic to our network not resolving for about 20 minutes. Locations not connected to our backbone were not impacted. It was a human error. It was not an attack. It was not a failure or bug in the router. We're adding mitigations to our backbone network as we speak to ensure that a mistake like this can't have broad impacts in the future. We'll have a blog post with a full explanation up in the next hour or so — being written now by our CTO.
Don't think so. They used to use Cloudflare but stopped. To my knowledge, it's a single server without a database (using the filesystem as a database).
>Holy crap I am thinking either there is some magic or everything we are doing in the modern web are wrong.
Spin up an apache installation and see how many requests you can serve per second if you're just serving static files off of an SSD. It's a lot.
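If you want to see the number for yourself, even a crude single-threaded client makes the point. A rough sketch, assuming you already have some static server running locally (Apache, nginx, or `python -m http.server`) at the hypothetical URL below:

    # Crude requests-per-second measurement against a local static server.
    # The URL and request count are placeholders; a real benchmark tool like
    # ApacheBench or wrk will push the server much harder than this one client.
    import time
    import urllib.request

    URL = "http://localhost:8000/index.html"   # hypothetical local static file
    N = 5000

    start = time.time()
    for _ in range(N):
        with urllib.request.urlopen(URL) as resp:
            resp.read()
    elapsed = time.time() - start
    print(f"{N} requests in {elapsed:.2f}s -> {N / elapsed:.0f} req/s (one client)")

Even this naive loop will typically report hundreds or more requests per second for a small file, and concurrent clients only push the server's total throughput higher.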
edit: I see that there are already a bunch of other comments to this effect. I think your comment is really going to bring out the old timers, haha. From my perspective, the "modern web" is absolutely insane.
> *From my perspective, the "modern web" is absolutely insane.*
Agreed.
I was brought up as a computer systems engineer... So, not a scientist, but I always worked with the basic premise of keeping it simple. I've worked on projects where we built all the newfangled clustering and master/slave (sorry to the PC crowd, but that's what it was called) stuff but never once needed it in practice. Our stuff could easily handle saturated gigabit networks with the 2-core CPU only running at 40%. We had CPU to spare and could always add more network cards before we needed to split the server. It was less maintenance, for sure.
It also had self-healing, so that some packets could be dropped if the client config allowed it and the server decided it wanted to (but it only ever did on the odd dodgy client connection).
That said, I was always impressed by the map-reduce Google used for search results (yes, I know they've moved on), which showed how massive systems can be fast too. It seemed that the rest of the world wanted to become like Google, and the complexity grew for the standard software shop, when it didn't need to imho.
I jumped ship at that point and went embedded, which was a whole lot more fun for me.
There is only so much I can do - I'm not American and that's a change I can't make on my own, aside from doing my best to be an ally when possible.
However, I can open a few PRs and use some of my time to make that change. It's a minor inconvenience to me, and if it makes even one black person feel heard and supported then yeah, I'm gonna do it.
>I think your comment is really going to bring out the old timers, haha.
That is great! :D
>It's a lot.
Well yes, but HN isn't really static, though. It's fairly dynamic, with a huge number of users and comments. But still, I think I need to rethink a lot of assumptions in terms of speed, scale and complexity.
Huge numbers of users don't really mean that much. Bandwidth is the main cost, but that's kept low by having a simple design.
Serving the same content several times in a row requires very few resources - remember, reads far outnumber writes, so even dynamic comment pages will be served many times in between changes. 5.5 million page views a day is only 64 views a second, which isn't that hard to serve.
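If anyone wants to check that arithmetic, it's just the number of seconds in a day:

    # Back-of-the-envelope: daily page views to average views per second.
    views_per_day = 5_500_000
    seconds_per_day = 24 * 60 * 60                 # 86,400
    print(round(views_per_day / seconds_per_day))  # ~64 views/second on average

Peak traffic will be some multiple of that average, but the order of magnitude is the point.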
As for the writes, as long as significant serialization is avoided, it is a non-issue.
(The vast majority of websites could easily be designed to be as efficient.)
There is some caching somewhere as well, which probably provides a bit more of a boost.
I've been at my work laptop (not logged in) and found something I wanted to reply to, so I pulled out my phone and did so. For a good 10 seconds afterwards, I could refresh my phone and see my comment, but refresh the laptop and not see it.
> From my perspective, the "modern web" is absolutely insane.
You know, it should be even better than it was in the past, because a lot of heavy lifting is now done on the client. If we properly optimized our stuff, we could potentially request tiny pieces of information from servers, as opposed to rendering the whole thing.
Kinda like native apps can do (if the backend protocols are not too bloated).
> Holy crap I am thinking either there is some magic or everything we are doing in the modern web are wrong.
It doesn't need to be crazy.
A static site on DigitalOcean's $5 / month plan using nginx will happily serve that type of traffic.
The author of https://gorails.com hosts his entire Rails video platform app on a $20 / month DO server. The average CPU load is 2% and half the memory on the server is free.
The idea that you need some globally redundant Kubernetes cluster with auto fail-over capabilities seems to be popular but in practice it's totally not necessary in so many cases. This outage is also an unfortunate reminder that you can have the fanciest infrastructure set up ever and you're still going down due to DNS.
> The idea that you need some globally redundant Kubernetes cluster with auto fail-over capabilities seems to be popular but in practice it's totally not necessary in so many cases
True, but this is why it shouldn't be bashed either. When you need it, you need it (cue very complex enterprise applications with SLA requirements).
> True, but this is why it shouldn't be bashed either. When you need it, you need it (cue very complex enterprise applications with SLA requirements).
To support this, look at how many people criticize Kubernetes as being too focused on what huge companies need instead of what their small company needs. Kubernetes still has its place, but some people's expectations may be misplaced.
For a side project, or anything low-traffic with low reliability requirements, a simple VPS or shared hosting suffices. Wordpress and PHP are still massively popular despite React and Node.js existing. Someone who runs a site off of shared hosting with Wordpress can have a very different vision of what their business/side project/etc. will accomplish compared to someone who writes a custom application with a "modern" stack.
We were serving around that traffic quite happily off a single dual Pentium III running IIS/SQL Server/ASP in 2002. The amount of information presented has not grown either.
That little box hosted the main corporate web sites of some top-tier brands too, and was pumping out 30-40 requests a second at peak. There was no CDN.
You were not serving that traffic, you were just serving your core functionality - no tracking, no analytics, no ads, no A/B testing, no dark mode, no social login, no responsiveness. Are most of those shitty? Sure; just let me know when you figure out how to pry a penny from your users for all your hard work.
That means an average of about 63 pages per second. Let's say that the total number of queries is tenfold that, take a worst-case scenario and round up to 1,000 queries per second, and then multiply by ten to get 10k queries per second, because why not.
I don't know what the server's specs are, but I'm sure it must be quite beefy and have quite a few cores, so let's say that it runs about 10 billion instructions per second. That means a budget of about one million instructions per page load in this pessimistic estimate.
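Spelled out with those numbers (the 10 billion instructions per second is the guess above, not a measured spec):

    # Pessimistic instruction budget per page, using the estimates above.
    avg_pages_per_second = 5_500_000 / 86_400     # ~63.7 on average
    worst_case_qps = 10_000                       # rounded up, then multiplied by ten
    instructions_per_second = 10_000_000_000      # guessed ~10 billion instr/s
    print(f"average load: {avg_pages_per_second:.1f} pages/s")
    print(f"budget: {instructions_per_second / worst_case_qps:,.0f} instructions per page")

That prints a budget of 1,000,000 instructions per page load.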
The original PlayStation's CPU ran at 33MHz and most games ran at 30fps, so about 1 million cycles per fully rendered frame. The CPU was also MIPS and had 4KiB of cache, so it did a lot less with a single cycle than a modern server would. Meanwhile, the HN server has the same instruction budget to generate some HTML (most of which can be cached) and send it to the client.
A middle of the line modern desktop CPU can nowadays emulate the entire PlayStation console on a single core in real time, CPU, GPU and everything else, without even breaking a sweat.
>Holy crap I am thinking either there is some magic or everything we are doing in the modern web are wrong.
That's only ~60 QPS, assume it is peaky and hits something more like 1000 QPS in peak minutes, but also assume most of the hits are the front page which contains so little information it would fit literally in the registers of a modern x86-64 CPU.
Even a heavyweight and badly written web server can hit 100 QPS per core, and cores are a dime a dozen these days, and storage devices that can hit a million ops per second don't cost anything anymore, either.
Unlike Reddit, logged in and logged out users largely see the same thing. I wouldn't imagine there is much logic involved in serving personalized pages, when they don't care who you are.
The username is embedded in the page, so you can't do full caching unfortunately. But the whole content could be easily cached and concatenated to header/footer.
That could be split as a setting in the user-specific header with the visibility part handled client-side. A bit more work, but it's not impossible if it's worth it.
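A minimal sketch of that split, caching the rendered body once and concatenating a tiny per-user header at request time. The data store and rendering here are made-up stand-ins, not how HN actually does it:

    # Cache the expensive rendered body; only the small header is per-user.
    from functools import lru_cache

    ITEMS = {1: ["first comment", "second comment"]}   # fake comment store

    @lru_cache(maxsize=4096)
    def cached_body(item_id, version):
        # Expensive part: rendered once per content version, reused for everyone.
        return "".join(f"<p>{c}</p>" for c in ITEMS[item_id])

    def page(item_id, version, username=None):
        header = f"<div class='topbar'>{username or 'login'}</div>"  # per-user bit
        return header + cached_body(item_id, version)

    print(page(1, 1, "alice"))
    print(page(1, 1))   # same cached body, different header

Bumping `version` whenever the thread changes is one simple way to invalidate the cached body without tracking keys by hand.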
It's amazing how little CPU power it takes to run a website when you're not trying to run every analytics engine in the world and a page load only asks for 5 files (3 of which will likely already be cached by the client) that total less than 1 MB.
It is certainly doable; PoF famously ran a lot of page views off a single IIS server for a long time.
HN is written in a Lisp variant and most of the stack is built in-house; it is not difficult to imagine efficiency improvements when many abstraction layers have been removed from your stack.
I don't remember PoF being famous for that, but they got a lot of bang for the buck on their serving costs.
What I do remember is that it was a social data-collection experiment for a couple of social scientists, who never originally expected that many people would actually find ways to find each other and hook up using it.
I miss their old findings reports about how weird humans are and what they lie about. Now, it's just plain boring with no insights released to the public.
For all my sibling comments, there is also context to be aware of. 5.5m page views daily can come in many shapes and sizes. Yes, modern web dev is a mess, but the situation is very different from site to site. This should be taken as a nice anecdote, not as a benchmark.
With DO these days, they don't run me out of bandwidth, but my instance falls over (depending on what I am doing, which ain't much). With AWS, they auto-scale and I get a $5,000 bill at the end of the month. I prefer the former.
Yeah, the bandwidth overage was a grand, give or take. It was a valuable lesson in a number of ways, and why for any personal things I wouldn't touch AWS with a shitty stick.
That's why we have these bloated, over-engineered multi-node applications: people just underestimate how fast modern computing is. Serving ~2^6 requests/sec is trivial.
1M queries per day is ~10 queries per second. It's a useful conversion rate to keep in mind when you see anyone brag about millions of requests per day.
I don't get this comment, what does page serving performance have to do with "the modern web"? It's not as if serving up a js payload would make it more difficult to host millions of requests on a single machine, html and js are both just text.
Makes sense based on what I've read about Arc, which HN is written in.
I've been working on something where the DB is also part of the application layer. The performance you can get on one machine is insane, since you spend minimal time on marshalling structures and moving things around.
Only if you use their "infinitely scaling" services, e.g. S3. If the attacker is hammering you with expensive queries and your database is on one EC2 server, you're still going to go down.
My iPhone actually popped up a message saying that my wifi didn't appear to have internet, which was strange and obviously false as I was actively using the internet on it and the laptop next to it, but now it makes sense that it must have been pinging something backed by cloudflare!
Discord attempted to route me to: everydayconsumers.com/displaydirect2?tp1=b49ed5eb-cc44-427d-8d30-b279c92b00bb&kw=attorney&tg1=12570&tg2=216899.marlborotech.com_47.36.66.228&tg3=fbK3As-awso
It looks like you mistyped it and landed on a domain with spammy redirects. They have all kinds of weird URLs and there's not always any connection to anything you did other than go to the wrong domain.
We're dealing with a deeper-level problem here. Since a lot of the internet relies on Cloudflare DNS at some point or another, even many backup solutions fail.
Since so much of DNS is centralised in so few services, such outages hit the core infrastructure of the internet.
A sudden disruption on a large number of services for everybody at once doesn't look like a DNS problem to me, with all the caching in DNS. It would fail progressively and inconsistently.
DNS absolutely was an issue. I changed DNS manually from Cloudflare's 1.0.0.1 (which is what DHCP was giving me) to 8.8.8.8 (Google) and most things I'm trying to reach work. There may be other failures going on as well, but 1.0.0.1 was completely unreachable.
No, I changed the setting back and forth while it was down to confirm that the issue was that I could not reach 1.0.0.1. All the entries I tried from my host file were responsive (which is how I ruled out an issue with my local equipment initially and confirmed that it wasn't a complete failure upstream -- I could still reach my servers). Changing 1.0.0.1 to 8.8.8.8 allowed me to reach websites like Google and HN, and changing back to default DNS settings (which reset to 1.0.0.1, confirmed in Wireshark) resulted in the immediate return of failures. 1.0.0.1 was not responsive to anything else I tried.
Again, it may not have been the only issue -- and there are a number of possible reasons why 1.0.0.1 wasn't reachable -- but it certainly was an issue.
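For anyone who wants to reproduce that sort of check without touching system DNS settings, you can query specific resolvers directly. A sketch using the third-party dnspython package (2.x API; the hostname is just an example):

    # Ask specific resolvers directly, bypassing the system-configured DNS.
    import dns.resolver

    def check(server, name="news.ycombinator.com"):
        r = dns.resolver.Resolver(configure=False)  # ignore the OS resolver config
        r.nameservers = [server]
        r.lifetime = 3.0                            # give up after 3 seconds
        try:
            answer = r.resolve(name, "A")
            print(server, "->", [a.address for a in answer])
        except Exception as e:
            print(server, "-> FAILED:", type(e).__name__)

    check("1.0.0.1")   # Cloudflare
    check("8.8.8.8")   # Google

During the outage the first check would have timed out or returned SERVFAIL while the second kept answering, which is exactly the comparison described above.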
> I don't use cloudflare DNS but google DNS and got the same problems as everyone else
Cloudflare is also the authoritative DNS server for many services. If Cloudflare is down, then for those services Google's DNS has nowhere to get the authoritative answers from.
I can live without creepy instant messengers, but it's shocking just how much everything else relies on one central system. And furthermore, why is it always Cloudflare?
Idk about the UI; the redesign, with the forced collapses and extra clicks everywhere, is annoying when handling a multitude of subdomains plus their Let's Encrypt TXT entries. I'm there mostly for the freebies.
I did a pickup order a few weeks ago when, of all things, T-Mobile SMS went down for 3+ hours. I couldn't go in the restaurant (covid) and I couldn't text them the parking spot # I was sitting at in a packed parking lot. I got a flood of about 50 texts a few hours later. Sat there for about an hour waiting for a $9 sandwich. I have no idea if they didn't get my order until late, or if they finally realized it was me, or what. About 45 minutes in I decided to just give up on the day and take a nap; woke up to a door knock.
Kudos to the people at Discord. Just a few minutes after I got disconnected they had already tweeted about the issue. A few minutes later they had a message in their desktop app confirming it's an issue with Cloudflare. All while Cloudflare's status page said there were 'minor outages'.
As a percentage of total traffic, a 'minor' outage for Cloudflare probably equates to a significant outage for a non-trivial amount of the internet.
It will also be especially noticeable to end-users, because sites using Cloudflare are typically high-traffic sites, and so a 'minor' issue that affects only a handful of sites is still going to be noticed by millions of people.
I wonder if they are all using Cloudflare's free DNS stuff or if they're paying for business accounts?
My stuff is on Netlify (for the next week or so) and the rest is on a VPS bought from a local business who isn't reselling cloud resources. I'm kinda glad I moved all my stuff from cloudflare.
Crazily, my local name resolution started failing, because I have these name servers: 192.168.0.99, 1.1.1.1 and 8.8.8.8. The first does the local resolution, but macOS wasn't consulting it because 1.1.1.1 was failing?? Crazy. When I removed 1.1.1.1 from the list, everything started working.
Thought something like this was going on. At first I thought it was my router and restarted everything - to no avail. Glad to see confirmation that it wasn't an issue on my end.
Freenode's IRC servers were down which was unexpected for me. I was expecting old-school communication networks to not have a dependency on Cloudflare.
It really defies the original vision of the internet to have so many services depend on a single company. Almost every news site I was reading dropped off at once. I thought for a second that I lost internet in my own house.
Yes, it's really odd that core backbone providers can go down and everything works like it's supposed to. Even trans-Pacific cables can be cut and things will usually work with only increased latency. But there is not much redundancy for many companies at this layer; having redundant DNS providers is, I'm sure, possible but not something we think about very often, and of course many of the sites that are down are depending on the proxy and DoS mitigation services.
On my home network I use Google as a backup DNS provider so the whole internet didn't go dark for me, but I don't have a backup DNS host for my company's DNS records.
Redundant DNS is possible, but challenging when you're making use of features like geo DNS that don't lend themselves to easy replication via zone transfer.
I imagine most people would never expect something like this to happen, so having a fallback option for when Cloudflare has a huge interruption of service like this just isn't something they think about.
All the major cloud infrastructure providers have had outages of varying severity at one point or another...it's something you'd want to take into account for, say, a system that remote controls life-critical devices, but likely isn't worth the engineering time and added complexity for a productivity or social app with a small userbase. Working on many of the latter over the years I've generally said "well if {major cloud provider} is down, the internet is going to be all messed up for a bit anyway, so we'll accept the risk of being down when they're down, and reassess whether that keeps making sense as we grow."
>"well if {major cloud provider} is down, the internet is going to be all messed up for a bit anyway, so we'll accept the risk of being down when they're down, and reassess whether that keeps making sense as we grow."
This is a very common pattern and falls into the 'nobody got fired for buying cisco/microsoft/intel' trap.
I have two issues with it:
1) It entrenches the largest provider.
You would not extend the same leniency on outages to the third-best cloud provider; this means that people will just keep pushing the monopoly forward, even if the uptime or service is actually better on another provider.
2) You create a tight coupling of monocultures;
Simply put: You slowly erode the internet. Your site becomes an application in a distributed mainframe operated by a tiny minority of tech companies universally based in the US.
Why is this a problem? I could give moral answers here, but I think pragmatic ones are more convincing.
Giving ownership of the internet to the few gives them the ability to set the rules.
If you're on Amazon's AWS, what's to say they don't inspect your e-commerce systems and incorporate your business logic into their amazon.com shopping experience? They already do this to their marketplace and create competing products[0].
If you're doing really well, why not just drop a few packets here and there? I mean, they won't... you're paying, right?
Hell, if you do super well they can just change the rules and make it so your services get expensive in the exact way you use them, or even legislate you off the platform entirely.
Probably won't happen, but it's a lot of trust, you have to admit, and people shit on Apple for having that kind of power, and Apple is not even competing in the same market as most people on this site.
If you're on Google Cloud (which, I'm a fan of, btw) and you feed an ML model, well, you paid for it, but why shouldn't they also have a copy... after all, it's green to do so!
Their bigdata platform? Google loves data! Feed the beast.
The trust you have to give AWS or other cloud providers is not that different than the trust you give any number of vendors (email service, phone service, etc.). You have a contract with them that says they won’t do those things, and if they ever get caught doing them all the on-premise enterprise they’re spending a lot of effort on getting on board will dry up instantly.
Amazon can copy your business model just fine without looking at your servers. Most of what’s on your servers is probably irrelevant from their perspective.
Agreed, but the real problem is DDoS and nobody seems to know how to globally solve it. Fighting DDoS is expensive, so you see consolidation. It's well and good to live in a tiny farming town but when raiders start attacking every week, those castle walls and guards start to look really appealing.
It's nice that Cloudflare provides their services for free but scrubbing has existed for a long time. With your own address space and an appropriate budget it's not difficult to have Cloudflare/Akamai/AWS announce your IP space with a higher weight than a direct path to your infrastructure. That will give you a little bit more fault tolerance for incidents like these.
That's what we get for externalizing costs. It's not hard to track down sources, but network operators usually let it be, hence the incentives are probably counter-productive.
Agreed, but I think people really underestimated the forces at work that would cause so much consolidation into a couple internet giants.
The original idea was that with the barrier to entry being so low, anyone and everyone could set up their own websites, mail servers, etc.
But with it being so easy to compare and contrast service (i.e. the market being so open), it means that the competitive forces naturally consolidate to a winner-take-all model. If when starting out Cloudflare was just 5% better than the competition, it could have easily taken the vast majority of the mindshare on the internet. Couple that with the fact that there are huge advantages with scale to a business like Cloudflare's, and it's not hard to see how so much of the internet has become dependent on it.
DNS is far less of a single point of failure and more decentralized than Cloudflare. Nameservers can be and are operated redundantly via simple, resolver-side round-robin scheduling, and the TLD servers should have longer TTLs that allow plenty of caching. The root zone even has anycast thanks to using UDP. Take a moment to look at DoH and laugh.
You can also register your domain on multiple TLDs.
The "decentralized internet" folks always talk a lot about fighting corporate control. I think they should spend more time talking about resiliency and blast-radius reduction.
> Unlike previous DNS replacement proposals, D3NS is reverse compatible with DNS and allows for incremental implementation within the current system.
And the worst is, if you try to raise concerns about Cloudflare now, it gets brushed off as "CF already proxies half the internet; if it goes down, our stuff will be a minor concern".
I don't understand why the big companies don't always have at least two CDN providers, so they can failover to another one if something like this happens.
I know a lot of big companies do, but I am always surprised when you see ones that don't.
This isn't true... you can certainly do redundant dns with automatic failover between providers. Just set up NS records pointing to different providers.
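As a sketch of what that delegation looks like once it's in place (dnspython again; example.com is a placeholder for your own domain):

    # List a domain's delegated nameservers; with redundant providers you'd
    # expect to see hosts from more than one vendor in this set.
    import dns.resolver

    answer = dns.resolver.resolve("example.com", "NS")
    print(sorted(ns.target.to_text() for ns in answer))
    # e.g. ['ns1.provider-a.net.', 'ns1.provider-b.net.', ...] rather than
    # several nameservers all under a single provider's domain.

The catch, as mentioned elsewhere in the thread, is keeping the zones at both providers in sync, which is easy with plain records and harder with proprietary geo/failover features.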
My CRM was nonfunctional. That's some critical infrastructure for me. And then I'm wondering, is it me or is it my CRM? Turns out it's door #3 - Cloudflare.
The point of the status page is so you can point to it for your five nines SLA and go "look? we were only down for one hour". As soon as the money relies on the metric, the metric will reflect the money.
Despite their update, I like how they're saying only their recursive DNS had "degraded performance", while authoritative is "operational". The entire reason everything blew up was because their authoritative nameservers weren't responding.
I mean, I guess you're entitled to look at it that way, but I don't think it's dishonest of them to distinguish between "nothing is working" and "some regions aren't working".
Ahh, I remember when AWS went down (I think it was 2 years ago now?), or at least a data center in us-east. A majority of the internet went down, and the status page went down as well. Man, good times.
Status pages are a marketing channel, not a channel for developers, most of the time. An update most likely has to go through some layers before someone changes the status page.
I don't think it's just Cloudflare; I just had a fun 10 minutes seeing servers start flipping on my Server monitoring service[1]. This has only happened once or twice per year, and is usually due to weird global DNS issues.
(To give an update, I'm seeing from my monitoring systems (about 15 points around the globe) sporadic outages for Microsoft, Apple, Reddit, Bing, Node.js, Twitter, Yahoo, and YouTube. And my own servers (not behind CF at all) are also flipping up and down. It started around 21:14 UTC.)
Eh, what? There are many good reasons to have low TTL DNS...this exact outage being one of them. Update your records to go direct to your servers, and not through Cloudflare and bam you’re back up. Doesn’t work if your TTL is 86400
Doesn't help, as Cloudflare wants you to use their name servers, so you can't flip any records if the DNS itself is in trouble, like it is now.
And changing DNS servers often takes many hours (or days, if .net is involved apparently)
It was interesting that we saw our domains affected from the USA but from Mexico everything looked OK.
The crazier thing is that I tried to log in to our Cloudflare account and it never sent me the 2FA code... I still haven't been able to log in (Enterprise account).
We were down (downforeveryoneorjustme.com) completely, but back up now (as of a few minutes ago). Our domain wasn't even resolving; we use Cloudflare for frontend and DNS.
We had a surge of people checking if Discord was down on our site, then I noticed everything went down shortly after. Discord is still the top check right now.
I can't ever remember hitting these kinds of traffic numbers before.
Funny, I tried to use your site because the website I was trying to access stopped working. But your site was also down so I figured it was just my internet being cranky :/
Thanks! Yep, we have a lot of things on the todo. We want to add more user-focused / location-based outage information since our site is still too reliant on simple HTTP checks to report downtime. This is especially a problem with a Discord outage, for example, where the frontend website is not down, but there might be problems with the API, apps, or other components.
And I'd like to be able to have our site communicate outages like this Cloudflare one, where more than one site might be affected by a larger provider. Automating that is difficult.
This is still a side project, though, so I mostly work on it when I get the urge :)
That would be pretty interesting. Being able to drill down into individual pieces of a stack would be very informative, especially from a parent source. I bet a lot of services would even have that information readily available in their documentation for APIs.
How comfortable are you with open source? If you were willing to release your stack on Gitlab/Github, it might be worth your while.
Something’s wonky, because it’s not just Cloudflare. One of my personal sites is down that uses nothing but a VPS, and I noticed my Unifi AP disconnect from its controller a little bit ago. Fiber cut? Routing issues?
Huh. My Ubiquiti was reporting WAN link down during this outage. I'm using ATT fiber. I'm wondering if "link down" doesn't mean what I think it means. Now that I check, it says "WAN iface [eth2] transition to state [inactive]". I'm wondering if that means link down or if it's doing service checking.
I actually have a WAN2 configured but not plugged in and it was set to "Load Balancing: Failover Only" ... I wonder if all of my 'connection issues' were software assuming my network link is down and switching interfaces to an unavailable one.
To reply to myself: if you have a second interface configured for failover, it actually tests against ping.ubnt.com. I bet every single time my ATT fiber has "gone out" for a minute or two at a time, it's been bogus.
root@USG-PRO-4:~# show load-balance watchdog
Group wan_failover
  eth2
  status: Running
  pings: 2
  fails: 0
  run fails: 0/3
  route drops: 0
  ping gateway: ping.ubnt.com - REACHABLE

  eth3
  status: Waiting on recovery (0/3)
  failover-only mode
  pings: 1
  fails: 1
  run fails: 3/3
  route drops: 1
  ping gateway: ping.ubnt.com - DOWN
  last route drop : Fri Jul 17 17:32:58 2020
We can't keep going on like this. The vulnerability of centralised internet infrastructure is a huge problem for everyone. Somebody, somewhere, really ought to sort it all out
10-20 minute router misconfigurations and subsequent fixes are sometimes a fact of life. big network infrastructure is complicated, and sometimes the best laid route tables of mice and men do go abloop and die.
Outages happen no matter what the infrastructure is. There's no solution, they're just something you need to recognize and handle, which Cloudflare seemingly did relatively quickly here.
I feel like for a lot of sites CF & CDNs are the only way to survive Reddit/HN/etc - do you disagree?
I definitely agree in concept with you, but then I think back to how frequently script kiddies took down sites ~10 years ago, or w/e. I feel like what has changed is the massive CDNs in front of so many sites.
So while I do want a better solution, I'm not sure what it looks like. Thoughts?
Does it really matter? If you're small, who cares if you go down for half an hour? What, you'll make $0.02 this hour instead of $0.05? If you're big, you can afford your own infrastructure. Stick a few servers in a few colos around the world and you'll have better uptime than CF and friends anyway.
> I feel like for a lot of sites CF & CDNs are the only way to survive Reddit/HN/etc - do you disagree?
Reddit/HN/etc will send all users to the same URL. Almost all of those users will come without any pre-existing cookies. Serving the same content to all those users should not be impossible for most sites without CF or a CDN.
a) complexity: trick your servers into doing something hard
b) volumetric: overwhelm your servers with a lot of traffic
c) volumetric part two: overwhelm your servers with a lot of requests, so you respond with a lot of traffic
A and C are things you can work on yourself --- try to limit the amount of work your server does in response to requests, and/or make resource-consuming responses require resource-consuming requests; and monitor and fix hotspots as they're found.
B is tricky; there are two ways to handle a volumetric attack: either have enough bandwidth to drop the packets on your end, or convince the other end to drop the packets (usually called null routing). Null routes work great, but usually drop all packets to a particular destination IP, which means you need to move your service to another IP if you want it to stay online; that's hard to do if your IP needs to stay fixed for a meaningful time (TTL for glue records at TLDs is usually at least a day), and IP space is limited, so if your attackers are quick at moving attacks, you could run out of IPs to use. Some attacks are going above 1 Tbps though, so that's a lot of bandwidth if you need to accept and drop; and of course, the more bandwidth people get so they can weather attacks, the more bandwidth that can be used to attack others if it's not well secured.
I'm not very familiar with DDoS protection strategies. Can you please elaborate on what is meant in (c) by "make resource consuming responses require resource consuming requests"?
Making people log in before doing a search is a common example for forums. Search is hard; unauthenticated search will bring low-end forums down, so they make you create an account and log in.
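As a toy illustration, the gate can be as simple as "no account, no expensive query", plus a cheap per-account budget. The threshold and in-memory bookkeeping here are invented for the example:

    # Gate an expensive endpoint behind login plus a per-account rate budget.
    import time

    RATE_LIMIT = 5      # expensive searches per minute per account (made up)
    _recent = {}        # username -> timestamps of recent searches

    def search_allowed(user):
        if user is None:
            return False                      # anonymous: no expensive search
        now = time.time()
        window = [t for t in _recent.get(user, []) if now - t < 60]
        if len(window) >= RATE_LIMIT:
            return False                      # over budget this minute
        window.append(now)
        _recent[user] = window
        return True

    print(search_allowed(None))      # False
    print(search_allowed("alice"))   # True

The exact numbers don't matter; the point is that the attacker now has to pay something (create accounts, get through signup, spread requests out) before they can make your server do the expensive work.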
Despite them actually mixing it into their entropy pools, the lava lamps are still entirely for show. The noise of the camera sensor itself is going to contribute orders of magnitude more entropy than the slow movement of the lamps. It's not completely a fake stunt, but it's certainly headline-optimized.
> Appears that the router in Atlanta announced bad routes (effectively a route leak). Only impacted our backbone. Not all of our PoPs are connected to our backbone, so some would not have seen an issue. Appears to have impacted about 50% of our traffic for a bit over 20 min.
I personally find sites like https://outage.report or https://downdetector.com which tally up the number, regions and history of people saying it isn't working for them more conclusive.
That page still shows "Cloudflare System Status: All Systems Operational" for me, but it's definitely down for me. Along with 1.1.1.1, which is... bad.
Same. Even then, a bunch of sites are down. Maybe only ones behind Cloudflare? So far I've been trying to hit the various down-detector sites and none of them will load. Google, Reddit, and Hacker News are all fine.
Same. Had to go through the entire troubleshooting process --- is it my internet connection? my DNS resolver? firewall? ISP starting to filter DoT queries somehow?
Only last in my mental list was the possibility that Cloudflare would be down.
Hope they publish a detailed post-mortem. It's always fun to read (but certainly very painful for those directly involved in writing it).
Monday Morning RCA: "We pushed out some routine code updates, but this really weird thing happened causing a resource utilization spike on our DNS systems. Because of this other really weird thing, this affected all of our global infrastructure simultaneously. Here's a deep engineering dive into this one weird thing that brought everything down."
Cloudflare's DNS (1.1.1.1) is failing to respond to most/all queries, which I'm observing as the root cause of a bunch of connection issues (name lookup failure).
Interestingly the same domains don't show up on google's (8.8.8.8) DNS at all.
Lol, talk about timing. I'm currently working on a TLS library and was pulling my hair out trying to figure out why tests against CF sites suddenly failed. Can't even ask my cohorts on Discord because they are behind CF, too!
"StackOverflow devs have the most difficult job in the world. After all, when StackOverflow is down, they can't exactly look for help on StackOverflow".
No kidding! We had literally deployed a major page redesign and started watching our analytics drop off on its way to zero. My heart is still racing. I wouldn't normally be happy about a Cloudflare outage, but in this case it's better than Google deciding to remove us from their index.
Reminder for firefox users: Firefox uses DNS over HTTPS and the default is cloudflare. If you're having DNS issues, you need to disable it until cloudflare is back up.
Good DNS practice (at least when I did system admin 10 years ago) was ALWAYS having a secondary at some other location/network. Why do we just put some info in Cloudflare and call it good these days?
It's hard to use Cloudflare as a reverse proxy without using them as your delegated name servers (maybe you can use CNAMEs on paid plans?), and fancy dynamic nameservers make it hard to run secondary servers with zone transfers.
What luck, I chose today to install a new piece of network gear. I thought I had managed to totally FUBAR my network. DNS was failing, "ping 1.1" (my current go-to "Am I connected to the internet?" test, as it requires the fewest keystrokes and hits the Cloudflare DNS 1.0.0.1) failed, and I just assumed it was my fault. Backed out my changes, and discovered that, in fact, the internet was down.
Yeah, I was waiting for it to fix itself, then tried my cell phone and realized that was down too. I assumed it was an issue with regional/backbone routing or something. Especially because status pages, which I wouldn't expect to be hosted on AWS (because of the need for status pages to stay up when AWS goes down), seemed to also be down. Didn't realize it could be Cloudflare...
Ditto, except vice versa. My machine is set to the router, which uses Cloudflare. Other machines use whatever is default for macOS (I try not to touch those). Once I realized they were working and I could access the internal network from outside, I started diagnosing DNS. Came here via 8.8.8.8.
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;discordapp.com. IN A
;; ANSWER SECTION:
discordapp.com. 140 IN A 162.159.135.233
discordapp.com. 140 IN A 162.159.129.233
discordapp.com. 140 IN A 162.159.130.233
discordapp.com. 140 IN A 162.159.134.233
discordapp.com. 140 IN A 162.159.133.233
I noticed a lot of packet loss to 1.1.1.1, not an outright "outage", maybe they were rolling a deployment?
Edit: Looks like a deployment to me (looking at the logs I could see cascading traces, so it took down one DC and the other started responding - increased latency - and then down, etc..), gonna be an interesting post-mortem!
"Update - This afternoon we saw an outage across some parts of our network. It was not as a result of an attack. It appears a router on our global backbone announced bad routes and caused some portions of the network to not be available. We believe we have addressed the root cause and are monitoring systems for stability now.
Jul 17, 22:09 UTC" - https://www.cloudflarestatus.com/
What I found particularly interesting is that my MacBook Pro (work laptop) didn't start up properly anymore and I wasn't able to start applications... sorry, but wtf. Now I hate Apple and their shitty, overpriced products even more.
What was interesting and scary is that our monitoring system didn't notify us. Our email was down because we use Cloudflare for DNS, and our monitoring provider's SMS gateway was down, so we didn't get SMS messages.
This is likely your computer's DNS resolver (if you're using 1.1.1.1, you're down); I'd switch to 8.8.8.8 temporarily. We've had PagerDuty alerts coming in since the start (a whole bunch of DNS errors from Pingdom), and when I click on the Slack link, PagerDuty works for me.
Interestingly this seemed to only affect resolver service. I use Cloudflare pretty extensively on all my sites, but only in DNS mode (no CDN / proxy). The hosts continued to resolve fine during the outage (following root DNS resolution chain, no recursive resolver involved). I imagine their CDN internally uses their resolver service which explains the outages, and some unrelated 3rd parties who don't use CF on their domain at all still created a hard dependency on CF by using their recursive DNS server.
It's unfortunate that both the primary and secondary cloudflare DNS is down. I just switched my secondary to google.
This allows my internet to "work" during this time, but adds about 1s latency to resolutions. Presumably that's the time it takes my internal DNS resolver to try the secondary.
Consider running your own full resolver like Unbound. Then you don't have to rely on a DNS provider like Google or Cloudflare. It's really nice not having the whole internet go down when Google or Cloudflare DNS is down.
I don't use Cloudflare, but I do notice Cloudflare services being down.
Right now, I can't get to my own website (hosted on DigitalOcean, not through Cloudflare), but Oh Dear claims it's up. So I suspect that the problem is closer to me than it is to DigitalOcean (or Cloudflare).
Aside from this one issue, is switching to 1.1.1.1 a good idea in your guys' experience? I just realized I have the DNS from my ISP, which is probably how they inject bullshit 404 pages full of ads. What is the fastest/best public DNS in your guys' experience?
I've been using https://nextdns.io/ - it works fast and, most importantly, blocks a bunch of ads (user-configurable), so it makes browsing on mobile much nicer.
For us, it's cloudflare. Our ISP is connected to KCIX, and cloudflare apparently has 1.1.1.1 servers in Kansas City. No other free DNS provider is as quick, for us. 18ms or so RTT, as opposed to the WISP's internal latency of ~10ms. Central Kansas.
Not for me, `dig @1.1.1.1 google.com` is returning SERVFAIL still. Their anycast config may be broken in some way (ie. the backends for some regions are down, but still advertising routes)
I thought my issue was with Comcast, then I realized I'm using CF's DNS entries for my home network. I removed those 1.1.1.1 entries and some sites are working.
Dang, I'm pretty disappointed in CF. I've never experienced a DNS outage with this much of an effect.
This was especially bad because I use Cloudflare public DNS exclusively at my house and it went down as well. I didn’t even think to check DNS, I just assumed it was AT&T being shiest AT&T.
I should probably run a blend of 1.1.1.1 with 8.8.8.8 instead.
Last I read about 7 million hosts are behind Cloudflare. Maybe around 3% of the web, but who knows if that counts for critical assets etc rather than pages served.
Shameful that so much of our decentralised web is so centralised and breakable in one place.
Just got a whole bunch of alerts that my services are down. Tried logging into Digital Ocean (who it seems uses Cloudflare) to get it fixed. Could not access their dashboard to reroute things.
I'm surprised so many people still use them. They took my business down (along with half the internet) a few years ago, and I learned that they were too large a point of failure.
Vercel also appears to be dropping out and coming back intermittently over the last 30 minutes or so. I'm not aware that they use Cloudflare, although they do mention using AWS.
NextDNS got taken out by this; I'd been really happy with it up until now. And unfortunately "DNS service went down" has a wide enough blast radius at home now that it's a real pain.
Most of my devices went belly up, and I was trying to figure out what it might be (I run NextDNS on my router). I switched over to cell, noticed Discord was down too, and started thinking about NextDNS. I toggled DNS to Google and noticed it immediately worked.
This is great. I already have bad enough internet (rural area with 3-to-6-digit latency, averaging 4 digits, and barely a few kilobytes of speed), and having both Google smearing their reCAPTCHAs everywhere (which are not really friendly toward low-speed internet / non-Chrome users) and Cloudflare proxying half the internet but lately not doing a great job at keeping consistent uptime does not help much.
At least I am glad HN exists; it is the only thing that loads everywhere.
I was trying to play video games but couldn't connect. Amazing how connected the web is now - one big hub goes down and brings the whole house of cards down with it.
Did anyone else see their ATT internet go down? The DNS issues started and then the Pace 5268AC rebooted. I don't use cloudflare for dns. Does ATT's backend?
On the contrary, ATT actually squats on the CloudFlare DNS IP address. IIRC that modem is one of the affected ones where it uses 1.1.1.0/24 internally. You shouldn't even be able to use CloudFlare DNS normally.
Yesterday CloudFlare took down some of our products because they (not us) misconfigured some DNS thing. Kind of funny to see it happen again a day later.
Actually, the site is running but not accessible due to the issue. Glad you got the heads up; after a while I had to pause monitoring to prevent side effects.
How can I have Cloudflare plus something else as a DNS failover? We are afraid to set a long TTL and then have our IP change for some reason. What do you guys recommend?
Given that the US is basically in a non-shooting war with China, I wonder if this is something technical or part of some kind of attack.
Something that I’d keep in mind.
There are enough ways for bits of the Internet to go kablooey on their own that “it’s an attack!” is a pretty big jump to a conclusion. If this turns out to be something other than Cloudflare tripping over a weird bug, my first guess would be that someone fat-fingered a BGP table yet again.