AWS EC2/RDS Outage in us-east-1 (amazon.com)
212 points by jacobwg on Aug 31, 2019 | 146 comments



Just wanted to add a quick note before we get the usual deluge of "you should be running in multiple AZs and regions" posts: These outages are relatively rare and your best decision might just be to accept the tiny amount of downtime and keep your app simple and inexpensive to run.

I of course don't know the tradeoffs involved in running your system, but I know for a lot of my situations the simplicity of single AZ with a straightforward failover option is usually the right tradeoff.


This should always be a business calculation, but I do think it's worth noting that there's at least an order-of-magnitude jump in difficulty between multiple AZs and multiple regions. Multi-AZ is especially easy if you're using services like RDS where it's designed in, so I'd consider it a solid bridge step.

To put some numbers on that: I've been running a relatively well-trafficked website across multiple AZs since 2011. We had ~20 minutes of downtime when they had a network routing issue for us-east-1 and a few hours of degraded service when S3 had a region-wide outage. I haven't added up the number of single-AZ outages during that period, but based on the RSS feeds I think we avoided a good bit more downtime than that, for a very modest additional cost.


Good point, the multi-AZ RDS feature is a nice way to get most of the resilience upsides without any additional app complexity. You do double your database cost, but that might be worth it.


Not necessarily. You could keep the reader a smaller size and scale it up only if needed.


"dual az" is a checkbox that doubles your cost for transparent failover; it's different from the read only replica


That’s assuming that you a) religiously test with the smaller size and b) are comfortable that scaling up will work when lots of other people are shifting workloads, too. I usually work on projects where we haven’t wanted to deal with that but that’s a judgement call.


Being on AWS also makes downtime easy to explain to customers: AWS was down, and customers are pretty understanding in that case and don't demand to know why you aren't multi-AZ, etc. (Of course, YMMV based on the sensitivity of your business.)


Kind of like the "no one ever got fired for recommending IBM", if you have significant downtime on Linode or Hetzner, people are going to ask "why weren't you on AWS?!". If you're on AWS and AWS goes down, you get to skip that question entirely. You were already using the logical choice, you don't need to defend anything.

If Netflix can go down when AWS goes down, so can your app. AWS outages impact so much of the Internet that people will just accept it.

Of course here I am with a site on Heroku which uses AWS... impacted by the AWS outage... fielding questions about why I didn't pick AWS if Heroku suffers outages like this. Can't please them all.


Great point.


Netflix is multi region and even when a full region went down they were still fine.

So your statement relies on customer ignorance.


Yep. I call it the "AWS Chicken Pox Party."

We got lucky this time: RDS, ELB, EC2, and Lightsail instances all in us-east-1 across multiple accounts, and no issues (knock on wood, and I understand we'll get it next time). Especially happy because I had a two-day neural net training job running and it's still going; losing that would have been depressing. Phew.


Why don't you just checkpoint the model every n steps? NN training jobs fail for a myriad of reasons; you can easily reduce risk by routinely saving state.
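
Something like this is usually enough; a rough PyTorch sketch (the model, data, and paths here are stand-ins, not your actual job):

  import torch
  import torch.nn as nn

  model = nn.Linear(128, 10)                                 # placeholder model
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
  loss_fn = nn.MSELoss()

  CHECKPOINT_EVERY = 1000
  CKPT_PATH = "checkpoint.pt"

  for step in range(100_000):
      x, y = torch.randn(32, 128), torch.randn(32, 10)       # placeholder batch
      optimizer.zero_grad()
      loss = loss_fn(model(x), y)
      loss.backward()
      optimizer.step()

      if step % CHECKPOINT_EVERY == 0:
          # Save everything needed to resume after an instance loss or OOM kill.
          torch.save({"step": step,
                      "model": model.state_dict(),
                      "optimizer": optimizer.state_dict()},
                     CKPT_PATH)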


After my last oom-party, I now have it checkpointing every 1000 steps (way too often, I think, but there's plenty of disk), but I just really really want it to complete a full run. ;)


For us this is exactly the correct approach. We could have spent millions of dollars and thousands of man hours hardening things to be resilient to single region outages. But for what? We aren't GE or Google. If our conference line goes down for 2-3 hours per year because we don't have apocalypse-proof infrastructure, literally nothing bad happens to our business. In this exact outage we are discussing, all of my coworkers are at home having breakfast with their families and doing various weekend activities. No one but me will know there was even a problem until I log into the AWS console and find the alerts. Worst case, I have to reboot or restore a few affected instances on Tuesday morning.

It seems like a lot of businesses are chasing this ideal of perfection and end up much worse off than if they had just stuck the application on a single server in a semi-reliable part of the world.


2-3 hours per year is a lot of downtime. Most competent bare metal providers see maybe one major outage of less than an hour every 3-5 years. When all the proper redundancies are in place, nothing should cause a major outage other than a facility-wide power outage (if the load somehow gets dropped because the generators don't start right away as they should) or a misbehaving, only partially failing, core network infrastructure device.

Specific providers aside, there's more complexity involved in a large cloud provider's infrastructure and much more that can go wrong as a result. Letting a code update or an orchestration issue at your infrastructure provider become a potential source of major outages is a huge and unnecessary risk. You don't need that much scale: just use enough resources to fill up a few whole physical machines for a few hundred dollars a month. Add some globally distributed BGP anycast DNS and database replication and you have enough redundancy to withstand most of the worst major infrastructure failures.

I would understand if AWS were super simple and convenient, but these days the learning curve seems far greater than setting up the bare metal solution described above, while being almost an order of magnitude more expensive for the equivalent amount of resources.

How did we end up here? Does brand recognition just trump all technical and economic factors, or what am I missing?

Disclaimer: I run a bare metal hosting provider


> Having a code update, or some orchestration issue from your infrastructure provider be potential points of major outages are huge and unnecessary risks.

I trust any of the big cloud providers to do these things more reliably than I can. In particular, if I'm going to replicate a database across data centers within a region (availability zones as the big cloud providers call them), I'm quite sure that a managed database service will be more reliable than my own hand-configured cluster.


Is 3 hours per year really that bad? It's 99.96% uptime (three 9's), which I'd think is fine for most small to medium businesses.


From a dedicated hosting provider's perspective, some customers will notice interruptions longer than 5 minutes (1 monitoring cycle), start submitting trouble tickets at 15 minutes, and require RFOs at 30+ minutes of downtime. At an hour and up, even just one time, we would probably start to see cancellations. Not going to say I'm not envious that AWS seems to have a much more outage-tolerant customer base.


I don't want to advertise your particular company, but since we're talking about numbers, how does your bare metal offering compare to an Amazon EC2 offering, for example? And how would a customer that needs to scale their load do it?


A c4 dedicated host at AWS (picked simply because it's the 2nd listed in the Dedicated Hosts configuration table; c3, the first, doesn't show up on the Instance Types page) comes with an E5-2666 v3, 64GB of RAM, and no storage for $810/mo.

The CPU model is non-standard, but at 10 cores and 2.90GHz it's effectively a slightly higher-clocked version of the E5-2660 v3 (10 cores, 2.60GHz). The first Google result for a 2660 v3 dedicated server with an order page that allows adjusting the options (13 usable IPs, 64GB RAM, a minimal 120GB SSD) comes out to $275/mo.

And this is a whole-box-to-whole-box comparison. The cost of individual instances at AWS equivalent to one of those boxes can be much higher depending on the type and size.


Don’t mind me. I’m just here to corroborate your claims of downtime as a consumer of bare-metal hosting providers for 15-something years.


Did you ever face a situation where you or your clients needed more compute power and cloud scalability would've been more convenient/cheaper?


No, a datacenter holds a seriously large amount of compute. Just a single rack is 38U usable in most cases; depending on compute density and power availability, you can get a good 2,000 CPU cores and a few dozen TiB of DDR4 from a single rack (with something like a Dell MX7000 chassis).

And it's incredibly rare you'd be limited to a single rack of course.

Cloud has many tangible benefits, but "amount of compute available" is not one of them. Time to acquire compute, though, is. (and, obviously, management of the resources/datacenter operations).

Cloud is almost never cheaper, even factoring in salaries. It's just very convenient if you're small enough not to have people providing compute very well internally. (and, internally, people tend to understaff/underfund the teams that would do the same job as cloud operators are doing)


Cloud can often be cheaper than on-prem or a colo if you are both willing to be "cloud native" and change your processes, and you have people who actually know what they are doing, rather than a bunch of "lift and shifters": old-school netops guys who got one AWS certification and now only know how to click around in the UI and duplicate an on-prem infrastructure.


Maybe if your load is unusually extremely erratic. In the vast majority of cases, you could purchase 2-3x more than you need in bare metal hosting resources (with data centre and hardware operations already outsourced), making scaling not an issue, and still see significant cost savings compared to public cloud which is typically 6-7x the cost for equivalent resources.


If all you care about is running a bunch of VMs yes. But if you’re just running a bunch of VMs and using cloud hosting - you’re doing it wrong.

- we don’t want to maintain our own build servers. We just use CodeBuild with one of the prebuilt Docker containers or create a custom one. When we push our code to Github, CodeBuild brings up the Docker container, runs the build and the unit tests and puts the artifacts on S3.

- We don't want to maintain our own messaging, SFTP, database, scheduler, load balancer, auth, object storage, servers, etc. We don't have to; AWS does that.

- We don’t want to manage our own app and web servers. We just use Lambda and Fargate and give AWS a zip file/docker container and it runs it for us.

- We need to run a temporary stress test environment or want to do a quick proof of concept: we create a CloudFormation template, spin up the entire environment, run our tests with different configurations, and kill it. When we want to spin it up again, we just run the template.
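
The spin-up/tear-down loop above is only a few calls with boto3; a rough sketch, with the stack name and template file made up:

  import boto3

  cf = boto3.client("cloudformation", region_name="us-east-1")
  STACK = "stress-test-env"                     # placeholder stack name

  with open("stress-env.yaml") as f:            # placeholder template
      body = f.read()

  cf.create_stack(StackName=STACK, TemplateBody=body,
                  Capabilities=["CAPABILITY_IAM"])
  cf.get_waiter("stack_create_complete").wait(StackName=STACK)

  # ... run the stress tests against the stack's outputs ...

  cf.delete_stack(StackName=STACK)
  cf.get_waiter("stack_delete_complete").wait(StackName=STACK)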

We don't have to pay for a staff of people to babysit our infrastructure; between our business support contract with AWS and the MSP, we can get all of the troubleshooting and busywork support as needed.

I’m a software engineer/architect by title, but if you look at my resume from another angle, I would be qualified to be an AWS architect. I just don’t enjoy the infrastructure side that much.


>I just don’t enjoy the infrastructure side that much.

That's fair, that's totally your right.

However, you're talking about absolute cost, and unfortunately your examples weave through true and false quite frenetically.

> - we don’t want to maintain our own build servers. We just use CodeBuild with one of the prebuilt Docker containers or create a custom one. When we push our code to Github, CodeBuild brings up the Docker container, runs the build and the unit tests and puts the artifacts on S3.

Like all things business, "want" and "cost" are different, in this case, depending on your size of course, it could easily be cheaper to have a dedicated "build engineer" maintaining a build farm. This is how the majority of people do it. (I work in the video games industry, it's _MUCH_ cheaper to do it this way for us)

> - We don't want to maintain our own messaging, SFTP, database, scheduler, load balancer, auth, object storage, servers, etc. We don't have to; AWS does that.

Again, those are "wants", TCO can be much lower when out of the cloud. But again, depends on scale. (as in, lower scale is cheaper on cloud, not larger scale).

> - We don’t want to manage our own app and web servers. We just use Lambda and Fargate and give AWS a zip file/docker container and it runs it for us.

I mean, 1 sysadmin can automate/orchestrate literally thousands of webservers.

>- We need to run a temporary stress test environment or want to do a quick proof of concept: we create a CloudFormation template, spin up the entire environment, run our tests with different configurations, and kill it. When we want to spin it up again, we just run the template.

Yes, this is a real strength of cloud.

> We don't have to pay for a staff of people to babysit our infrastructure; between our business support contract with AWS and the MSP, we can get all of the troubleshooting and busywork support as needed.

Yes, but you are paying "overhead" for all of that, and not having talented engineers on your payroll who understand your business critical systems is, in my opinion, foolish.

I've dealt with vendor support and it's incredibly hit and miss, and it's much more "miss" when you're a smaller customer to the vendor. Of course, this is anecdotal.


> depending on your size of course, it could easily be cheaper to have a dedicated "build engineer" maintaining a build farm. This is how the majority of people do it. (I work in the video games industry, it's _MUCH_ cheaper to do it this way for us)

Cheaper to have a dedicated build engineer than using Codebuild? I just looked at my bill for August, my startup has $50K/mo. AWS spend across 4 regions in US/EU/Asia. We use Codebuild to build and deploy all of our infra from GitHub, including a ton of EC2 for our dedicated apps.

Guess how much my bill for CodeBuild was in August? 17 cents!

  $0.17 CodeBuild
    $0.06 Asia Pacific (Singapore)
    $0.07 Asia Pacific (Tokyo)
    $0.02 EU (Frankfurt)
    $0.00 EU (Ireland)
    $0.02 US West (Oregon)

I'd like to see you hire a build engineer for $0.17. AWS services are dirt cheap because they let you automate all of the stuff that would require dedicated engineers for, while you can focus on your business, or what differentiates you.


I’m not disagreeing with you. The only times you save money by going to the cloud is by reducing the number of people you need or if you have a lot of elasticity in demand. I would never recommend anyone going to the cloud just to reproduce an infrastructure they could do at a colo.

> Like all things business, "want" and "cost" are different, in this case, depending on your size of course, it could easily be cheaper to have a dedicated "build engineer" maintaining a build farm. This is how the majority of people do it. (I work in the video games industry, it's _MUCH_ cheaper to do it this way for us)

That’s $80K to $100K. You can buy a lot on AWS for that price...

> Again, those are "wants", TCO can be much lower when out of the cloud. But again, depends on scale. (as in, lower scale is cheaper on cloud, not larger scale).

That’s another $100K to $200K....

> I mean, 1 sysadmin can automate/orchestrate literally thousands of webservers.

That's yet another $100K. You're up to at least $250K - $500K in salaries.

> Yes, but you are paying "overhead" for all of that, and not having talented engineers on your payroll who understand your business critical systems is, in my opinion, foolish.

I am one of the "talented engineers"; that's why I mentioned I could go out and get a job tomorrow as an AWS architect. My resume is very much buzzword-compliant with what it would take, on the development, DevOps, and netops sides, to hold my own in a small to medium size company. I just find that side of the fence boring, so we outsource the boring work, or the "undifferentiated heavy lifting".

As that side got too much for me, we hired one dedicated sysadmin type person to coordinate between what he would do himself, our clients and our MSP.

I'm actually trotted out as the "infrastructure architect" to our clients even though my official title and day-to-day work is as a developer. I haven't embarrassed us yet.

> I've dealt with vendor support and it's incredibly hit and miss, and it's much more "miss" when you're a smaller customer to the vendor. Of course, this is anecdotal.

I agree completely. If it’s something complex, I either do it myself or have very detailed requirements on what our needs are. But honestly, the more managed services you use, the less you have to do that part.


I'll note that he says he works in the video game industry, where infra engineer/sysad/etc salaries are, in my anecdotal experience, significantly lower than even run of the mill positions in "regular" companies, and especially lower than SV/startup/big tech companies. Offers I've received from several game companies were less than 50% of what I had from other companies, and I was told when I tried to negotiate that they wanted people that were passionate about games and what they were building, and not people just looking for a cushy job. That can change the economics on the situation.


I'm not even coming from the perspective of a Silicon Valley big tech company. I'm in Atlanta. We would need at least three additional employees to handle our infrastructure/devops workload at a colo, and that would be around half a million dollars in fully allocated cost to hire them. As opposed to the additional cost of the business support plan, the cost of the MSP, and hiring senior developers who know their way around AWS. Also, we finally hired one person to coordinate everything and take the busywork off the backs of the leads.

We can do a lot with a half million dollars a year on AWS.


To add a little context I currently spend around 500,000 CAD on infra in GCP per month, which is roughly half of my total infrastructure (in terms of raw compute/bandwidth use). The remaining metal costs 100,000 CAD / month.

As I was implying. You’re just outsourcing your ops. At scale, you end up spending significantly more than you expect.


That’s the difference. Whether you operate at a small scale or a large scale, if you have web servers, database servers, load balancers, build servers, network infrastructure etc. If you are at a colo, you still have a minimum number of people you have to hire and no one who is any good is going to work below market rates and if they are any good, they would probably be bored out of their minds at a small company. “Outsourcing your ops” makes perfect sense until it doesn’t.

Also, when I put my software architect hat on (and take my infrastructure hat off), it's a lot quicker to get things done by just asking our MSP to open an empty account in our AWS Organization, spin up the entire infrastructure, pilot it, get it approved and audited, and then run the same template in the production account, without having to wait on a change request, approvals, pre-approval security audits, etc.

I'm also not advocating going all in on cloud. With a Direct Connect from your colo to your cloud infrastructure, it sometimes makes sense to have a hybrid solution: everything from using your cloud infrastructure as a cold DR standby to using it for green-field development where a team doesn't need to be shackled by change requests, committees, etc.


I think we’re agreeing but we draw the line in different places.

Cloud is great for speed of deployment. But once something is built, is stable, and has predictable load, it's a huge cost saving to bring it in-house. Many don't, probably because they've used some cloud-only technologies or fear the migration path will take time.

So you just continually line Bezos's pockets instead of using the cost savings to remain liquid.


3 hours of downtime per year equals 99.96% uptime.

In what world is that a lot of downtime?


Reliability is weird; you're only as reliable as the sum of all your critical components.

Usually you strive for "five 9's" in infrastructure, though obviously there's a lot of wiggle room depending on the business case. But reliability for individual components gets exponentially harder with each 9 after the first 2.

99.96% uptime for a datacenter is shockingly low once you take connection issues into account (i.e., the number of successful inbound packets vs. unsuccessful ones, not just served requests). For context, my company has around 15 datacenters around the world which routinely hit five 9's, with only a few instances of datacenters being down for 2-3 minutes during a particularly bad ISP outage.

The overwhelming majority of degradations are related to bad code being deployed. But since overall reliability is the sum of all components' availability, it follows that permitting more outages is less preferable, especially since they affect all, or at least the majority, of components in a given region.


In a world where you have SLAs with your customers, in which you commit to something better?


Damn, these ships must really be run tightly.

In every company I have worked for, the amount of outages caused by bugs and other post deployment issues was already above that number.


Multi-AZ is actually fairly easy to do; the only difficulty is in some special cases. Multi-region is a bit harder due to higher latency.

IMO you should run everything you can as multi-AZ, and if some services are harder to do and you don't need the resilience, put them in a single AZ.

The thing is, if you keep everything in a single AZ it will be much harder to change when this requirement becomes important.


If your business is so complicated and mature that it takes millions of dollars to build multi-region tolerance, you probably need that tolerance.

For simpler sites, having multi-region failover (even if it's a manual failover and you lose a few in-flight transactions) is much easier to build.


An occasional outage is sometimes good for an app (depending, of course, on how mission-critical it is):

1. People don't realize how much they love and depend on you until you're gone.

2. Keeps you on your toes, it's easy to get complacent when everything just runs along happily for months and years on end.

I do wish there was a way to train users that millions of them reloading constantly as service ramps back up doesn't accelerate the ramp-up time, though. ;)


Yes! If you haven't had an operational issue with a service in a while, you should force one on a testing stack to make sure it fails like you expect, you can recover gracefully, your monitors work, etc. (lots of folks call these "game days"). A service that hasn't had an ops issue in a long time is a ticking time bomb: when it does fail, nobody will be very familiar with it, and environmental assumptions might have changed, which could result in taking a much longer time to recover.

When I design services these days, I try to design them so these failure scenarios are constantly exercised. Eg if I care about multi-AZ resiliency, I try to design it so that it’s forced to fail over to other AZs all the time. Or at the least, write tests for the scenario. Exceptional behavior or code paths are dangerous.
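
A bare-bones version of that forcing function can be a small scheduled script; a sketch (not a real tool, and the opt-in tag and region are assumptions):

  import random
  import boto3

  ec2 = boto3.client("ec2", region_name="us-east-1")

  # Pick a random AZ and terminate a couple of opted-in instances there, so the
  # AZ-failover path gets exercised regularly rather than only during real outages.
  zones = [z["ZoneName"]
           for z in ec2.describe_availability_zones()["AvailabilityZones"]]
  target_az = random.choice(zones)

  reservations = ec2.describe_instances(
      Filters=[
          {"Name": "availability-zone", "Values": [target_az]},
          {"Name": "tag:chaos-eligible", "Values": ["true"]},   # made-up opt-in tag
          {"Name": "instance-state-name", "Values": ["running"]},
      ]
  )["Reservations"]

  instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
  if instance_ids:
      ec2.terminate_instances(InstanceIds=instance_ids[:2])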


Indeed. Our company's core business occurs in batch, background processing, that needn't be real time. If it doesn't run right now, there's literally no damage to the business or our customers if it runs in an hour. We have a customer-facing website, but there's very little there that can't be served by a cache.

tbh, I can't recall a support request from a customer that was caused by our infrastructure vendor and not our product.


Maybe just because it's been around the longest, but my impression is that us-east-1 seems to have more than its fair share of outages. Personally, for my single-region applications focusing on US customers, I go with us-east-2. Knock on wood.

I'm interested in any evidence to back up my impression if anyone has bothered to do the proper data gathering.

(Aside, stink eye on whoever made a breaking change over a holiday weekend, if this turns out not to be random.)


It’s because us-east-1 is the oldest and by far the largest of any AWS region. Issues get caught and fixed there before they show up at other regions.


I can understand if the networking implications and data replication issues are too complicated for you, but, if you have failover in the same region, you’re already paying the extra cost.

Working with multiple regions is cost friendly on AWS. You should put in the time and learn how that stuff is done, it’s not as complicated as you think.


Or how about no AZ where possible using a serverless architecture/lambda?


"Serverless" is not magic. The minute you need to attach to your VPC (I refuse to say "run inside your VPC"; that's not correct), you still have to worry about multi-AZ. If you are using RDS, as opposed to DynamoDB, you have to configure it for multi-AZ.

I use Lambda all of the time and I'm definitely not afraid of the "lock in" boogeyman, but I always architect my lambdas to make moving away from Lambda to either Fargate or just an EC2 instance as easy as possible.

On another note, it's just as easy to architect your regular old EC2 instances running stateless servers to be AZ-failure resilient. Just set up an autoscaling group with a min/max of 1 and configure it to work across multiple AZs.

Also, Lambda comes with its own set of limitations: maximum runtimes of 15 minutes, cold start times, temporary storage space of only half a gig, limited CPU/memory options, etc.
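
The autoscaling group setup above looks roughly like this; a boto3 sketch, with the launch template and subnet IDs as placeholders:

  import boto3

  asg = boto3.client("autoscaling", region_name="us-east-1")

  asg.create_auto_scaling_group(
      AutoScalingGroupName="web-single-instance",
      LaunchTemplate={"LaunchTemplateName": "web-server", "Version": "$Latest"},
      MinSize=1, MaxSize=1, DesiredCapacity=1,
      # One subnet per AZ; if the instance's AZ dies, the group relaunches it
      # in a surviving AZ.
      VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222,subnet-cccc3333",
      HealthCheckType="EC2",
      HealthCheckGracePeriod=120,
  )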


https://status.heroku.com/incidents/1892 - it appears Heroku is being particularly affected. We've had multiple sites on multiple accounts go down in the past few minutes.

EDIT T16:31Z: It appears Heroku has failed over their dashboard, but dynos are still failing to come online. We had assumed that they had multi-region failovers for their customers. Incredibly disappointing.


We cannot even restart/turn off the dynos or get into the dashboard to turn off and kill our background tasks for some of our clients.


Heroku: don't restart your app; restarting is broken

Also Heroku: we auto restart your dyno every day!

So now my app is down because Heroku forced it to restart, and none of our hourly employees can work :|


Is it best practice to run the dashboard and the cloud service in the same region of the same cloud provider?


I believe it's not.

As a PaaS, I would think that they would run a high-availability cluster across at least 2 regions so that they would have a mechanism in place for events like these. I know it's expensive, but if you charge $250 for 2.5GB of RAM I believe you would have enough money to cover it. I also think, as you hinted, that they should separate services across different regions.


HN discussion of heroku outage: https://news.ycombinator.com/item?id=20846270


Looks to have been caused by a loss of utility power and subsequent backup generator failure at one datacenter.

> 10:47 AM PDT We want to give you more information on progress at this point, and what we know about the event. At 4:33 AM PDT one of 10 datacenters in one of the 6 Availability Zones in the US-EAST-1 Region saw a failure of utility power. Backup generators came online immediately, but for reasons we are still investigating, began quickly failing at around 6:00 AM PDT. This resulted in 7.5% of all instances in that Availability Zone failing by 6:10 AM PDT. Over the last few hours we have recovered most instances but still have 1.5% of the instances in that Availability Zone remaining to be recovered. Similar impact existed to EBS and we continue to recover volumes within EBS. New instance launches in this zone continue to work without issue.

https://status.aws.amazon.com/rss/ec2-us-east-1.rss


I've noticed both Twitter and Reddit were having issues this morning, so this makes sense.



Good call, I didn't put the two together.


I got paged 50 minutes before AWS updated their status page. We are running on AWS's managed Kubernetes offering (EKS), and about one third of our nodes were running in the affected availability zone. We were then able to move all of our traffic out of that AZ, which solved our issues. The main symptom was HTTP requests made by our backend to 3rd party APIs failing, but only on requests originating from that AZ.


Reddit has been quite dysfunctional for me the past hour or so.


Really? I feel like that's been going on for about 5 years or so now


Yes, unable to log in. Noticed it right away as the bright white glare blasted through what would otherwise be night mode.


Yes. Popular loaded, but my feed doesn’t. Just did, but was short.


Same. I thought it was my WiFi at first.


Amazon JUST had an ec2/RDS failure in one AZ in Tokyo last week; the cause was a bug in their HVAC that led to overheating. I wonder if this is similar or just coincidental.

https://aws.amazon.com/jp/message/56489/


US-East-1 was the first, IIRC. Lots of early adopters have grown significantly but have various hardcoded assumptions about running there.


The Spinnaker project is looking more appealing with every outage. Outage detected in X provider in Y region? Deploy infrastructure to Z provider in Y region.


Or just deploy to us-west-2 (or maybe us-east-2) in the first place and then enjoy not having to do anything every time there’s an outage in us-east-1, which is where pretty much all the AWS outages are.

And if there is an issue for some reason failing over to another AWS region is a lot easier than having to fail over to another cloud provider. Outside of a very small number of cases building for multi-cloud is a lot of unnecessary work.


This outage is only affecting a single availability zone, so taking on the complexity of multiple cloud providers would not be necessary to be resilient against it. AWS best practices would already have covered you.


Where does it say it's a single AZ?


Every "more" drawer/dropdown

> We are investigating connectivity issues affecting some single-AZ RDS instances in a single Availability Zone in the US-EAST-1 Region.


AWS best practices involve using AWS for everything, which is not actually a best practice.


Deploying a database with 5 TB on disk and 50k data-modifying transactions per second to multiple providers isn't exactly trivial.

Who cares about stateless, that's a solved problem.


The volume and storage numbers you quoted call for breaking up a monolithic system. We have a similar overall volume of transactions and storage needs to yours but we reside in all US regions, tens of AZs. If one AZ went down, a small fraction of customers would be affected. True, we have increased our exposure to an outage happening somewhere but we improved overall business continuity. Simply put, dealing with 5% of pissed customers is easier than 50% or all of them calling you at once.


From their Multi-Cloud section on their homepage:

> Deploy across multiple cloud providers including AWS EC2, Kubernetes

Kubernetes... where?


Probably self hosted.


Lol, or just architect your applications to be highly available and fault-tolerant to regional failures if that's your business requirement.


I'm surprised by how much of the "internet" seems to be affected by a single AZ going down.


We wouldn't have this problem if people just used application-layer protocols and federated services like the early internet.


Wait, why wouldn’t we have these problems? Back in the 1980s, if a university campus connection goes down, you can’t telnet in or read your university POP2 email remotely. It’s down.

The only difference between then and now is that we’re online (seemingly) at every waking minute expecting a hundred different services to be functional at any given moment.


Modern services such as reddit and Twitter effectively usurp the role that Usenet/NNTP and similar distributed protocols used to fulfill, but without the advantage of decentralization / lack of large single points of failure that such protocols embraced. That's what I was getting at, and maybe I'm full of shit.

In the 80s if a university campus internet connection went down, only that university was affected. Now, when a single AWS availability zone goes down, a much wider swath of users is impacted. Such consolidation / centralization shows a disregard for the spirit of the early internet and design considerations that went into it.

Again, maybe I'm full of shit. Lots of people here seem to think so.


us-east-1 continues to have worse uptime than other regions (likely for good reason, too: it continues to be the default region).

I've avoided that region and I can't remember the last time I had downtime caused by Amazon.


Also, it is one of the regions that gets new features first, which makes me wonder if it contributes to lower stability.


This is not true. The region where new software is deployed first is different team by team (or service by service).


Perhaps, but I can't recall a product launch that wasn't available in us-east-1 from day 1.


I believe us-east-1 is one of the regions included in the minimal set of regions for a new AWS service to be considered 'available'. If I recall, eu-west-1 is another such region.


us-east-1 is the most likely to have any given new service first.


Is it still the default region? The last two new AWS accounts I provisioned defaulted to us-east-2.


Leaseweb Virginia is having a major outage as well. Maybe it is related?

https://www.leasewebstatus.com/incidents/updated-connectivit...


Copy that. Happy Labor Day weekend everyone.


It's been 2 hours and they still don't have a red flag on https://status.aws.amazon.com/


Cognito went down completely a couple of months ago (started returning rate limited to every request) and despite our contacting AWS to see if there was anything going on (and their confirming that there was) they never updated the status page. The way we got updates was by calling our AWS contact.


This seems to affect a broad swath of the internet, perhaps because the us-east-1 region is so popular? My side project StatusGator shows approximately 15% of the status pages we monitor (including our own) with a warn or down notice right now, a sizable spike over the baseline.


>We are investigating connectivity issues affecting some instances in a single Availability Zone in the US-EAST-1 Region.

Well there’s your problem, people. Use multiple AZs.


Easier said than done if that means synchronizing a database and a filesystem that are heavily written to.


Depends on what you use. RDS can span AZs and failover in events like this.


AWS always understates the scope on the status page.


Curious. Lambda not affected. EC2 being physically tied to a box does introduce extra risk I hadn't thought of.


Lambda would be just as affected if you were running inside of a VPC [1] and you ignored the multiple warnings about setting up your lambda to run in only one AZ.

[1] Technically your lambda never runs "inside your VPC", but it's a colloquialism that everyone understands.
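
Concretely, heeding that warning just means giving the function subnets in more than one AZ; a sketch (function name, subnet and security group IDs are placeholders):

  import boto3

  lam = boto3.client("lambda", region_name="us-east-1")

  lam.update_function_configuration(
      FunctionName="orders-api",            # placeholder function
      VpcConfig={
          # One private subnet per AZ, so ENIs can still be placed when a
          # single AZ has problems.
          "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"],
          "SecurityGroupIds": ["sg-0123456789abcdef0"],
      },
  )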


This is a pretty good common-sense post on not having your failure modes correlate with your clients' failure modes.

https://trackjs.com/blog/separate-monitoring/

I don't work for any of the entities mentioned.


Had an app doing fine until about 12 minutes ago, when Heroku tried to move it to a new server. Alas.


Setting dynos to 0 and back got us up, though.


For folks here, my RDS instances in us-east-1f are doing okay (knock on wood!) Not sure which AZ is suffering most.

My client's Heroku instances are online, thankfully.

Can anyone here speak to their experience with the Ohio region? I'm considering leaning on that more and more.


Your us-east-1f is not the same one as on other accounts. The letter is randomly assigned to the AZ to spread load.


Interesting, never knew that. I guess that is why the announcements never explicitly pointed to a single AZ by name.


Essentially everyone picks "a" because it's the first az. There's some internal mapping to "your az a is actually datacenter q". You can kind of figure out which AZs match across accounts if you've got enough accounts you can send traffic between.

I've been told that the "a" AZ you get was the least populated at object creation time (ie the first time you make an object that lives in an az), but I don't know how valid that is.


Aside from spinning up an EC2 node in each AZ and doing ping or tracing tests I wonder if there is a quick-n-dirty way to map AZ’s between different AWS accounts. I’ve never had to approach that scenario (cross account, low latency requirements) but in the future I’ll keep this in mind.


Go to the Subnets tab in the VPC console. You'll see the actual AZ IDs there, vs. the 'random' lettering.


Is there no way at all to reach Amazon EC2 instances in us-east-1 or is just the default route to the internet broken?

Is there any way for the owners of the instances to reach them?


Is this why Reddit and Duolingo weren't working properly? I've had issues since 9pm Sydney time so about 4 hours or so.


I remember reading about how not all AWS regions are similarly operated and that one was a snowflake. Is it US-East-1?


Yes. us-east-1 is the first AWS region Amazon made publicly available. It's also historically used by a lot of customers as "default region" where they launch all workloads where they don't have special needs of launching them somewhere else.

That has led to us-east-1 being the largest AWS region by far, comprising the largest number of availability zones (6) of any AWS region.


OK, so it is this one. I was talking more about how the region itself has features and exceptions/quirks that differ from other AWS regions. Basically a quirks-mode region, with differences that may or may not impact you at some point in time. Or you have special needs and us-east-1 is the only region that has the special non-standard ability you want to use.


Has anyone else noticed that there seems to never be outages in us-east-2 and somehow everyone keeps putting instances in -1?

Why?


For us, it's mostly a matter of historical convention. Our entire stack currently lives in -1 (we've had instances there for ~5 years now), and to move to a different region under these pretenses is a bit of a pain in the ass for us considering how transient the impact of these things has typically been to our business.

If we move anywhere, it's going to be completely out of AWS and onto on-prem or some bare metal provider. Hopping regions hoping to win at some reliability metric game is not a good way to run a business IMO.


us-east-2 has had a 2-3 hour issue with its internet routes that affected all of us-east-2. So far in 2019 us-east-2 has been fine for me. My colo has had only three 5-minute outages this year. Keeping the cloud up is tough if even AWS can still have large outages.


they are just less impactful since it's not used as much... see 4 months ago: https://news.ycombinator.com/item?id=19820302


Because the outages there don’t make the HN frontpage.


Funnily enough Heroku in Europe also seems to be malfunctioning. Cannot deploy my app for at least an hour now.


I'm in Australia and Reddit/Twitter ground to a standstill - request timeout after request timeout. I presumed it was an outage somewhere but was surprised to learn it was with AWS us-east-1? I would have thought surely that my connection would have referenced a different region based on my location.


Usually, DB servers will live in a small number of locations with good connectivity between clusters, while the frontends (which terminate the user's TCP connection) live much closer to you (likely Sydney). Good design means there are few round trips between the FEs and the backend, but they are not entirely avoidable.

Designing truly resilient and available applications with DB servers that replicate across continents is hard.


Is true master-master replication across continents even possible?

I guess partitioning can help, but then isn't it just turning the DB servers into pizzas of master-slave where the Hawaiian slice is master only in Hawaii, and slave everywhere else?


Yeah, you're gonna hit CAP hard at that distance.


That must be why reddit and twitter are failing on me.


This leads me to believe it’s more than a single AZ failure, despite what AWS is reporting. Not having multi-AZ, auto failover or replication doesn’t seem like a thing Reddit or Twitter would skip out on.


It looks like it was localized to zone D.


Zone designations are account specific; zone D for you is not zone D for me


The affected AZ appears to be use1-az6. You can map "your" AZ name (us-east-1c, us-east-1d, etc.) to the actual, canonical name of the AZ in the 'Subnets' tab on the VPC console.


What makes you say use1-az6 is the culprit? I only ask because none of our workloads in az6 have experienced any issues. ....yet. We run critical workloads across 3 AZs thankfully, but still.


What's the best way of figuring out which zone it could be in my account?


Running

  aws ec2 describe-availability-zones --region us-east-1 --query 'AvailabilityZones[].[ZoneName,ZoneId]' --output text

which prints each of your account's zone names next to the canonical zone ID (e.g. use1-az6).


Aha. Experienced some NPM lag too.


Is that why XDA Developers doesn't work?


My little instance died and I had to bring it back from the image.

Glad to know that it wasn't anything personal over any Hacker News gags I've done.


Well, this outage says something about the companies that religiously depend on it.

If your entire service just went down as soon as this happened, Congratulations! You didn't deploy in multiple regions or think about a failsafe/fallback option that redirects from your affected service or instance.


Very few companies or systems need near-perfect uptime. Multi-region cloud engineering, especially once data is involved, is incredibly expensive. If you do need that kind of resiliency, you usually engineer it for just a very specific component rather than the entire system.

An outage like this happens how often?

Edit: Looks like this is affecting a single AZ... so bit different situation, but I would agree if you're not capable of surviving a single AZ outage in 2019 then your engineering team should be replaced.


> your engineering team should be replaced

My engineers are all React and CSS web developers. They don't know anything about multi tenant data resiliency. But they can make a real pretty "system down" page.


> Very few companies or systems need near-perfect uptime. Multi-region cloud engineering, especially once data is involved, is incredibly expensive.

Without data it doesn't even cost significantly more than running in a single region nowadays, if you are willing to go serverless. As serverless stuff (FaaS, ...) is pay for what you use and the provider handles the scaling automatically behind the curtain you can easily deploy to multiple regions without much additional cost.

With data you have of course the cost of storing the data multiple times in the different regions (or to come up with some kind of sharding) and solving the consistency challenges that come with that, but at least services like DynamoDB and S3 offer cross-region replication out of the box nowadays and you don't have to provision any capacity like you used to (thanks to DynamoDB AutoScaling and so on).

Once you have your application running in multiple regions you can direct users to the closest one, so they enjoy lower latencies.

I believe for a lot of applications running cross-region just makes a lot of sense as it offers various benefits.
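
For example, the DynamoDB cross-region replication mentioned above is a single API call once the per-region tables exist; a sketch (table name and regions are placeholders, and it assumes the tables share a key schema and have streams enabled):

  import boto3

  ddb = boto3.client("dynamodb", region_name="us-east-1")

  ddb.create_global_table(
      GlobalTableName="orders",              # placeholder table name
      ReplicationGroup=[
          {"RegionName": "us-east-1"},
          {"RegionName": "us-west-2"},
          {"RegionName": "eu-west-1"},
      ],
  )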


Or the PM team that wouldn't let the engineers do it right needs to be replaced.

Always CYA, guys... you will pay for this if you don't.


Most of the connectivity issues we were seeing were with instances in one of the us-east-1 AZs, but we were seeing issues in other AZs in us-east-1 as well. Not sure why AWS is acting like this issue is only affecting one AZ.


For many sites, hours of downtime every few months is not critical. If the cost of downtime is less than the cost of reducing it, don't bother.


Yeah, but you have to weigh the cost of multi-region deployments and failsafes vs the cost of downtime. For smaller shops downtime may be acceptable.


It's not an issue with an entire region; it's an issue with a single AZ in a region. If you did the bare minimum (set up your RDS with a standby in a separate AZ, run your servers in an autoscaling group configured for separate AZs, even with just a min/max of 1, and use services that are multi-AZ by default, which is almost all of the managed services), you could still be up.


Nope, our EC2 and RDS instances are multi-AZ, still got affected. Might look into being multi-region though.


Multi AZ would have also sufficed. This appears localized to one zone, not the whole region.


Several people report problems with multizone deployments, so it seems like AWS is downplaying this.



