Hacker News
AWS down again? (amazon.com)
271 points by raffraffraff on Dec 10, 2021 | hide | past | favorite | 240 comments


I expect better of the community here. All it takes is a chance to take a cheap shot at one of the “big boys” and then all of a sudden the weasels come scampering out of the woodwork.

Seriously, to those commenting “oh boy! Time to rethink this whole cloud thing!”: you’re either so new to this stuff that you don't remember the days before cloud, you’re trolling because you’re high on nostalgia for the good ol’ days (perhaps with some “member berries” https://youtu.be/mPs8-ZBSjok), or you’ve just straight up forgotten what it takes to build and run your own infra from the ground up.

And to those who happen across this comment before posting your own: please contribute something to the discussion beyond the above. Thanks.


Well, at least you said please. However:

I have built and run my own infrastructure from the ground up, and I've been made to transition to the cloud.

The experience hasn't been great. It may simply be sour grapes because, after all the expertise a whole generation has built up learning UNIX and all the internet protocols (DNS, ARP, email, reading RFCs, networking, routing), we get told that all that old stuff is just 'legacy' and that we should retool around Amazon's proprietary services instead.

Some of us old timers argued against this only to be shouted down by people who don't even understand TCP/IP.

The current generation couldn't invent the internet. You know why? Because they would never have the patience to spec it out like the old-timers did. Go read a few RFCs and try to imagine a scrum team today putting as much thought into an up-front design.

Today we'd just cruft together an MVP, solve only the interesting parts (or more likely the easy parts), and then move on, letting dashboards that lie cover it up.

Many of us have been taking shots at those 'big boys' since the start of this trend.

Now that, because of recent events, we have a chance to be heard, you're telling us to be quiet. Why? What are you afraid of?


> The current generation couldn't invent the internet. You know why? Because they would never have the patience to spec it out like the old-timers did. Go read a few RFCs and try to imagine a scrum team today putting as much thought into an up-front design.

This is a staggering bit of revisionism. As someone who was around at the time, I remember that many RFCs were written based on already-working code. They had some advantages: nobody much cared what they were doing, so they didn't have to answer to multiple levels of management, and they clearly gave almost no thought to security from bad actors, but if you think there aren't people today--including at those very cloud providers you disdain--doing work at least as well-thought-out as those early pioneers, you haven't been paying any attention at all.

I'll stay off your lawn, but maybe take off those rose-colored glasses and stop pretending the past was rosy.


That's an interesting position to take, because I imagine if a team of developers decided to invent the internet in a vacuum today, it'd be a hell of a lot more secure than the "let's hope nobody uses this protocol maliciously" attitude prevalent in the early days of the internet.

Not that that is a bad thing, but just something to think about.


It wasn't an attitude of naive hope that there wouldn't be nefarious actors leveraging the protocol; it was a different kind of person using the Internet back then.

There's no need to design security into the system when you literally know everyone who is using it. And everyone who was using it had the same goals in mind.

So, I don't disagree with the sentiment -- people today would probably do it a little bit differently; however, I do disagree with the expression -- people designing these protocols weren't naive. They were trustful because they had to be.

In the early days of building something new, nothing works without trust; not the Internet... not Bitcoin... not a nascent venture... nothing.


While I don't disagree, if people at the time had assumed that everyone on the network could be trusted (forever), why design the IPv4 address space to make room for 4 billion devices? Why support so many ports and concurrent connections? The two assumptions don't quite match up.


> 4 billion devices? Why support so many ports and concurrent connections? The two assumptions don't quite match up.

Because when TCP and its predecessors were invented, there were only a few computers in the entire world. The initial ARPAnet had only 4 hosts (by September 1973 there were 42 computers connected to 36 nodes).

But each computer had many users. That's why there were so many ports: the thinking was there would be big computers with many users, each running their own internet-connected clients and servers.

That was true even at the beginning of the 1990s; when I went to high school, I had access to a Unix system shared between 2000+ people.


To expand on what's being referenced here, consider the following: video game speedruns.

Throughout the 80s, 90s, and early-to-mid 2000s, there was a certain level of trust in the claims people made about PBs (Personal Bests) and WRs (World Records). There was no practical way to record, host, or especially upload literal hours of footage (VHS footage) of a run you did. Even if you did somehow achieve all of the above, it would be a grainy, low-quality video that was hard to see, maybe with a stopwatch nearby so people could verify your claim. People would be watching this through RealPlayer, if they could watch it at all!

So what do you do in such a situation where people have no practical or easy means to verify claims? You build credibility off of how active you are with other members of the community. You post and comment on forums about what strategies you're trying, what difficulties you're dealing with, and what new information you might have uncovered through trial and error. You don't prove your work, you prove your worth. Your standing is evidence of your claim.

To me, this is a great example of the "personality-credit" communities that have existed online, Usenet and BBSes aside. The mentality has largely faded away with improvements to bandwidth and services like Twitch and YouTube, but considering the technological challenges someone in, say, 1993 would have faced in trying to prove they had just set a new record really gives a glimpse into what things used to be like.


People do think about it. RFCs are still being written and revised by people daily. IETF, W3C, and others are publishing new standards like QUIC. We understand the risks a lot better now than when those original RFCs were written because the world has changed a lot since then, and in no small part because of them.


In the early days of the internet it was a closed network of academic and government properties. Nobody at that time would have guessed it would grow into even the 80s-style internet, let alone what we have now.


"Secure" as in "security by closed source and obscurity, because, hey come on, we need to upsell you on 'enterprise' features"?

Yes, of course. Not really any different from the Internet we have today, though.


> we get told that all that old stuff is just 'legacy' and that we should retool around Amazon's proprietary services instead.

New cloud products are targeted at new companies and services.

If a company has already invested all of the R&D into building self-hosting and they've got it running properly with well-defined and measurable economics, it doesn't really make sense to upend it all and rebuild in the cloud.

But for new services, embracing hosted platforms is a proven accelerant for development. Skip past the solved problems like hosting, get straight to work on the interesting problems your business is trying to solve.

> The current generation couldn't invent the internet. You know why? Because they would never have the patience to spec it out like the old-timers did.

Oh please. This is just "back in my day" ranting about "kids these days" and how you think one generation is superior to the other. Give it up.


Hm, my understanding/memory of the history of internet technologies has a lot more "let's try this and see if it works" in it than your comment about the role of speccing things out and up-front design would suggest. I think there were actually quite a lot of people in labs just doing things, coordinating informally with people they knew personally in other labs.

Yes, there were also, at various points, specs and big-picture thinking, sure.


None of that stuff is legacy. It's just centralized. Economies of scale. Go work for an infrastructure provider, the same way recruiters largely work for recruiting firms instead of all shops having their own in-house.

The fact that the people who could invent the Internet mostly work for a few giants doesn't mean they no longer exist in the current generation.


(Assumedly) much, much younger person here who has spent the majority of her career on-prem, has lifted multiple shops to AWS, and right now works full-time on an all-AWS stack.

Your experience with UNIX isn't worthless, but it's worthless to anyone who's working "further up the stack" than you. If you feel like your skills are degrading, then you need to find a job somewhere that's actually building infra that shops will build on top of. Your skills aren't day-to-day anymore, and that's a good thing. Your generation made all that junk turnkey, the same way you probably think about dealing with Ethernet frames. Taking my first AWS job literally obsoleted all my systems knowledge -- a super humbling experience -- it wasn't completely worthless, but none of the problems I'd spent years working out solutions to even existed anymore.

You're confusing people building products with people building infrastructure -- the devops role makes this messy because you're usually "using" infra tools like a dev rather than building them. If you're working on foundational elements, then literally nothing has changed. If you're shipping products, then absolutely retool around a cloud provider; the infra isn't your secret sauce, and if you have to move back on prem because of cost, that's a good problem to have.

I mean this completely sincerely, take a job at a company that's providing hosting/services to people. All the old timers with deep deep systems knowledge are gods.


> You're confusing people building products with people building infrastructure

Even though I'm on the side of the "UNIX graybeards" here, this is a super-great point. We do need to recognize that it is a great time to be building networked applications, precisely because the younger ones these days don't need to understand TCP/IP or anything else related to infrastructure.

I confess that I got caught up in the hype as well and built quite a few "multi-AZ" apps that I thought would help me get to five nines that much faster. (For non-cloud folks, that's 99.999% availability, which was something to pursue before the cloud.)

Of course, when those abstractions break, those same younger ones are completely helpless, and my single-server apps have been running non-stop for years at traditional providers except for a few minutes for reboots following updates. I've never had a multi-hour outage, especially one that's completely out of my control where I can only point a finger at AWS and say, "sorry, it's not my fault."


This!

Not to mention the amount of garbage in the cloud, and the constant learned helplessness we have to endure even knowing that the situation could have been avoided, or at least mitigated or solved, if access to the box were possible.

The status-quo of the cloud is uninspiring to say the least...


I use "cloud" services, but I ensure that my systems continue even if they fail. If AWS is down, maybe I can't do some analysis on historic logs until it comes back, that's a known failure

I look at the components of a system and think "what happens if this is turned off // breaks in an unusual way // goes slow", and ensure that the predicted effects are known and acceptable based on the likelihood of failure.

That's the same whether it's an AWS managed DNS service, storage bucket, or a raspberry pi on my desk. As a systems engineer I know what that component does, what happens when it doesn't "do", and ensure the business knows how to work around it when it breaks.

If your business can't cope with an AWS outage (even if it's not as efficient) then you've got problems.

Plan for failure, and it doesn't take you by surprise.
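
A minimal sketch of that mindset, with a hypothetical analytics dependency and URL: the caller keeps working when the dependency is off, slow, or broken, and the degraded result is an explicit, known outcome rather than an unhandled exception.

  import json
  import urllib.error
  import urllib.parse
  import urllib.request

  ANALYTICS_URL = "https://analytics.internal.example/query"  # hypothetical dependency

  def fetch_report(query, timeout=2):
      """Ask the analytics backend for a report; degrade gracefully if it's off, slow, or broken."""
      url = f"{ANALYTICS_URL}?q={urllib.parse.quote(query)}"
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              return json.load(resp)
      except (urllib.error.URLError, OSError, ValueError):
          # Known, accepted failure mode: the report is unavailable until the
          # dependency comes back, but the rest of the system keeps working.
          return {"status": "degraded", "detail": "analytics backend unreachable"}

The shape of the fallback is the business decision; the try/except is the easy part.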


This reminds me of something a volunteer at our campus observatory told me once while I was in college. I don't recall what we were discussing, but their comment stands out to me today: "Don't engineer things to work. Engineer things not to fail."

Often at work I see code implementations that "work" in that they usually work but can fail. I'm not a great engineer by any means (I'm actually an economist who stumbled into code), but I believe one of the reasons I've been able to gain a good reputation at work is precisely because I design with that principle in mind.


This conflates two very different things: cut-throat KPI-driven development practices at companies like Amazon, and the entire current generation of software developers. You're taking problems caused by the environment and attributing them to a moral deficiency in the people, which is neither fair nor helpful.


Environment (in general) is what determines skill set, not the other way around. If the interview process makes you focus on leetcode skills, and your manager focuses on LOCs produced and hitting story milestones rather than on spending time integrating with the team and learning the legacy codebase, it makes sense that those who come out of this environment would be less prepared to tackle certain kinds of problems.


> The experience hasn't been great. It may simply be sour grapes because, after all the expertise a whole generation has built up learning UNIX and all the internet protocols (DNS, ARP, email, reading RFCs, networking, routing), we get told that all that old stuff is just 'legacy' and that we should retool around Amazon's proprietary services instead.

Losing skills you worked on for years is just part of this space. We are continually building on new abstractions so that we can focus on building solutions.

This really feels like a "kids these days" rant. Scrum doesn't mean you can't do up-front design.


The entirety of Stack Overflow runs on something like 4 machines. Abstraction layers are expensive, and having to learn scaling methodologies when a better up-front choice of technology would render them unnecessary is very un-agile.


Stack Overflow doesn't run on just 4 machines. Even in 2016 it required significant hardware:

- 4 Microsoft SQL Servers (new hardware for 2 of them)
- 11 IIS Web Servers (new hardware)
- 2 Redis Servers (new hardware)
- 3 Tag Engine servers (new hardware for 2 of the 3)
- 3 Elasticsearch servers (same)
- 4 HAProxy Load Balancers (added 2 to support CloudFlare)
- 2 Networks (each a Nexus 5596 Core + 2232TM Fabric Extenders, upgraded to 10Gbps everywhere)
- 2 Fortinet 800C Firewalls (replaced Cisco 5525-X ASAs)
- 2 Cisco ASR-1001 Routers (replaced Cisco 3945 Routers)
- 2 Cisco ASR-1001-x Routers (new!)


Compare this to the many companies spending 6+ figures on AWS, and ask which has more traffic.


IMHO these are entirely different skillsets, and a division-of-labor question rather than some sort of insurmountable generation gap. It's not like infrastructural know-how isn't relevant anymore; they just became CDN engineers or DevOps or senior scaling or reliability engineers. Their jobs are no easier than before, especially when you have to consider network traversals across layers of virtualized/containerized services across multiple data centers owned by disparate parties and maintained by different vendors.

Virtualization aside, we haven't abandoned basic infrastructure, but centralized it in the hands of a few huge, expert providers. IMO this is a good thing, and was both necessary and natural as the Web grew to offer more and more opportunities to more new professionals. In detaching HTML from HTTP from ARP, etc. we gave rise to entire new professions like full-time UX (which arguably the Old Guard was never good at beyond a small audience of academics and engineers), or various flavors of front-end developer, or serverless ecosystems.

The Web and associated technologies advanced so quickly it was impractical for a single IT or network department to know all of it anymore, and some of the newer webapps wouldn't have been possible if that same team or company had to also manage all of their own basic infrastructure like it was the 90s still.

Now you can be a front-end only shop, or a UX consultant, or a network engineer who never has to touch HTML, or, or, or... maybe big enterprises always had and could always have all of those in-house, but the division of labor has been a huge boon for small businesses and startups and nonprofits, who just don't have the same resources.

As someone who grew up configuring zmodem and running BBSes and having to (mis)configure NetBIOS all the time, I am so, so glad I never have to worry about OSI layers and such ever again. It's boring to me, and the experts at it are SO much better at it, might as well let them handle it. Especially when the cost of that outsourcing is often like <$100/mo. Well worth both the time and money... and sanity. The division of concerns lets you focus on the things you're either interested in and/or good at.

Our professionals haven't gotten worse. The stack has gotten much deeper.


I see this as akin to how factory work has changed with modernization. In a modern, highly automated factory you need a much smaller number of highly specialized engineers to maintain the robots. In the analogy, these fewer highly specialized engineers are the ones who need to understand TCP/IP in depth and now run AWS. The rest of the workers, now replaced by robots, can move on to different productive work.


I mean, because you're wrong? Even with the recent outages, AWS still has far higher uptime and better support than anything someone could cobble together in their own small company. That's the advantage of using cloud infrastructure financed by billions of dollars.

> The current generation couldn't invent the internet.

Come on now, this is an overdone, lame argument and I can't believe you're seriously suggesting this. Do you also lament the fact that kids these days can't bind their own books? The point of tools is to be built upon, not to sit around marveling at your own genius. If you build a good tool, the folks that come after you don't have to think about it. That's how you make progress.


> The current generation couldn't invent the internet.

As an old fart myself, this is a very cheap argument. The old generation wasn't superior. The old generation couldn't invent the transistor; how useless we are!

You always move up the stack as tech progresses, and that's a good thing.


> Now that because of recent events and we have a chance to be heard

Is there anything about this particular incident that is new that contributes to your position?

Maybe we run in different tech circles, but I feel like the "on-prem vs. cloud" debate has been litigated fairly extensively here and elsewhere. In fact, as you said yourself:

"Many of us have been taking shots at those 'big boys' since the start of this trend."


Interesting. The infra engineer at my previous company fits your description, yet he was the one pushing for our AWS migration.


Your own lack of knowledge about how to make the cloud work properly doesn't mean it's completely useless. The "old school" knowledge is still very useful in building and troubleshooting cloud-based infrastructure. You're creating a false dichotomy.


False dichotomy, if you say so.

I haven't found it too helpful when dealing with AWS and the serverless trend, whose popularity is really just based on price and economics, not technical superiority.

Serverless turns every simple system into a distributed system with the number of failure modes now multiplied by ten.

That sounds like fun.

I do know how to use serverless, for the record; I just think it's an overhyped, overpriced waste of time.


You just moved the goalposts from "the cloud" to "serverless". Anyway, this discussion is full of overgeneralizations that are not useful. I'm abstaining from further comments.


I'm not moving anything, just augmenting my point. Reread my comment and do s/serverless/cloud/ if you want, I still stand behind it either way.


[flagged]


You're just trolling. "The cloud" doesn't exist as a single entity. I'm sure people in various other AWS regions, and all of GCP, Azure, and DigitalOcean, are working just fine.


Y'know, debates about cloud vs in-house uptime aside, there's one thing I'm really grateful to the cloud vendors for: making downtime somebody else's problem.

Rather than a late-night panicked run to the smoking server, now we can just shrug and wait a few hours and it fixes itself. Most websites aren't that critical, and it's nice not having to lose sleep over devops issues.


I mostly agree, but this is a double-edged sword. I recently had a 6 hour outage due to lots and lots of waiting on a fix from our managed database provider. Had we been running the DB ourselves, I would have had the permissions to monitor the DB and caught the issue early (it was a pretty trivial thing to monitor). If we had gone down, I would have been able to run my restore within an hour, rather than 6. And lastly, self-hosted performance was much better than it's been with a managed DB.


I find most cloud users still have to panic at 3am when there is a cloud outage because usually there are knobs to tweak in their application to mitigate the outage. For example, do a region failover, turn off some feature that depends on dynamodb, or push an emergency release to make the application server handle spurious 403 errors from an Amazon backend.

Since Amazon's failure modes are so varied, it's impossible to make the above tweaks fully automated. There'll always be a new way the cloud can misbehave, and often you can at least maintain partial service by being nimble enough.
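
To make one of those knobs concrete, here's a rough sketch (all names hypothetical) of a kill switch for a DynamoDB-backed feature plus a retry wrapper that treats spurious 403/5xx responses from a backend as transient:

  import os
  import random
  import time

  def feature_enabled(name):
      # Hypothetical kill switch: operators flip an env var (or a config service entry)
      # to turn off a dependency-heavy feature during a provider outage.
      return os.environ.get(f"FEATURE_{name.upper()}", "on") == "on"

  def call_with_retry(fn, attempts=3, transient_statuses=(403, 500, 503)):
      """Call fn() -> (status, body); retry with backoff on statuses treated as transient."""
      for attempt in range(attempts):
          status, body = fn()
          if status not in transient_statuses:
              return status, body
          if attempt < attempts - 1:
              time.sleep(2 ** attempt + random.random())  # exponential backoff with jitter
      return status, body  # still failing; the caller decides how to degrade

  def recommendations(user_id, fetch_from_dynamodb):
      """Return recommendations, or an empty list when the feature is off or the backend is flaky."""
      if not feature_enabled("recommendations"):
          return []  # degraded but functional: skip the feature during the outage
      status, items = call_with_retry(lambda: fetch_from_dynamodb(user_id))
      return items if status == 200 else []

The point isn't the few lines of code; it's that someone has to decide, per feature, what "degraded but up" means before the outage happens.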


If you're going closer-to-the-metal cloud, then yeah. I think that's part of the allure of newer, more abstracted cloud hosts like Vercel, Netlify, Gatsby, etc.

They take care of all the infrastructure for ya, and if something breaks, you just twiddle your thumbs while they twiddle the dials.

Of course this sort of laziness is not a valid approach if you're a huge enterprise or running some life-saving project that needs seventy nines of uptime, but for non-critical websites, it's a great sigh of relief...


>making downtime somebody else's problem.

It's even more than that: it makes responsibility for downtime somebody else's problem. You can now simply point to "AWS is down", "AWS is slow", "AWS is causing errors" -- and there is nothing we can do about it :). And as long as management knows everyone is having the same problem, they are perfectly fine with it.


Do people not take responsibility for their choice of vendors? If Boeing crashed a bunch of planes and said "Not our fault, our subcontractor made shitty software", would we accept it? If a startup leaked all their customers' data, would we accept "Blame this other company, we just gave them all our data for analysis" as an excuse?

If you accept that, then it means we need to investigate and scrutinize the whole supply chain of any product before trusting it, and it would also mean that any company switching even the smallest part of their supply chain would require us to re-evaluate their product.

When we buy things (whether products, services or software), we consider the company selling it responsible for the whole, and deferring responsibility to a vendor is basically just saying that you were either promising more than you knew you could deliver or that you did not do a good job picking vendors.


It used to be "No one got fired for buying IBM." Now it's AWS. Many CTOs are very risk averse, and having someone to blame is incredibly valuable to them. My company has a large multi-region set of datacenters, but we're using AWS for a lot of new projects, despite being able to deliver similar services with better performance and cost, on-prem.

Some of the reasons: no one wants to be seen using old tech and end up stuck as the last COBOL dev. So there's a big incentive for developers and their managers to be using the latest tech; it looks good on the dev's resume and grants them job security in the field, and it makes their managers look like they're sharp and on top of new technology.

Also, recruiting people familiar with AWS services is often easier than finding someone versed in on-prem tech. And for companies in less than desirable locations, this helps remove a staffing issue since it can all be done remotely.

And when something goes wrong on-prem, the CTO is expected by the rest of the C-suite to fix the issue. When it's a reputable 3rd party, he can absolve both himself and his team of responsibility.


> he can absolve both himself and his team of responsibility.

Everything up until this sentence makes sense in a twisted sort of way, but that's the part I don't get. When did we absolve companies from their choice of vendors? If I buy a service and that service does not work, then the company selling that service does not get to blame the vendors they chose, just like they don't get to blame the individual devs they hired.


The company trusts the CTO; the CTO trusts AWS, and if AWS goes down, he can deflect responsibility onto AWS. The other C-Suite officers are largely ignorant about tech (that's why they have a CTO), and nod their heads.


>When did we absolve companies from their choice of vendors?

>just like they don't get to blame individual devs that they hired.

When the vendor becomes a symbol, gains de facto public utility status, or is practically the only choice. And this isn't a tech-company thing; it applies to every other industry as well. You can fire a dev and hire another. If you are looking at cloud vendors, you only have AWS, Azure and GCP to choose from. For lots of reasons the latter two may not be an option, due to competition or features. And AWS is the best you can get; are you sure the devs you have are the best you can hire?


> a symbol, a public utility status or practically the only choice

The first I can agree with, but the latter two are definitely not the case. If they were a public utility they would be regulated as one, and if they were the only choice there would not be companies online during these outages. The site you are writing this on is an example: an online service with a pretty large (although not massive) following that is online enough to be the place where all the people who work at companies that depend on cloud services come to discuss their outages.


>Public Utility

It's not about whether they are a utility in the business sense, or in the political sense of something that has to be regulated. They are utility-like because people treat them as one and they behave like one. Many things go down with AWS, just like many things go down with the grid.

HN is comparatively tiny and simple. It doesn't require any of the features on AWS, and has no reliability requirements like a SaaS does. HN doesn't even use a CDN. Even though I am not a cloud advocate, in 99% of cases it makes zero sense to build a SaaS on your own servers.


Life-critical industries like airlines, cars, healthcare, etc. are probably the exception?

If a restaurant ran out of some menu item for a few days because a supplier didn't have a good harvest, eh, it happens. If a bookstore doesn't stock a certain book, no big deal. Grocery store self-checkout broken yet again? Oh well. Band got too drunk and can't perform? Reschedule it. No snow at the slopes? Climate change.

In the real world, shit happens, and most of it isn't critical.

If you offer a 99.999% guarantee and your customers pay for it, well, you better deliver -- subcontractors or not. But many businesses don't need to do that and a few hours of downtime a year is just a minor inconvenience. To go from "a few days of downtime" to "a few hours" is really easy; just about any provider can offer that. To go from "a few hours of downtime" to "a few seconds" is a huge investment and not worth it to many folks.


> You can now simply point to AWS is down

Heck, if that was even possible... "everything is green" dashboards... :-)


Obviously, the answer is to pay for a cloud dashboard-of-dashboards that show orange lights when AWS's is falsely green.

"We need this new fancy dashboard to monitor our other dashboards in one place, more accurately!"


It's not "weasels" scampering out of the wood work.

This is a community of highly technical users expressing legitimate frustration and concern that the biggest player in cloud hosting is

  a) having a service-impacting event, and
  b) showing a status page that says everything is up
That's why people talk about the benefits of alternate hosting arrangements. If the status page said what was happening, there would be no need to ask "is AWS down?"; the status would be clear.


> oh boy! Time to rethink this whole cloud thing!

The problem isn't the cloud; the problem is all of these web properties going down at once, exacerbated by the fact that AWS charges a premium for any viable cross-region / multi-cloud architecture (given its relatively high egress fees). And that's not even counting the interdependence among AWS services themselves, some of which aren't multi-region (afaik).


> I expect better of the community here. All it takes is a chance to take a cheap shot at one of the “big boys” and then all of a sudden the weasels come scampering out of the woodwork.

Engineers like to build stuff. Historically, running your own infrastructure and configuring servers from the ground up were the building blocks of the internet. These new cloud services took away those building blocks and they had the nerve to charge for it.

In the real world, there are more interesting problems to solve than setting up and maintaining your own servers, so it isn't really something to complain about. It's actually more fun to get the hosting stuff out of the way and focus on the problem at hand.

HN comments are basically notorious for exaggerating the downsides of hosted services while overplaying the ease and benefits of DIY. A lot of the comments here are similar to the famous Dropbox comment thread where users couldn't understand why anyone would want to use Dropbox when they could simply set up a complicated self-hosted rsync contraption to sync their own files.


Serverless is just a buzzword for a container on a virtual machine on a server. The cloud is just a buzzword for a company with a data centre selling virtual servers carved out of other, bigger servers.

> You're either so new to this stuff that you don't remember the days before cloud

I do, and it was much more peaceful.

Everything you can do "in the cloud" you can do in colocation. Which in the long run is cheaper, more secure* and it's yours! Including the data.

There are caveats: networking and component failure. Invest in these and you can have a pretty kingly setup.

The cloud enabled orders of magnitude more email spam, brute-forcing, botnets, security vulnerabilities and much more. Operators are lazy and don't want to combat it. Providers are biased, but then again you can say that about any business.

People flock to the cloud like it's the greatest thing, when all you're buying into is an expensive price plan from a company that will happily knock you off their service if you somehow rub them the wrong way, and then charge you for the closed account.

I do laugh when something goes wrong for FANG. Partly I'm cynical and want to see the world burn, but it's also that these companies exploit their user base, their staff and the environment's resources. When Facebook locked themselves out of their own offices due to the BGP issue, now that was funny.

My colocation costs are:

$5000 covers three years of 1Gbit, 2U rack space in two different DCs, where I have full control, as many services as I desire, am allowed to host what I want, and where the internet address space is actually mine. It may be a small cube of the internet, but I know it's my network, my traffic.

Cloud is whitewash to me; I won't buy into it. It has its purposes, and if you're happy with it, fine. But for me, it's colo for life.


Preach it brother!

Also, you should consider renting out space on your cube.

You could even design APIs around the internet protocols so they can create their own DNS records, mailboxes. Wait...


I'd like to remind everyone of Uber's experience: no EC2-like functionality until at least 2018, and probably still none today. Teams would negotiate with the CTO for more machines. Uber's container-based solution didn't support persistent volumes for years. Uber's distributed database was based on FriendFeed's design and was notoriously harder to use than DynamoDB or Cassandra. Uber's engineers couldn't provision Cassandra instances via API; they had to fill in a 10-pager to justify their use cases. Uber's on-rack router broke back in 2017 and the networking team didn't know about it because their dashboard was not properly set up. Uber tried but failed to build anything even close to S3. Uber's HDFS cluster was grossly inefficient and expensive. That is, Uber's productivity suffered because they didn't have the out-of-the-box flexibility offered by the cloud.

Now, how many people think they can do better than Uber?


Most of those issues sound like management/process issues, not technical issues. Our staff can't provision AWS instances without extensive review...


You're correct that the knee jerk reactions are a bit over the top, but what is so interesting is how little the benefits of cloud are discussed here on HN. People who haven't used cloud native services think it's just EC2 and someone else's computer. But the idea of compute and data all being controlled through a single set of APIs, controlled by the same security syntaxes, is pretty incredible. Of course I'm a fan of the idea of on-prem alternatives, but the unified architecture is nice.


Disagree on two counts.

1) Like any utility, the cloud is predicated on the assumption that it never goes down. When AWS itself goes down, it causes knock-on effects for most mid-sized (i.e. 1-region, multi-AZ) services on their platform, so we can't have that.

2) If we're going to have to accept downtime despite (1), then yes, it IS time we rethink the cloud -- either how we can achieve no downtime, or how we achieve fault tolerance in some other way.


There is nothing wrong with pointing out the warts. I wonder if you've ever been at the mercy of Google/AWS and their lack of customer service when you aren't representing the business $$ of a Fortune 500 company.


I don't think anyone is saying they could do better than AWS at making their own infrastructure. I think the complaint is that it's a single point of failure for a huge swath of the internet.


[deleted]


There never was an outage. It was just the marketing home page at aws.amazon.com that was down for about 30 minutes, and everyone had a knee-jerk reaction of "haha, AWS went down again!!" Everything that mattered worked fine.


Thank you for saying this.


I think this adds some momentum to the pendulum swinging back the other way. Maybe cloud teams can patch your services better than your in-house team can (see the 2 critical issues in Azure in the last 3 months, _caused_ by MS itself).

Maybe the cloud has higher uptime than your on-premise infrastructure (see the AWS and Azure outages). Make sure to compare the actual outage time vs. the stats doctored by various political pressures and weaselly-worded SLAs (what do you mean you had an outage? Only 49% of your requests were failing!).


Even if you can manage more uptime on your own than through the cloud (which I doubt), being on the cloud means downtime is correlated with downtime of other services. That's usually a good thing.

Your customers will be more understanding if your outage is part of a wider outage that makes national news.

Any services you integrate with are likely down too. Two services with 99% uptime each, uncorrelated, together drop to about 98%. The combined uptime doesn't drop if the downtime is perfectly correlated.
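
Quick sanity check on that arithmetic:

  # Two services, each with 99% uptime. If their downtime is independent,
  # both being up at once happens about 98% of the time; if the downtime is
  # perfectly correlated, nothing extra is lost.
  independent = 0.99 * 0.99    # 0.9801, i.e. roughly 98% combined uptime
  perfectly_correlated = 0.99  # outages always overlap, combined uptime stays 99%
  print(independent, perfectly_correlated)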

Even if you don't directly integrate, your customers' workflows might. They see many services down and say, "cool, time to get caught up on laundry". If only you're down, it's more aggravating.


> That's usually a good thing.

For any individual company able to offload the blame, that's great. It's not so great if half the country's doorbells, robot cleaners, various home streaming setups, the baby camera, the fridge, the smart TV and your phone stop working... all at the same time.

In my opinion, the 'downtime' really should be measured as $NUM_SERVICES_STOPPED x $TIME, instead of just $TIME. And in this case I think any long Amazon outage is orders of magnitude worse than your regular old slow IT company's outage.


Wha? We host everything on our own hardware (which is nothing special) and haven't had any downtime this year (yet). And we're just another run-of-the-mill dev shop, very far from the "superstars" who work on these (supposedly extremely stable) platforms.


When I worked as an IT person for a cheap hotel back in the day, I set up a single Compaq PC as the only server (Samba NT DC, file sharing, IP masquerading, web/mail/DNS server, etc) in a closet. It was not on any battery backup, there was no modem. It ran for years without ever rebooting. In a city where power outages were normal.

Similarly, I've had EC2 instances run for years without ever being rebooted or going down. One of them's still running today after 5 years.

But none of those services were being used 24/7; if the internet went down for an entire weekend, or a hard drive was a little bit corrupt but kept running programs in memory, I probably would never have noticed. I've also had EC2 instances literally just fall off the map and sort of disappear, and had them manually replaced without notice by AWS, had virtual drives fail and corrupt, and had calls to services fail. And I've had my desktop's power supply get fried by a power surge.

Without a lot of experience, running systems seemed easy. But as time went on I learned that it can be easy, and it also can go down if somebody blows on it the wrong way. What we see as being reliable may just be chance. The only way to guarantee reliability is to expect that things are going to go down, and design and build it accordingly.

What AWS makes easy is they give you all the components for reliability, but you have to do the plumbing yourself. I'll bet you the people whose services went down did not properly design for reliability, as they were probably running in one region, in one set of AZs, and relied on distributed system operations that can fail, and didn't properly account for how to deal with those failures. One product I maintain on AWS did not go down, but another did.

Also, the more components a system has, the higher probability there is of failure. Big systems are actually more error prone than small ones.


A few questions I have regarding your in-house hardware:

1. How easily can I access your physical servers?
2. What happens if there is a catastrophic failure, for example a local power outage or major flooding?
3. How secure is your server? Are you regularly patching your operating systems?
4. If I want to run a specific project that requires double the capacity of your current hardware, how long is it going to take to get it spun up?


I'm running about 10 servers myself in production. They just do transcoding, so aren't mission-critical, but...

- Regular patching is automated and took about 30 seconds to configure using a script

Regarding running your own physical servers, that is a different ballgame, but for all of my projects, if I need to:

- I can pretty easily spin up VPSs / bare metal servers anywhere (Netcup, Linode, Hetzner, etc.) and provision there while I wait for new hardware to come in
- If you want to double the capacity of your current hardware, you'll have to order it and wait, but it's cheap (vs. the major cloud providers) to way over-provision if you're running your own physical hardware, so you can pretty easily have 2-4x extra capacity and still come out with extra money in your pocket

I host in the cloud, but I think people vastly overestimate how much it saves 90% of cloud customers.


There are many approaches which don't depend on AWS, and not all of them mean hosting your own physical servers, and they certainly don't mean you don't have an off-site backup policy.

There are well-understood answers to all your questions; they are not too difficult, they just cost money. Some businesses choose not to spend that money, some weigh cost-benefit and go for AWS, some decide to go for in-house servers, some go for hosted virtual servers, some go for serverless.

Why does everything have to be built one way?


> they just cost money. Some businesses choose not to spend that money, some weigh cost-benefit and go for AWS, some decide to go for in-house servers, some go for hosted virtual servers, some go for serverless.

I'd also say some choose not to spend the money, but fail to consider the cost of that choice.

For example: Doing old-school manual deployments that require herculean efforts to update at off-hours on the weekends burns people out, and makes it hard to attract new talent. In other words, you've made the decision to spend more money on finding and retaining people. But it's definitely way cheaper to pay for colocating a single Dell server you bought 3 years ago than what you'd spend in the same time on AWS.

And if your hardware never dies, paying for the redundancy might seem silly. A lot like paying for fire insurance despite the fact your house has never even burned down.


Your choice of straw man here is interesting.

Of the options I mentioned, you seem to have picked only self-purchased hardware with no redundancy and no backups (your addition) to compare with, and also thrown in manual deployment (why?).

Most businesses are somewhere between a forgotten old dell server in the closet and fully hosted multi-region auto-scaling fully bought in to AWS, and that's OK!


I am not saying everyone needs to go for AWS, and there are many good reasons for self-hosting. However, if the only reason for self-hosting is higher uptime than AWS or GCP, it is generally a false economy, and you are likely taking shortcuts that you are not aware of.


This is a false dichotomy (I listed many options, not just self-hosting; not many businesses fully self-host these days), and uptime is provably significantly better away from the big hosted services, which are constantly churning features, config and hardware, resulting in outages for everyone hosted on them.


Not the person you're replying to, but...

1: Access to the data center requires either an access card plus biometric verification (fingerprint in one DC in my case, retina scan in the other), or an ID-verified appointment. Then, you still need to know where my server is, plus the access code for the rack (or, you need to be an average lockpicker, but beware, there's cameras...).

2: Each data center has dual-provider AC feeds, plus generators, and provides A/B feeds to my rack. I've not had a dual-feed outage in the last 20 years or so.

3: No cloud provider that I'm aware of guarantees server security or does automated patching (for servers, not services). So, keeping your server up-to-date seems equally important for both cloud and non-cloud scenarios?

4: At least two weeks, I guess? I have sufficient VM host capacity to accommodate 30% unplanned growth, but 100% would require new hardware. So: ordering two servers, installing these in two data centers. But if the new project also requires significant bandwidth, getting new Internet connections in might take longer.

Look, I'm definitely not denying that "the cloud" makes it easier to scale fast, but scaling fast is not an overriding concern for most businesses. Cost is, and self-hosting, even with a pretty redundant infrastructure, is still much cheaper than AWS.


> Even if you can manage more uptime on your own than through the cloud (which I doubt)

Getting higher uptime is super easy for smallish inhouse deployments.

You just don't install any updates and let the server run, trusting your VPN to shield you from possible security issues.

The maintenance burden is the reason people often prefer cloud services, not the uptime. Keeping an instance updated, reading all the patch notes and migration steps, keeping every health metric monitored, and responding quickly to issues without getting stuck googling for possible causes is quite a bit of work and quickly forces you to employ n+1 people.
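
As a small illustration of just the "monitor every health metric" piece (thresholds and the alert hook are made up for the sketch; os.getloadavg is Unix-only), something you might run from cron:

  import os
  import shutil

  DISK_PATH = "/"
  DISK_WARN_FRACTION = 0.90  # warn when the filesystem is 90% full (arbitrary threshold)
  LOAD_WARN = 4.0            # warn when the 1-minute load average exceeds this

  def check_health():
      """Return a list of warning strings for the few metrics this sketch covers."""
      warnings = []
      usage = shutil.disk_usage(DISK_PATH)
      if usage.used / usage.total > DISK_WARN_FRACTION:
          warnings.append(f"disk {DISK_PATH} is {usage.used / usage.total:.0%} full")
      load1, _, _ = os.getloadavg()
      if load1 > LOAD_WARN:
          warnings.append(f"1-minute load average is {load1:.2f}")
      return warnings

  if __name__ == "__main__":
      for warning in check_health():
          # In practice this would go to email/chat/pager; print stands in here.
          print("ALERT:", warning)

The script itself is trivial; the n+1 people come from everything around it: reading the patch notes, deciding what an alert actually means, and being the one who responds at 3am.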


This is particularly true with modern fully solid state hardware. Uptimes of decades are likely possible if you don't mess with anything.

We run some bare metal servers. They just never go down. Solid continuous pings for years as monitored from elsewhere on the Internet. That's because they're just boxes on a rack somewhere running an OS and some steady-state services (ZeroTier roots). Simplicity is more robust than complexity.

SaaS is definitely about the pain of managing and upgrading software, but it's also about OPEX vs CAPEX. Many companies will pay more for things to put them in the OPEX column for various entirely synthetic accounting, investor relations, and tax reasons.

I do wonder if the pendulum there will swing back though since if you price out cloud vs. physical hardware the market has become extremely distorted. Many companies spend enough on AWS to buy an entire rack of hardware at a different data center every month and pay 2-3 employees to manage it. That hardware would be up to 100X as fast and powerful as what they rent at AWS and bandwidth would be almost free. That's a really distorted market. The amortized costs should not be this different.


>Many companies spend enough on AWS to buy an entire rack of hardware at a different data center every month and pay 2-3 employees to manage it. That hardware would be up to 100X as fast and powerful as what they rent at AWS and bandwidth would be almost free. That's a really distorted market. The amortized costs should not be this different.

I think the issue here is that it's not zero-cost to switch. Your processes will adapt to some implicit assumptions that aren't true outside AWS, Azure, or whatever vendor you locked yourself into.

If we somehow managed to have a completely standardized interface here, the market would be more competitive.


Things like OpenStack are promising if they can become easier to install and run.

https://www.openstack.org


It's been a decade. OpenStack is still a disorganized, unstable mess. It's not gonna happen if it hasn't by now.


That is true. But hope a drive-by DDoS doesn't pick you that day!


How do attackers DDoS an in-house deployment that cannot be accessed from the internet without joining a VPN?

They can try to saturate the VPN host, maybe, but that's going to be challenging considering it's going to be limited to connection requests without valid credentials.

And those are likely set to be ignored after multiple failed attempts, through fail2ban or similar tooling.


You have a connection to the internet with an IP address, do you not? That connection can easily be saturated with traffic, and fail2ban would not have a chance to do anything.


Zero Day VPN vulnerabilities


> being on the cloud means downtime is correlated with downtime of other services. That's usually a good thing.

A long time ago, I used to circulate snarky little emails at work.

One of them was responding to this very concept.

My managers were throwing out our working and mature UNIX servers (implementing DNS, Mail, and other services) in favour of NT.

The new system crashed a lot, we had some security breaches, but at least it was 'industry standard.'

Management's response was that with the UNIX stuff we had no one to pin our outages on. Now we could blame Microsoft and call their support line.

I circulated an email making fun of this justification, promoting a fictional product called 'Blame Studio' that would help you map out the blame path for any of your products or services.

It would help to make sure that none of the blame ever landed on you, but rather was always redirected onto some other company.


My desktop PC has better uptime than the cloud right now. So does my datacenter. In fact, I am not sure I have anything electronic that doesn't have better uptime than AWS or Azure.


I'd rather have control over my circumstances than the ability to assign blame for them.

"my outage means the customer is probably also down so they maybe don't care" is not a viable way to run a business in my mind.


One of our on-premise Linux servers recently reached an uptime of 1000 days. Yes, days.


Do you not patch your servers? Or do you not need to reboot Linux machines?


I don't know; we received a happy email from our IT team celebrating 1000 days of uptime just a few days ago.


Can't speak to their particular config, but Canonical Livepatch for Ubuntu can at least apply kernel security patches without the need to restart, and most non-kernel package updates don't require a reboot anyway.


Why would I want to be down when everyone else is down? That's a period where I can get customers from everyone else.


> Your customers will be more understanding if your outage is part of a wider outage that makes national news.

Yes they definitely are.


Or they could see their entire product line break at the same time, in a way they cannot mitigate.


If you happen not to be on the major cloud platform where everybody is enjoying the downtimeshare, we provide you with downtime monkeys who will pull the lever EXACTLY when your platform should go down!

No more will your integrations with systems in the cloud frustrate your customers when they are unreachable -- let the news explain the downtime. JOIN the Downtime Umbrella NOW and receive 5 downtime lever pulls for FREE! /s


One problem facing on-prem orgs today is the sad state of commodity server hardware: poor quality control, buggy firmware (and vendors who won't help), BMCs with massive attack surfaces (which have already claimed one VPN provider), and commodity network switches that can push lots of packets but aren't terribly flexible for people pushing more challenging payloads around their DC (video, etc).

"Hyperscalers" like Amazon, Microsoft, Facebook and Google build their own hardware and are able to avoid many of these problems. Unfortunately, none of this stuff is available off the shelf to mere mortals.

There’s a startup trying to fix this problem (Oxide) which I think is launching their racks next year. Will be interesting to see what happens.


The learned helplessness in the cloud is stupefying: so many outages and so much downtime that could have been avoided by a competent admin.


Also, not all uptime is equal or worth the price.

What would be more valuable to you:

1. 99.5% uptime where unscheduled downtime is max 10 minutes vs.

2. 99.5% uptime where unscheduled downtime comes in 2-5 hour chunks.
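
Either way the yearly downtime budget is identical; rough numbers:

  hours_per_year = 365 * 24                         # 8760 hours
  downtime_budget = hours_per_year * 0.005          # 99.5% uptime allows ~43.8 hours/year
  ten_minute_incidents = downtime_budget * 60 / 10  # ~263 short blips
  long_outages = downtime_budget / 3.5              # ~12 outages, assuming 3.5h (midpoint of 2-5h)
  print(downtime_budget, ten_minute_incidents, long_outages)

Same 99.5%, very different experience for whoever is on the other end of the outage.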


Another key point of this: Selecting your downtime. If you own your own stuff, your major changes and maintenance are done at times favorable for your business. AWS or Azure configuration fat-fingers happen when best for their business, not yours.


> Maybe the cloud has a higher uptime than your on-premise infrastructure (see the AWS, Azure outages).

Not even so sure about that. I've had a ton more downtime ("degraded" in AWS-speak) with AWS than with any self-hosted systems. And that's with more than half my career on self-hosted.

If a major disaster strikes, like the whole rack catching fire and melting everything, then it's true that AWS could recover quicker than self-hosted. But most problems are not of that sort.


I don't see any momentum in that direction at all. If anything, it's quite the opposite.

These outages have almost zero impact on anyone deciding to use a cloud provider.


Agreed. In fact, I've heard of government decision makers actually being swayed to go to the cloud due to outages, with logic along the lines of, "If Amazon can't keep AWS running, what chance does our 30-person, chronically underfunded team have?" There is also safety in having the outage be somebody else's fault, and having that outage affect a huge swath of the economy: "Sorry customer, but it's not just us." Reminds me a bit of the "nobody gets fired for picking IBM" mentality.


I've been getting the "We're sorry!" error for at least 15 minutes. No idea about regions or anything; I was just about to look up their docs about moving accounts between orgs, etc. Anyone else affected?

Edit: Seems to be flapping between the 'sorry' error and a blank page. Thoughts and prayers with the SREs, if they call 'em that over there.


Same here. And as always, the AWS status page [0] is completely green and useless.

[0] https://status.aws.amazon.com/


The management console seems accessible to me; maybe it's just the aws.amazon.com website that's down.


Odd that a major landing page like aws.amazon.com isn't super HA. At a previous $work, the "marketing" page (i.e. the main dot-com site) was horribly mismanaged and had countless outages due to various embarrassing reasons. Every time it was down, customers complained that the product was down (it wasn't, but people used the dot-com page to get to the login page).


Which means people suggesting that others compare their downtime with AWS's should stop suggesting it, as you cannot rely on AWS's data at all.


The only thing that was down was the marketing home page at aws.amazon.com, which is not a "service" and is not on the SHD (Service Health Dashboard). All of the actual services are fine, hence the green.


Because everything works fine? So if a landing page doesn't work, you need to go to DEFCON 3?


How on earth is putting a little red X or whatever next to the service that isn't working, when it isn't working, DEFCON 3?


The landing page is not on the AWS status page.


And if it were, how is marking it as down a DEFCON 3 situation? My point was more that your original comment is a gross exaggeration.


The whole thing reminds me of the early nineties.

There was a whole mature infrastructure built around UNIX and open standards at the time.

We were told to scrap that and replace it with Windows NT and its descendants.

While UNIX wasn't perfect¹, Windows was an order of magnitude worse.

Compared to our UNIX machines, those systems got the internet protocols wrong, were full of security holes and crashed all the time. Objectively worse performance and security was tolerated for a decade or more, just because we all convinced each other that this was the way things were going.

It took Microsoft maybe ten to fifteen years to work their problems out. And now there's a Linux environment built into Windows.

This cloud thing isn't the end of computing history, just like Windows on the server wasn't the end of history.

Cloud isn't better than on prem. It's just popular right now.

1: https://en.wikipedia.org/wiki/Morris_worm


The worst part of AWS outages is that their status page is always green. It's OK to fail, but it's not OK to hide the failure and expect all the customers to stumble upon it themselves, especially for a company as big as AWS. Not OK!


I wonder why the error message is also in … French? (I'm not located anywhere French-speaking.)

> Désolés!

> Une erreur s'est produite lorsque nous avons tenté de traiter votre requête. Soyez assuré que nous travaillons déjà à la résolution du problème que nous pensons trouver très rapidement.

(Roughly: "An error occurred when we tried to process your request. Rest assured that we are already working on resolving the problem, which we expect to fix very quickly.")


Mine is in Japanese:

> 申し訳ありません。

> リクエストの処理中に問題が発生しました。 現在問題を調査しておりますので、解決するまでもう少々お待ちください。

(Roughly: "A problem occurred while processing your request. We are currently investigating the issue, so please wait a little longer until it is resolved.")


[flagged]


I don't follow. It seems like an almost exact translation of what's shown in all the other languages for this error page.


You've never pondered the beauty and provenance of phrases like moushiwake arimasen or omachi kudasai, I take it?

I mean, you could say they mean we're sorry and please wait, but that's also just wrong in a whole other way.


Japanese has multiple politeness levels; in this particular case the message is using keigo, which is the most polite. At a guess I'd say that the machine translation is identical all the way down.


Thanks. I know nothing of Japanese so this was confusing to me.

I'm thinking about it, and I guess there are "politeness levels" in English too, but probably not as much so, and usually in the form of passive aggressiveness.

That's fascinating. So the "style" of language they use has a huge impact on the message, possibly more so than the actual words? Very McLuhanesque :)


You even conjugate the verbs differently for your customers.

Richard Feynman talked about this point really frustrating him and turning him off Japanese entirely. He just wanted to use words to get things done...granted for informational purposes it's nice to not have to conjugate the verb for download based on who downloaded it.


I kind of get that sentiment. Fundamentally I think it's beautiful and I'm delighted by such variety in human languages. But I spent 10 years learning French and it drove me nuts how many different ways you conjugate verbs.


I am in India and my message is in English and Japanese. Never seen a Japanese message on aws before.


I'm in Brazil and also read the message in both English and French.


I'm in Canada, so it sort of makes sense to show a bilingual page.


Unless you're Quebecois, in which case the province would prefer French appear first. _smirk_


Same in the Netherlands.


${jndi:ldap://AWS Down}?


For those out of the loop: this is a reference to the Log4j RCE story [0] that's also currently on the HN frontpage.

Amazon uses a lot of Java, so the two stories might actually be connected somehow.

[0] https://news.ycombinator.com/item?id=29504755


Spoiler: they are. Both AWS and Amazon used Log4j, exploitable from the homepage (/) of both sites through headers.

Source: me


I'm still able to log in to the console. Seems to just be the AWS landing page that is down.


Me too. Though it does seem to be any page for that specific domain. They all return an HTTP 500 (Internal Server Error).

Google "site: aws.amazon.com" and try any of the links.


Only the most informed and rational companies will see how and why most of this "in the cloud" thing is a bad idea.

Most will accept it as a fact of life and continue to pay for it both directly and indirectly, as long as there's cheap money going around the cloud business can't do wrong.

But it's hilarious to see people indulging in byzantine "World scale" resilient systems that depend on a single vendor.


On the contrary, downtime is inevitable and I think I'd rather have downtime when all my competitors do too.


How about having uptime while your competitors have downtime?


If your system integrates with any external systems or APIs, it's likely they have downtime when one of the cloud giants are down, so sometimes being up is not so relevant. If your system is up but everything you depend on is down, how useful is that?

For (almost) entirely self-contained systems it can still be useful, of course. But everything wants to be interconnected to everything these days...


That's why we proudly don't integrate with any third-parties at all, and all of our stuff is developed and hosted in-house. Even "trivial" things like email or billing.


Which, given that downtime is inevitable, leads to you having downtime while all your competitors have uptime.

Have fun with that…


If your non-cloud uptime is even 50% better than the uptime the cloud offers, then over the course of the year you're up longer than the others.


If there are 3 competitors in your market and you lose x orders while you are down, but gain 2*x orders when the other 2 are down, you come out ahead (assuming the same probability of going down as the others).

Real world is not so easy though...


Would you rather have to fix your own data center, or wait 4 hours?

AWS works 99% of the time, plus it's someone else's problem.


I absolutely prefer having the option to go into a datacenter in a hurry and actually fix stuff and be in charge, rather than being stuck waiting an indefinite amount of time, twiddling my thumbs, apologizing to customers and hoping for the best.

While I considered myself a decent Windows NT admin 20 years ago, the reason I went all in on Linux and FLOSS software at the turn of the century was that I dreaded the powerlessness these proprietary solutions left me with when they failed. You'd call the vendors, pore through logs, find obscure, undocumented error codes, etc. With FLOSS and self-hosting you've got all the information at your fingertips. And if you encounter bugs, you can dig into the sources, patch them, re-compile and fix things - and share them with others and feel that you're contributing to our profession.

When I got the chance to do cloud projects over the last 5 - 10 years, I always took those opportunities, hoping to keep up with the tech. At first I was hopeful we could offload the boring ops tasks, take our config management to the next level and automate even more.

With every platform I got to work with - AWS, Azure and GCP so far - we kept finding bugs in their APIs, outdated or otherwise incorrect documentation and very unpredictable performance, unless you can actually run stacks at a scale that averages it out (as in more than just 10 - 20 instances, or a larger clustered SaaS offering of the cloud vendor). Many times we also encountered undocumented limits that required a support request and waiting for approval from the cloud vendor just to get even their mid-sized resources allocated. It all works very nicely on the free-tier-eligible, smallest instances and services, if as slow and high-latency as is to be expected, but as soon as you actually need some decently sized storage, compute or bandwidth, it quickly becomes more expensive than what you can put in two or three datacenters for redundancy yourself, if you look at the yearly costs.

So far none of the PoCs I was involved with ever got approved long term. They mostly end up as reference implementations for our customers or showcase material for the corporate blog. :-(

No thanks, I don't want to go back to feel that powerless as I did on closed systems ever again. Luckily, although many seem to think that cloud is the only option to run at a global scale, you can still provide lots of valuable services on the internet using robust hardware, housed in well connected datacenters.


You may be correct, for some very specific use cases. No matter how much corporate clients want to holler, a few hours of downtime isn't going to hurt anyone.

I know I'd rather someone else do it than have to drive 2 hours to a data center at 4:00 in the morning. But I don't know exactly what you're working on; I can definitely imagine some use cases where a few hours of downtime is just unacceptable. I know I wouldn't want to run a logistics firm with servers that go down all the time.


In my industry, five nines is the starting level. You're proposing something 1,000 times worse.


If you have to have 5 9's, starting with any cloud service is a pretty tough sell at this point. You can build that type of system on top of them, but it takes a willingness to also build your own and use competing services, adding even more complexity. The BYO or competing-services bit is so you can keep the lights on even when the cloud eats it - not if, when. Part of 5 9's is planning for catastrophic failure. Things like: what if the load balancer/router/backhaul/DNS burns out - do you have a spare on hand? Then what happens to services where you had them sticky? And what if both the primary and secondary are out? What then? Lots of planning and making sure you know what to do when each of those cases happens.

The nice bit, though, is that many services can go down. Yeah, it stings a bit (money, reputation, time, etc.), but overall it is not that big of a deal.

But for the places where you cannot go down: tons of planning, tons of backup plans for the backup plans, and a different style of producing code.


Besides the fact that they used "99 percent" in a colloquial, informal way... is that really how calculations like that work, or should work? If you have 99 percent availability versus 99.999 percent availability, maybe it's valid to say the former is 1,000 times worse than the latter. But by the same logic, 99.999 percent availability is 10,000 times worse than 99.9999999 percent availability (where a service only goes down for about a third of a second over a decade), and 99.9999999 percent availability is infinitely worse than something that hasn't had an outage within whatever monitored period (since 100 percent availability is the same as ninety-nine-point-nine-repeating availability).
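
For a concrete sense of scale, here's a quick back-of-the-envelope sketch (plain arithmetic, not anyone's official SLA math) of what those availability figures translate to in downtime per year:

  # Rough downtime-per-year arithmetic for a few availability levels.
  # Illustrative only; real SLAs define measurement windows and exclusions.
  SECONDS_PER_YEAR = 365 * 24 * 3600

  for availability in (0.99, 0.999, 0.99999, 0.999999999):
      downtime_s = (1 - availability) * SECONDS_PER_YEAR
      print(f"{availability * 100:.7f}% -> {downtime_s:12,.3f} s/year "
            f"(~{downtime_s / 3600:.4f} h)")

  # 99%     -> ~87.6 hours/year; 99.999% -> ~5.3 minutes/year (1,000x less);
  # 9 nines -> ~0.03 s/year, i.e. about a third of a second over a decade.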


Then AWS isn't really targeting you as a customer though. Most of the web really isn't that critical and can survive a few hours outage occasionally.


But for those who bought into AWS as a way of improving their performance and stability, "a few hours outage occasionally" can be very expensive. If my company experienced a multi-hour outage "occasionally", it would cost $100K/hour in lost revenue. Your remark reminds me of people who say "Well, Tesla really doesn't mean `autopilot` in the way you think."


You don’t need to build your own data center to host outside AWS.

You can fully manage your hardware and software fleet while still paying a host to provide data center service, and networking too.


Do you think people who manage their own data-centers never have outages?


Running your own data center is on the far other end of the spectrum. There are numerous options between solely using AWS and building a data center.


And people have outages in all of them.


When they do, they at least have the ability to do something about it.


My last outage was a power outage on a sunny day that lasted slightly longer than our battery backups. Without economies of scale, installing/maintaining extra generators for a small self-managed data center is prohibitively expensive.

There's always a service provider somewhere in the chain who can drop the ball.


Like shout at your IT and vent?


Which may not be good for your uptime: https://youtu.be/tDacjrSCeq4


Some of us remember playing Chinese fire drill, and it wasn't fun.


No, not necessarily.


>But it's hilarious to see people indulging in byzantine "World scale" resilient systems that depend on a single vendor.

As far as I can tell, there really hasn't been an AWS outage where you couldn't have avoided issues with multi-region and some careful selection of which products you're using. Which isn't much different from what you would have to do with your own infrastructure.


Try calling "your ops guy" during Thanksgiving or when he is sloshed in a bar when your colocation servers stop responding. No one wants this job anymore. Current technology jobs have already enough unnecessary complexity to deal with uptime as well.


On the other hand try calling "AWS" in the same scenario. My money is on sloshed guy coming back with the faster response.


Have you heard about the "No True Scotsman" logical fallacy?


I don't use Cloud for their uptime, it's all about their managed services for me.


Are you arguing that purchasing and maintaining your own servers is cheaper than using AWS? Let's hear the rationale, then.


Yes, on-prem is generally cheaper in the long-run.

For example, my company recently purchased $400k worth of hardware that would cost $80k/month if it were all EC2. There are amortized costs of the supporting infra (cooling, power) and ongoing maintenance/depreciation, but paying for the hardware in 5 months can't be beat.


You forgot the salaries of the network technicians you wouldn't have otherwise. You now have to manage attacks on your infrastructure yourself. You left out bandwidth costs as well. Your servers will need to be replaced entirely every 5 years, and partially as hardware fails.

It sounds like you're at sufficient scale that it makes sense. For most people, it doesn't make sense.
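
For what it's worth, a tiny break-even sketch using the figures from the comment above. The monthly overhead number is a pure assumption, thrown in to stand for the colo/power/staff/bandwidth costs mentioned here, not anyone's real budget:

  # Hypothetical break-even: $400k of hardware vs $80k/month of EC2.
  # hardware_capex and ec2_monthly come from the comment above;
  # onprem_monthly_overhead is an invented placeholder for colo space,
  # power, bandwidth and extra staff time.
  hardware_capex = 400_000
  ec2_monthly = 80_000
  onprem_monthly_overhead = 15_000

  months = 0
  onprem_total = hardware_capex
  cloud_total = 0
  while onprem_total >= cloud_total:
      months += 1
      onprem_total += onprem_monthly_overhead
      cloud_total += ec2_monthly

  print(f"With these assumptions, on-prem is cheaper from month {months} onward")
  # 400k + 15k*m < 80k*m  =>  m > ~6.2, so roughly month 7 instead of the
  # naive 5-month payback - still well inside the hardware's useful life.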


I wonder if it's the log4j thing... AWS is built on Java, and I'd be very surprised if log4j isn't used heavily there.


I think it is them frantically patching it. There are some Amazon staffers posting in the RCE reddit thread: https://www.reddit.com/r/programming/comments/rcxehp/rce_0da...

> Our slack group for this issue is at 3,400 people, haha. It'd be funny if I wasn't one of them.

> Where do y'all work that has 5000 employees on a single issue??

> One that has an arrow under it's name.


This still seems to work for me, replace region with your region:

https://<region>.console.aws.amazon.com/
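
If you want to quickly check which region-scoped console endpoints are answering at all, here's a rough reachability probe (the region list is just a sample; an HTTP answer, even a redirect to the sign-in page, only means the endpoint is serving something):

  # Rough reachability probe for region-scoped console endpoints.
  # Sample regions only - substitute your own.
  import urllib.request

  REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

  for region in REGIONS:
      url = f"https://{region}.console.aws.amazon.com/"
      try:
          with urllib.request.urlopen(url, timeout=5) as resp:
              print(f"{region}: HTTP {resp.status}")
      except OSError as err:
          print(f"{region}: unreachable ({err})")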


Testes this and it works fine. Thanks!


I don't think you said what you think you said


*tested


This worked for me too, thanks for the tip!


Ah, the debate between managing your own infrastructure or renting it from someone else. Always a fun debate, often centering around uptime far more so than cost.

I'll go ahead and make what will seem like a weird comparison. I sometimes think about this as I do about Microsoft Office.

What?

Over the years the MS Office suite has reached an asymptotic limit with regards to functionality users would actually be willing to pay for. I'll say that somewhere between 2013 and 2016 the suite had everything most users would need to do the usual stuff one does with Word, Excel, PowerPoint, etc. In other words, it became good enough and maybe even better than good enough.

This is the parallel I see with Linux servers and the Internet infrastructure needed to run a reliable site or service. Not just the OS, but the system as a whole has, over time, approached an asymptotic limit on functionality, reliability, uptime, ease of use, capacity, bandwidth, management, etc. Not sure we are at that limit yet. It sure feels like we must be close. As many have echoed, today one can deploy a range of servers in multiple configurations and achieve extremely good uptimes with as much functionality as needed without necessarily having to go to cloud service providers.

Is it the same? Not sure. I haven't operated at the highest scales in building web services, so I don't know. My gut feeling is that, short of things like seriously large DDoS attacks, a self-owned infrastructure could do just as well as something hosted by a cloud provider. With knowledgeable engineers running the show there should not be any issues in setting up, managing, maintaining and supporting such an infrastructure.

Note that I am not hating on cloud services. No. What I am saying is that because technology tends to get better over time, it is only natural that the alternative approach has gotten better and better to the point that it would be perfectly sensible for a business to consider rolling their own rather than automatically resorting to cloud providers.


Yup, my traders are getting the "sorry!" error when trying to log into QuickSight, so looks like another great day at AWS.


Still no post mortem from Monday's outage, right? We're all still in the dark?


When I've worked with AWS and other providers before, they've typically given early indications about potential root causes under NDA. If you're a small customer, you probably get to wait with the rest of us but the big folks probably have some idea of the root cause.


Can confirm this


> Still

It usually takes a month or more after a large outage to see the full breakdown of what happened. People expecting to see it the same week are kinda not living in reality.


Fair, I guess. I would have expected more in the way of official comms though. Most of what I've seen is based on deduction and speculation rather than the horse's mouth.


Status page is all green. All must be ok! /s



(from France)

at first:

  We're sorry!
  An error occurred when we tried to process your request. Rest assured, we're already working on the problem and expect to resolve it shortly.
and now:

  This page isn’t working
  aws.amazon.com is currently unable to handle this request.
  HTTP ERROR 503


Guess whose status page is all green?


I know it is not recommended, but for small(-to-medium) websites you can use DNS failover. I run non-critical services, but I still like to have above-average uptime, so I have them running in two separate physical locations with two different providers. From the point of view of visitors, they have had zero downtime in the last 7 years, in spite of several hardware failures (disks, motherboard, power supply - all resolved within an hour). I know it because I tested it multiple times from various locations, checking what happens if one of them dies.

Of course if you have millions of visitors a day, you'll do better with a more robust setup, but mine costs me €80/month.
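
To make the DNS-failover idea concrete, here's a minimal health-check sketch (the IPs and the /healthz path are made-up examples; a real setup would lean on the DNS provider's API or built-in health checks, plus a low TTL on the record):

  # Decide which origin the A record should point at, based on a health check.
  # Example IPs and the /healthz endpoint are assumptions for illustration.
  import urllib.request

  ORIGINS = ["203.0.113.10", "198.51.100.20"]   # two servers at two providers

  def healthy(ip, timeout=3.0):
      try:
          with urllib.request.urlopen(f"http://{ip}/healthz", timeout=timeout) as resp:
              return resp.status == 200
      except OSError:
          return False

  target = next((ip for ip in ORIGINS if healthy(ip)), None)
  print(f"www.example.com should resolve to: {target or 'no healthy origin!'}")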


It seems like AWS and Google have had more outages than normal recently. I guess this is to do with COVID disruptions? (remote working and such).


This morning I’d bet more on a rushed log4j patch. They use it heavily.


It's kind of terrible that a logging library can lead to such downtime.


When you need to force-update all of prod ASAP? Yes, it is terrible. The fact that log4j even does remote network calls is crazy.


why would remote workers matter? AWS is basically all remote.


Because change can be disruptive.


I just logged in via standard console and things seem to be fine.

That direct URL aws.amazon.com gives me a weird error, but everything else seems to be fine.


Have a couple of Lightsail instances that all went offline around the time the homepage did, and I still cannot connect. Everything in the dashboard says they are healthy and online, but all attempts to access them (either just loading the webpage or SSH) time out.

Edit: SSH access has been restored if using their web client, everything else still times out


Amazon’s going to have to start paying out on some SLAs if this goes on for much longer…


There’s no SLA for the console, IIRC.


Every time I see a situation like this one, I tell myself: "What a wonderful opportunity to go outside and smell the roses".

P.S. A comment from an exhausted engineer to another.

P.P.S. Seriously, I need to get out of the house more.


Remind me why this is better than a $10/year VPS.


You can't possibly hope to match the reliability of AWS with a tiny VPS


Mine has exceeded AWS' uptime in the last 5 years. I think. I wish I had the actual numbers. If not, it's only because I actually didn't care very much about uptime, and did an in-place system upgrade on the live server late in the evening, and when it broke I waited till the next morning to fix it.

Also, the rather rare and brief maintenance window from the provider is always middle of the night for all my customers.


$10… per year? Where?


Yeah, lowest DO is $5 per month. Not even Vultr is that low.


Learn to use search engines and translators, if you stop at Vultr you're BARELY scratching the surface.


I think there is also a level of trust you need with your VPS provider. Even granting that there are cheaper options out there, the better-known ones do have their uses.


IONOS has a $2/mo (or is it $2.50) option, so that's getting close to $10/yr. But that's not very capable.


I see no impact on my infrastructure and the dashboard is still accessible on my side. So I'm not sure about exactly what is down.


Guess it's just the homepage then.


When will people stop paying for multi-AZ ops? You basically pay double for it, and when a whole region goes down, AWS is hosed anyway.


Is web.archive.org itself using AWS? I tried archiving the error page, but I only get a "Sorry / Job failed" page...


I get the "We're sorry!" page on this link, but when I go to console.aws.amazon.com it works.


Yes. I am trying to access the console and there is an error message. I am from India.


Perhaps AWS just hasn't paid their networking egress fees to Amazon, you know?


I'm not seeing any issues with my services in us-east-1 or us-west-2.


Just went up for me, but it was indeed down earlier.


The "We're sorry" page is all I got.


btw status page doesn't show anything


OFC not. They'll deny there's a problem until it goes away.


log4j? :-)


Anyone feel like Big Tech outages are happening more frequently recently? In recent months, we’ve seen Amazon Web Services, Facebook, Gmail, and Twitter go down.

Are we at the point where the people who maintain the infrastructure are now completely different than the ones who built it, and are struggling to keep it running because they don’t understand it as well?


I was thinking something similar. The FAANG interview has been optimized to punish candidates with actual experience in these problems in favor of those that can solve leetcode problems and regurgitate design architectures from youtube. This filters out a lot of engineers with grit that can actually work through tough real world (IT) problems.


I don't even work at FAANG but even we have our own in-house grown compute platforms with layers and layers of abstractions and enormous complexity which were built over several years by hundreds of engineers. These don't resemble public cloud or open source solutions at all, ideas are similar at best. You can have a lot of real world small company knowledge but the moment you walk through the door this experience is worthless, all that counts is how fast you can adapt and how good your understanding of the basics is.

Look at some of the code that was open sourced by Yahoo years ago. Other large engineering organizations today have technology which far exceeds the complexity of what was made public by Yahoo a decade ago. Unfortunately Yahoo is pretty much the only example of a large company open sourcing big portions of their platform code.


Those people, the gritty ones who can't make it through the interview hoops are busy integrating companies into the cloud providers. Now they're taking it easy whilst the new kids are getting a first class lesson in what grit is.


https://en.wikipedia.org/wiki/Availability_bias

Twitter used to go down so frequently the "fail whale" was a whole cultural thing.

Big AWS outages are rare, but hardly new.

https://www.theregister.com/2017/03/01/aws_s3_outage/

https://arstechnica.com/information-technology/2012/10/amazo...


Yep I remember the same issues with Reddit. They’re fairly stable now, but up until about 2015 they had outages maybe once a month or so.


I think also just finally getting to the point where they have a tech debt burden comparable to everyone else. Starting fresh has its advantages...for a while.


Even on HN there are some former AWS employees who talk about how it's all stitched together and flying on a wing and a prayer. Apparently the on-call is just a traumatizing experience. It will take real damage to revenue for management to pay that debt off.


Is there a case of an organization ever paying off "tech debt" (I refuse the term, I call it incomplete software)? I've only ever seen it snowball until the product falls into the sea and they start again fresh.


A well-maintained car can drive for a very long time, but for a poorly maintained car, restoring it to like-new condition rarely makes financial sense. It's usually cheaper to just buy a new car, and/or do the bare-minimum maintenance until it dies completely (then buy a new car).

I'm not sure tech debt can ever be fully paid down. You can pay off a bit, and you can stop the debt from accumulating further, but the only way to realistically unburden yourself of a really horrible, big pile of tech debt is to just rebuild the thing from scratch. Or don't, and just keep making money while you wait for the business to fail. (From an investor's perspective, this is the equivalent of riding the car into the ground.)


I loved a quote someone else wrote here some time ago: "Hackers are just tech debt collectors" [1]. If ever there has been a true quote, it is this.

[1] https://news.ycombinator.com/item?id=29039611


I've never seen it. It's kind of like credit card debt. The people that have it take on progressively more of it to pay off the previous debt, until they reach bankruptcy.

So I do think tech debt is accurate, in that the average human deludes themselves into thinking they will pay it off at some point.


You never pay it off. It's like "the war on crime" or "the war on drugs". Don't call it a war if there is no clear win. You can never win the war, but you MUST win the battles in order to not lose it.


I think the term is fine. I don't know if the "ever accumulating" version is correct, as at some point it levels off, in my experience. At some point things break and you have to take care of it.


It seems like they could at least cordon some of it off into a corral of "legacy products". Or maybe a concept of "next gen" regions that have only a subset of products.


As the layers of the stack increase in depth the probability that any one piece breaks the system approaches one. There will be a Singularity but it will be a horizontal asymptote instead of a vertical one.


I'd say the scale is the problem, not the people.

Every service you mentioned (maybe sans Gmail, the only additions I remember in the last twelve years are a new UI and Hangouts) has bolted on so many features over the last years: AWS was virtual machines + SDNs in the beginning as Amazon only intended to sell spare capacity on their own servers, now it's a global one-stop-shop for everything that can be done on the Internet. Facebook was a social media feed, now it's event coordination, groups, chat, image and media hosting at global scale.

And apparently, no one at these organizations ever thought about re-working their infrastructure with "lessons learned over the last decade" in mind. Every new feature was simply bolted on, on top of an infrastructure that was hardly ever envisioned to reach the scale it runs at today. That sort of refactoring costs serious amounts of money and developer time, not to mention that it doesn't make sense to develop new features on a code base that's going to be shut down in a year. So management doesn't approve it, out of fear they will be "out-featured" by a competitor and cannot react (= copy, like Instagram's Stories, which were a clear rip-off of Snapchat) in time, or that their own KPI/OKR goals - and with them their bonus payments - won't get hit.

That mindset/scale issue is also why IBM mainframes are still so common, why travel PIRs seem to be stuck in formats over half a century old or why "put CSV files on an FTP server" is the standard on bank transfers... big corporate/government clients pay a shitload of money for virtualized mainframes on new hardware that still can run the 70s-era code and even more money for people able to speak COBOL, because that is still cheaper than the alternative - reworking everything from scratch on a modern foundation, testing data integrity and edge cases, revise interfaces to hundreds or thousands of clients. Hell, even Internet standards have the same problem... we are still using protocols like BGP that have been around since before I was born, and tacked on security only a few years ago after a couple of fat-finger incidents.

Modernization in such entities only tends to happen when laws or regulatory frameworks change, and then it can become a real shitshow for those at the lowest rungs of the IT ladder that have to implement them - simply take same-sex marriages and try to shoehorn them into a database that was labeled for "husband and wife", or trans/inter people with gender data represented by a boolean field.


I wonder if we're hearing the full story. Could they be being attacked by foreign hacker farms in Russia and China just to make the US infrastructure look bad?


Maybe we actually have a case for decentralized "web3" networks here. At least for things which really need reliability but can't use whatever e.g. airplanes use (telehealth? online exams?).

Of course you need to make sure the network is actually decentralized (see Solana). Or maybe you can just rely on Amazon and GCP and Azure, because surely they won’t all fail at the same time.


The whole of Amazon isn’t falling, just some services in one region, it is easy enough to build region redundant services but most people don’t because the few hours of downtime per year don’t seem worth it.

You usually just don’t need 100% uptime. “Sorry were closed, come back later” is fine.


Turning my phone off right fucking now!



