Hacker News
AWS down again? (amazon.com)
271 points by raffraffraff on Dec 10, 2021 | hide | past | favorite | 240 comments


I expect better of the community here. All it takes is a chance to take a cheap shot at one of the “big boys” and then all of a sudden the weasels come scampering out of the woodwork.

Seriously, to those commenting “oh boy! Time to rethink this whole cloud thing!”: you’re either so new to this stuff that you don't remember the days before cloud, you’re trolling because you’re high on nostalgia for the good ol’ days (perhaps with some “member berries” https://youtu.be/mPs8-ZBSjok), or you’ve just straight up forgotten what it takes to build and run your own infra from the ground up.

And to those who happen across this comment before posting your own: please contribute something to the discussion beyond the above. Thanks.


Well, at least you said please. However:

I have built and run my own infrastructure from the ground up, and I've been made to transition to the cloud.

The experience hasn't been great. It may simply be sour grapes because, after all the expertise a whole generation has built up learning UNIX and all the internet protocols (DNS, ARP, email, reading RFCs, networking, routing), we get told that all that old stuff is just 'legacy' and that we should retool around Amazon's proprietary services instead.

Some of us old timers argued against this only to be shouted down by people who don't even understand TCP/IP.

The current generation couldn't invent the internet. You know why? Because they would never have the patience to spec it out like the old-timers did. Go read a few RFCs and try to imagine a scrum team today putting as much thought into an up-front design.

Today we'd just cruft together an MVP, solve only the interesting parts (or more likely the easy parts), and then move on, letting dashboards that lie cover it up.

Many of us have been taking shots at those 'big boys' since the start of this trend.

Now that, because of recent events, we have a chance to be heard, you're telling us to be quiet. Why? What are you afraid of?


> The current generation couldn't invent the internet. You know why? Because they would never have the patience to spec it out like the old-timers did. Go read a few RFCs and try to imagine a scrum team today putting as much thought into an up-front design.

This is a staggering bit of revisionism. As someone who was around at the time, I remember that many RFCs were written based on already-working code. They had some advantages: nobody much cared what they were doing, so they didn't have to answer to multiple levels of management, and they clearly gave almost no thought to security from bad actors, but if you think there aren't people today--including at those very cloud providers you disdain--doing work at least as well-thought-out as those early pioneers, you haven't been paying any attention at all.

I'll stay off your lawn, but maybe take off those rose-colored glasses and stop pretending the past was rosy.


That's an interesting position to take, because I imagine if a team of developers decided to invent the internet in a vacuum today, it'd be a hell of a lot more secure than the "let's hope nobody uses this protocol maliciously" attitude prevalent in the early days of the internet.

Not that that is a bad thing, but just something to think about.


It wasn't an attitude of naive hope that there wouldn't be nefarious actors leveraging the protocol; it was a different kind of person using the Internet back then.

There's no need to design security into the system when you literally know everyone who is using it. And everyone who was using it had the same goals in mind.

So, I don't disagree with the sentiment -- people today would probably do it a little bit differently; however, I do disagree with the expression -- people designing these protocols weren't naive. They were trustful because they had to be.

In the early days of building something new, nothing works without trust; not the Internet... not Bitcoin... not a nascent venture... nothing.


While I don't disagree, if people at the time had assumed that everyone on the network could be trusted (forever), why design the IPv4 address space to make room for 4 billion devices? Why support so many ports and concurrent connections? The two assumptions don't quite match up.


> 4 billion devices? Why support so many ports and concurrent connections? The two assumptions don't quite match up.

Because when TCP and its predecessors were invented, there were only a few computers in the entire world. The initial ARPAnet had only 4 hosts (by September 1973 there were 42 computers connected to 36 nodes).

But each computer had many users. That's why there were so many ports: the thinking was there would be big computers with many users, each running their own internet-connected clients and servers.

That was true even at the beginning of the 1990s; when I went to high school, I had access to a Unix system shared between 2000+ people.


To expand on what's being referenced here, consider the following: video game speedruns.

Throughout the 80s, 90s, and early-to-mid 2000s, there was a certain level of trust in the claims people made about PBs (Personal Bests) and WRs (World Records). There was no practical way to record, host, or especially upload literal hours of footage (VHS footage) of a run you did. Even if you did somehow achieve all of the above, it would be a grainy, low-quality video that was hard to see, maybe with a stopwatch nearby so people could verify your claim. People would be watching this through RealPlayer, if they could watch it at all!

So what do you do in such a situation where people have no practical or easy means to verify claims? You build credibility off of how active you are with other members of the community. You post and comment on forums about what strategies you're trying, what difficulties you're dealing with, and what new information you might have uncovered through trial and error. You don't prove your work, you prove your worth. Your standing is evidence of your claim.

To me, this is a great example of the "personality-credit" communities that have existed online, Usenet and BBSes aside. The mentality has largely faded away with improvements to bandwidth and services like Twitch and YouTube, but considering the technological challenges someone in, say, 1993 would have faced in trying to prove they had just set a new record really gives a glimpse into what things used to be like.


People do think about it. RFCs are still being written and revised by people daily. IETF, W3C, and others are publishing new standards like QUIC. We understand the risks a lot better now than when those original RFCs were written because the world has changed a lot since then, and in no small part because of them.


In the early days of the internet it was a closed network of academic and government properties. Nobody at that time would have guessed it would grow into even the 80s-style internet, let alone what we have now.


"Secure" as in "security by closed source and obscurity, because, hey come on, we need to upsell you on 'enterprise' features"?

Yes, of course. Not really any different from the Internet we have today, though.


> we get told that all that old stuff is just 'legacy' and that we should retool around Amazon's proprietary services instead.

New cloud products are targeted at new companies and services.

If a company has already invested all of the R&D into building self-hosting and they've got it running properly with well-defined and measurable economics, it doesn't really make sense to upend it all and rebuild in the cloud.

But for new services, embracing hosted platforms is a proven accelerant for development. Skip past the solved problems like hosting, get straight to work on the interesting problems your business is trying to solve.

> The current generation couldn't invent the internet. You know why? Because they would never have the patience to spec it out like the old-timers did.

Oh please. This is just "back in my day" ranting about "kids these days" and how you think one generation is superior to the other. Give it up.


Hm, my understanding/memory of the history of internet technologies has a lot more "let's try this and see if it works" in it than your comment about the role of speccing things out and up-front design would suggest. I think there were actually quite a lot of people in labs just doing things, coordinating informally with people they knew personally in other labs.

Yes, there were also, at various points, specs and big-picture thinking, sure.


None of that stuff is legacy. It's just centralized. Economies of scale. Go work for an infrastructure provider, the same way recruiters largely work for recruiting firms instead of all shops having their own in-house.

The fact that the people who could invent the Internet mostly work for a few giants doesn't mean they no longer exist in the current generation.


(Assumedly) much, much younger person here who has spent the majority of her career on-prem, has lifted multiple shops to AWS, and right now works full-time on an all-AWS stack.

Your experience with UNIX isn't worthless, but it's worthless to anyone who's working "further up the stack" than you. If you feel like your skills are degrading, then you need to find a job somewhere that's actually building infra that shops will build on top of. Your skills aren't day-to-day anymore, and that's a good thing. Your generation made all that junk turnkey, the same way you probably think about dealing with Ethernet frames. Taking my first AWS job literally obsoleted all my systems knowledge -- a super humbling experience -- it wasn't completely worthless, but none of the problems I'd spent years working out solutions to even existed anymore.

You're confusing people building products with people building infrastructure -- the devops role makes this messy because you're usually "using" infra tools like a dev rather than building them. If you're working on foundational elements, then literally nothing has changed. If you're shipping products, then absolutely retool around a cloud provider; the infra isn't your secret sauce, and if you have to move back on prem because of cost, that's a good problem to have.

I mean this completely sincerely, take a job at a company that's providing hosting/services to people. All the old timers with deep deep systems knowledge are gods.


> You're confusing people building products with people building infrastructure

Even though I'm on the side of the "UNIX graybeards" here, this is a super-great point. We do need to recognize that it is a great time to be building networked applications, precisely because the younger ones these days don't need to understand TCP/IP or anything else related to infrastructure.

I confess that I got caught up in the hype as well and built quite a few "multi-AZ" apps that I thought would help me get to five nines that much faster. (For non-cloud folks, that's 99.999% availability, which was something to pursue before the cloud.)

Of course, when those abstractions break, those same younger ones are completely helpless, and my single-server apps have been running non-stop for years at traditional providers except for a few minutes for reboots following updates. I've never had a multi-hour outage, especially one that's completely out of my control where I can only point a finger at AWS and say, "sorry, it's not my fault."


This!

Not to mention the amount of garbage in the cloud, and the constant learned helplessness we have to endure even knowing that the situation could have been avoided, or at least mitigated or solved, if access to the box were possible.

The status-quo of the cloud is uninspiring to say the least...


I use "cloud" services, but I ensure that my systems continue even if they fail. If AWS is down, maybe I can't do some analysis on historic logs until it comes back, that's a known failure

I look at the components of a system and think "what happens if this is turned off // breaks in an unusual way // goes slow", and ensure that the predicted effects are known and acceptable based on the likelihood of failure.

That's the same whether it's an AWS managed DNS service, storage bucket, or a raspberry pi on my desk. As a systems engineer I know what that component does, what happens when it doesn't "do", and ensure the business knows how to work around it when it breaks.

If your business can't cope with an AWS outage (even if it's not as efficient) then you've got problems.

Plan for failure, and it doesn't take you by surprise.
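
A minimal sketch of that mindset, with a hypothetical analytics dependency and URL: the caller keeps working when the dependency is off, slow, or broken, and the degraded result is an explicit, known outcome rather than an unhandled exception.

  import json
  import urllib.error
  import urllib.parse
  import urllib.request

  ANALYTICS_URL = "https://analytics.internal.example/query"  # hypothetical dependency

  def fetch_report(query, timeout=2):
      """Ask the analytics backend for a report; degrade gracefully if it's off, slow, or broken."""
      url = f"{ANALYTICS_URL}?q={urllib.parse.quote(query)}"
      try:
          with urllib.request.urlopen(url, timeout=timeout) as resp:
              return json.load(resp)
      except (urllib.error.URLError, OSError, ValueError):
          # Known, accepted failure mode: the report is unavailable until the
          # dependency comes back, but the rest of the system keeps working.
          return {"status": "degraded", "detail": "analytics backend unreachable"}

The shape of the fallback is the business decision; the try/except is the easy part.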


This reminds me of something a volunteer at our campus observatory told me once while I was in college. I don't recall what we were discussing, but their comment stands out to me today: "Don't engineer things to work. Engineer things not to fail."

Often at work I see code implementations that "work" in that they usually work but can fail. I'm not a great engineer by any means (I'm actually an economist who stumbled into code), but I believe one of the reasons I've been able to gain a good reputation at work is precisely because I design with that principle in mind.


This conflates two very different things: cut-throat KPI-driven development practices at companies like Amazon, and the entire current generation of software developers. You're taking problems caused by the environment and attributing them to a moral deficiency in the people, which is neither fair nor helpful.


Environment (in general) is what determines skill set, not the other way around. If the interview process makes you focus on leetcode skills, and your manager focuses on LOCs produced and hitting story milestones rather than on spending time integrating with the team and learning the legacy codebase, it makes sense that those who come out of this environment would be less prepared to tackle certain kinds of problems.


> The experience hasn't been great. It may simply be sour grapes because, after all the expertise a whole generation has built up learning UNIX and all the internet protocols (DNS, ARP, email, reading RFCs, networking, routing), we get told that all that old stuff is just 'legacy' and that we should retool around Amazon's proprietary services instead.

Losing skills you worked on for years is just part of this space. We are continually building on new abstractions so that we can focus on building solutions.

This really feels like a "kids these days" rant. Scrum doesn't mean you can't do up-front design.


The entirety of Stack Overflow runs on something like 4 machines. Abstraction layers are expensive, and having to learn scaling methodologies when a better up-front choice of technology would render them unnecessary is very un-agile.


Stack Overflow doesn't run on just 4 machines. Even in 2016 it required significant hardware:

- 4 Microsoft SQL Servers (new hardware for 2 of them)
- 11 IIS Web Servers (new hardware)
- 2 Redis Servers (new hardware)
- 3 Tag Engine servers (new hardware for 2 of the 3)
- 3 Elasticsearch servers (same)
- 4 HAProxy Load Balancers (added 2 to support CloudFlare)
- 2 Networks (each a Nexus 5596 Core + 2232TM Fabric Extenders, upgraded to 10Gbps everywhere)
- 2 Fortinet 800C Firewalls (replaced Cisco 5525-X ASAs)
- 2 Cisco ASR-1001 Routers (replaced Cisco 3945 Routers)
- 2 Cisco ASR-1001-x Routers (new!)


Compare this to the many companies spending 6+ figures on AWS, and ask which has more traffic.


IMHO these are entirely different skillsets, and a division-of-labor question rather than some sort of insurmountable generation gap. It's not like infrastructural know-how isn't relevant anymore; they just became CDN engineers or DevOps or senior scaling or reliability engineers. Their jobs are no easier than before, especially when you have to consider network traversals across layers of virtualized/containerized services across multiple data centers owned by disparate parties and maintained by different vendors.

Virtualization aside, we haven't abandoned basic infrastructure, but centralized it in the hands of a few huge, expert providers. IMO this is a good thing, and was both necessary and natural as the Web grew to offer more and more opportunities to more new professionals. In detaching HTML from HTTP from ARP, etc. we gave rise to entire new professions like full-time UX (which arguably the Old Guard was never good at beyond a small audience of academics and engineers), or various flavors of front-end developer, or serverless ecosystems.

The Web and associated technologies advanced so quickly it was impractical for a single IT or network department to know all of it anymore, and some of the newer webapps wouldn't have been possible if that same team or company had to also manage all of their own basic infrastructure like it was the 90s still.

Now you can be a front-end only shop, or a UX consultant, or a network engineer who never has to touch HTML, or, or, or... maybe big enterprises always had and could always have all of those in-house, but the division of labor has been a huge boon for small businesses and startups and nonprofits, who just don't have the same resources.

As someone who grew up configuring zmodem and running BBSes and having to (mis)configure NetBIOS all the time, I am so, so glad I never have to worry about OSI layers and such ever again. It's boring to me, and the experts at it are SO much better at it, might as well let them handle it. Especially when the cost of that outsourcing is often like <$100/mo. Well worth both the time and money... and sanity. The division of concerns lets you focus on the things you're either interested in and/or good at.

Our professionals haven't gotten worse. The stack has gotten much deeper.


I see this as akin to how factory work has changed with modernization. In a modern, highly automated factory you need a much smaller number of highly specialized engineers to maintain the robots. In the analogy, these fewer highly specialized engineers are the ones who need to understand TCP/IP in depth and now run AWS. The rest of the workers, now replaced by robots, can move on to different productive work.


I mean, because you're wrong? Even with the recent outages, AWS still has far higher uptime and better support than anything someone could cobble together in their own small company. That's the advantage of using cloud infrastructure financed by billions of dollars.

> The current generation couldn't invent the internet.

Come on now, this is an overdone, lame argument and I can't believe you're seriously suggesting this. Do you also lament the fact that kids these days can't bind their own books? The point of tools is to be built upon, not to sit around marveling at your own genius. If you build a good tool, the folks that come after you don't have to think about it. That's how you make progress.


> The current generation couldn't invent the internet.

As an old fart myself, this is a very cheap argument. The old generation wasn't superior. The old generation couldn't invent the transistor; how useless we are!

You always move up the stack as tech progresses, and that's a good thing.


> Now that because of recent events and we have a chance to be heard

Is there anything about this particular incident that is new that contributes to your position?

Maybe we run in different tech circles, but I feel like the "on-prem vs. cloud" debate has been litigated fairly extensively here and elsewhere. In fact, as you said yourself:

"Many of us have been taking shots at those 'big boys' since the start of this trend."


Interesting. The infra engineer at my previous company fits your description, yet he was the one pushing for our AWS migration.


Your own lack of knowledge about how to make the cloud work properly doesn't mean it's completely useless. The "old school" knowledge is still very useful in building and troubleshooting cloud-based infrastructure. You're creating a false dichotomy.


False dichotomy, if you say so.

I haven't found it too helpful when dealing with AWS and the serverless trend, whose popularity is really just based on price and economics, not technical superiority.

Serverless turns every simple system into a distributed system with the number of failure modes now multiplied by ten.

That sounds like fun.

I do know how to use serverless, for the record; I just think it's an overhyped, overpriced waste of time.


You just moved the goalposts from "the cloud" to "serverless". Anyway, this discussion is full of overgeneralizations that are not useful. I'm abstaining from further comments.


I'm not moving anything, just augmenting my point. Reread my comment and do s/serverless/cloud/ if you want, I still stand behind it either way.


[flagged]


You're just trolling. "The cloud" doesn't exist as a single entity. I'm sure people in various other AWS regions, and all of GCP, Azure, and DigitalOcean, are working just fine.


Y'know, debates about cloud vs in-house uptime aside, there's one thing I'm really grateful to the cloud vendors for: making downtime somebody else's problem.

Rather than a late-night panicked run to the smoking server, now we can just shrug and wait a few hours and it fixes itself. Most websites aren't that critical, and it's nice not having to lose sleep over devops issues.


I mostly agree, but this is a double-edged sword. I recently had a 6 hour outage due to lots and lots of waiting on a fix from our managed database provider. Had we been running the DB ourselves, I would have had the permissions to monitor the DB and caught the issue early (it was a pretty trivial thing to monitor). If we had gone down, I would have been able to run my restore within an hour, rather than 6. And lastly, self-hosted performance was much better than it's been with a managed DB.


I find most cloud users still have to panic at 3am when there is a cloud outage because usually there are knobs to tweak in their application to mitigate the outage. For example, do a region failover, turn off some feature that depends on dynamodb, or push an emergency release to make the application server handle spurious 403 errors from an Amazon backend.

Since Amazon's failure modes are so varied, it's impossible to make the above tweaks fully automated. There'll always be a new way the cloud can misbehave, and often you can at least maintain partial service by being nimble enough.
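
To make one of those knobs concrete, here's a rough sketch (all names hypothetical) of a kill switch for a DynamoDB-backed feature plus a retry wrapper that treats spurious 403/5xx responses from a backend as transient:

  import os
  import random
  import time

  def feature_enabled(name):
      # Hypothetical kill switch: operators flip an env var (or a config service entry)
      # to turn off a dependency-heavy feature during a provider outage.
      return os.environ.get(f"FEATURE_{name.upper()}", "on") == "on"

  def call_with_retry(fn, attempts=3, transient_statuses=(403, 500, 503)):
      """Call fn() -> (status, body); retry with backoff on statuses treated as transient."""
      for attempt in range(attempts):
          status, body = fn()
          if status not in transient_statuses:
              return status, body
          if attempt < attempts - 1:
              time.sleep(2 ** attempt + random.random())  # exponential backoff with jitter
      return status, body  # still failing; the caller decides how to degrade

  def recommendations(user_id, fetch_from_dynamodb):
      """Return recommendations, or an empty list when the feature is off or the backend is flaky."""
      if not feature_enabled("recommendations"):
          return []  # degraded but functional: skip the feature during the outage
      status, items = call_with_retry(lambda: fetch_from_dynamodb(user_id))
      return items if status == 200 else []

The point isn't the few lines of code; it's that someone has to decide, per feature, what "degraded but up" means before the outage happens.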


If you're going closer-to-the-metal cloud, then yeah. I think that's part of the allure of newer, more abstracted cloud hosts like Vercel, Netlify, Gatsby, etc.

They take care of all the infrastructure for ya, and if something breaks, you just twiddle your thumbs while they twiddle the dials.

Of course this sort of laziness is not a valid approach if you're a huge enterprise or running some life-saving project that needs seventy nines of uptime, but for non-critical websites, it's a great sigh of relief...


>making downtime somebody else's problem.

It's even more than that: it makes responsibility for downtime somebody else's problem. You can now simply point to "AWS is down", "AWS is slow", "AWS is causing errors" -- and there is nothing we can do about it :). And as long as management knows everyone is having the same problem, they are perfectly fine with it.


Do people not take responsibility for their choice of vendors? If Boeing crashed a bunch of planes and said "Not our fault, our subcontractor made shitty software", would we accept it? If a startup leaked all their customers' data, would we accept "Blame this other company, we just gave them all our data for analysis" as an excuse?

If you accept that, then it means we need to investigate and scrutinize the whole supply chain of any product before trusting it, and it would also mean that any company switching even the smallest part of their supply chain would require us to re-evaluate their product.

When we buy things (whether products, services or software), we consider the company selling it responsible for the whole, and deferring responsibility to a vendor is basically just saying that you were either promising more than you knew you could deliver or that you did not do a good job picking vendors.


It used to be "No one got fired for buying IBM." Now it's AWS. Many CTOs are very risk averse, and having someone to blame is incredibly valuable to them. My company has a large multi-region set of datacenters, but we're using AWS for a lot of new projects, despite being able to deliver similar services with better performance and cost, on-prem.

Some of the reasons: no one wants to be seen using old tech and end up stuck as the last COBOL dev. So there's a big incentive for developers and their managers to be using the latest tech; it looks good on the dev's resume and grants them job security in the field, and it makes their managers look like they're sharp and on top of new technology.

Also, recruiting people familiar with AWS services is often easier than finding someone versed in on-prem tech. And for companies in less than desirable locations, this helps remove a staffing issue since it can all be done remotely.

And when something goes wrong on-prem, the CTO is expected by the rest of the C-suite to fix the issue. When it's a reputable 3rd party, he can absolve both himself and his team of responsibility.


> he can absolve both himself and his team of responsibility.

Everything up until this sentence makes sense in a twisted sort of way, but that's the part I don't get. When did we absolve companies from their choice of vendors? If I buy a service and that service does not work, then the company selling that service does not get to blame the vendors they chose, just like they don't get to blame the individual devs they hired.


The company trusts the CTO; the CTO trusts AWS, and if AWS goes down, he can deflect responsibility onto AWS. The other C-Suite officers are largely ignorant about tech (that's why they have a CTO), and nod their heads.


>When did we absolve companies from their choice of vendors?

>just like they don't get to blame individual devs that they hired.

When the vendor becomes a symbol, gains de facto public utility status, or is practically the only choice. And this isn't a tech-company thing; it applies to every other industry as well. You can fire a dev and hire another. If you are looking at cloud vendors, you only have AWS, Azure and GCP to choose from. For lots of reasons the latter two may not be an option, due to competition or features. And AWS is the best you can get; are you sure the devs you have are the best you can hire?


> a symbol, a public utility status or practically the only choice

The first I can agree with, but the latter two are definitely not the case. If they were a public utility they would be regulated as one, and if they were the only choice there would not be companies online during these outages. The site you are writing this on is an example: an online service with a pretty large (although not massive) following that is online enough to be the place where all the people who work at companies that depend on cloud services come to discuss their outages.


>Public Utility

It's not about whether they are a utility in the business sense, or in the political sense of something that has to be regulated. They are utility-like because people treat them as one and they behave like one. Many things go down with AWS, just like many things go down with the grid.

HN is comparatively tiny and simple. It doesn't require any of the features on AWS, and has no reliability requirements like a SaaS does. HN doesn't even use a CDN. Even though I am not a cloud advocate, in 99% of cases it makes zero sense to build a SaaS on your own servers.


Life-critical industries like airlines, cars, healthcare, etc. are probably the exception?

If a restaurant ran out of some menu item for a few days because a supplier didn't have a good harvest, eh, it happens. If a bookstore doesn't stock a certain book, no big deal. Grocery store self-checkout broken yet again? Oh well. Band got too drunk and can't perform? Reschedule it. No snow at the slopes? Climate change.

In the real world, shit happens, and most of it isn't critical.

If you offer a 99.999% guarantee and your customers pay for it, well, you better deliver -- subcontractors or not. But many businesses don't need to do that and a few hours of downtime a year is just a minor inconvenience. To go from "a few days of downtime" to "a few hours" is really easy; just about any provider can offer that. To go from "a few hours of downtime" to "a few seconds" is a huge investment and not worth it to many folks.


> You can now simply point to AWS is down

Heck, if that was even possible... "everything is green" dashboards... :-)


Obviously, the answer is to pay for a cloud dashboard-of-dashboards that show orange lights when AWS's is falsely green.

"We need this new fancy dashboard to monitor our other dashboards in one place, more accurately!"


It's not "weasels" scampering out of the wood work.

This is a community of highly technical users expressing legitimate frustration and concern that the biggest player in cloud hosting is

  a) having a service-impacting event, and
  b) showing a status page that says everything is up
That's why people talk about the benefits of alternate hosting arrangements. If the status page said what was happening, there would be no need to ask "is AWS down?"; the status would be clear.


> oh boy! Time to rethink this whole cloud thing!

The problem isn't the cloud; the problem is all of these web properties going down at once, exacerbated by the fact that AWS charges a premium for any viable cross-region / multi-cloud architecture (given its relatively high egress fees). And that's not even counting the interdependence among AWS services themselves, some of which aren't multi-region (afaik).


> I expect better of the community here. All it takes is a chance to take a cheap shot at one of the “big boys” and then all of a sudden the weasels come scampering out of the woodwork.

Engineers like to build stuff. Historically, running your own infrastructure and configuring servers from the ground up were the building blocks of the internet. These new cloud services took away those building blocks and they had the nerve to charge for it.

In the real world, there are more interesting problems to solve than setting up and maintaining your own servers, so it isn't really something to complain about. It's actually more fun to get the hosting stuff out of the way and focus on the problem at hand.

HN comments are basically notorious for exaggerating the downsides of hosted services while overplaying the ease and benefits of DIY. A lot of the comments here are similar to the famous Dropbox comment thread where users couldn't understand why anyone would want to use Dropbox when they could simply set up a complicated self-hosted rsync contraption to sync their own files.


Serverless is just a buzzword for a container on a virtual machine on a server. The cloud is just a buzzword for a company with a data centre selling virtual servers carved out of other, bigger servers.

> You're either so new to this stuff that you don't remember the days before cloud

I do, and it was much more peaceful.

Everything you can do "in the cloud" you can do in colocation. Which in the long run is cheaper, more secure* and it's yours! Including the data.

There are caveats: networking and component failure. Invest in these and you can have a pretty kingly setup.

The cloud enabled orders of magnitude more email spam, brute-forcing, botnets, security vulnerabilities and much more. Operators are lazy and don't want to combat it. Providers are biased, but then again you can say that about any business.

People flock to the cloud like it's the greatest thing, when all you're buying into is an expensive price plan from a company that will happily knock you off their service if you somehow rub them the wrong way, and then charge you for the closed account.

I do laugh when something goes wrong for FANG. Partly I'm cynical and want to see the world burn, but it's also that these companies exploit their user base, their staff and the environment's resources. When Facebook locked themselves out of their own offices due to the BGP issue, now that was funny.

My colocation costs are:

$5000 covers three years of 1Gbit, 2U rack space in two different DCs, where I have full control, as many services as I desire, am allowed to host what I want, and where the internet address space is actually mine. It may be a small cube of the internet, but I know it's my network, my traffic.

Cloud is whitewash to me; I won't buy into it. It has its purposes, and if you're happy with it, fine. But for me, it's colo for life.


Preach it brother!

Also, you should consider renting out space on your cube.

You could even design APIs around the internet protocols so they can create their own DNS records, mailboxes. Wait...


I'd like to remind everyone of Uber's experience: no EC2-like functionality until at least 2018, and probably still none today. Teams would negotiate with the CTO for more machines. Uber's container-based solution didn't support persistent volumes for years. Uber's distributed database was based on FriendFeed's design and was notoriously harder to use than DynamoDB or Cassandra. Uber's engineers couldn't provision Cassandra instances via API; they had to fill in a 10-pager to justify their use cases. Uber's on-rack router broke back in 2017 and the networking team didn't know about it because their dashboard was not properly set up. Uber tried but failed to build anything even close to S3. Uber's HDFS cluster was grossly inefficient and expensive. That is, Uber's productivity suffered because they didn't have the out-of-the-box flexibility offered by the cloud.

Now, how many people think they can do better than Uber?


Most of those issues sound like management/process issues, not technical issues. Our staff can't provision AWS instances without extensive review...


You're correct that the knee jerk reactions are a bit over the top, but what is so interesting is how little the benefits of cloud are discussed here on HN. People who haven't used cloud native services think it's just EC2 and someone else's computer. But the idea of compute and data all being controlled through a single set of APIs, controlled by the same security syntaxes, is pretty incredible. Of course I'm a fan of the idea of on-prem alternatives, but the unified architecture is nice.


Disagree on two counts.

1) Like any utility, the cloud is predicated on the assumption that it never goes down. When AWS itself goes down, it causes knock-on effects for most mid-sized (i.e. 1-region, multi-AZ) services on their platform, so we can't have that.

2) If we're going to have to accept downtime despite (1), then yes, it IS time we rethink the cloud -- either how we can achieve no downtime, or how we achieve fault tolerance in some other way.


There is nothing wrong with pointing out the warts. I wonder if you've ever been at the mercy of Google/AWS and their lack of customer service when you aren't representing the business $$ of a Fortune 500 company.


I don't think anyone is saying they could do better than AWS at making their own infrastructure. I think the complaint is that it's a single point of failure for a huge swath of the internet.


[deleted]


There never was an outage. It was just the marketing home page at aws.amazon.com that was down for about 30 minutes, and everyone had a knee-jerk reaction of "haha, AWS went down again!!" Everything that mattered worked fine.


Thank you for saying this.


I think this adds some momentum to the pendulum swinging back the other way. Maybe cloud teams can patch your services better than your in-house team can (see the 2 critical issues in Azure in the last 3 months, _caused_ by MS itself).

Maybe the cloud has higher uptime than your on-premise infrastructure (see the AWS and Azure outages). Make sure to compare the actual outage time vs. the stats doctored by various political pressures and weaselly-worded SLAs (what do you mean you had an outage? Only 49% of your requests were failing!).


Even if you can manage more uptime on your own than through the cloud (which I doubt), being on the cloud means downtime is correlated with downtime of other services. That's usually a good thing.

Your customers will be more understanding if your outage is part of a wider outage that makes national news.

Any services you integrate with are likely down too. Two services with 99% uptime each, uncorrelated, together drop to about 98%. The combined uptime doesn't drop if the downtime is perfectly correlated.
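
Quick sanity check on that arithmetic:

  # Two services, each with 99% uptime. If their downtime is independent,
  # both being up at once happens about 98% of the time; if the downtime is
  # perfectly correlated, nothing extra is lost.
  independent = 0.99 * 0.99    # 0.9801, i.e. roughly 98% combined uptime
  perfectly_correlated = 0.99  # outages always overlap, combined uptime stays 99%
  print(independent, perfectly_correlated)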

Even if you don't directly integrate, your customers' workflows might. They see many services down and say, "cool, time to get caught up on laundry". If only you're down, it's more aggravating.


> That's usually a good thing.

For any individual company able to offload the blame, that's great. It's not so great if half the country's doorbells, robot cleaners, various home streaming setups, the baby camera, the fridge, the smart TV and your phone stop working... all at the same time.

In my opinion, the 'downtime' really should be measured as $NUM_SERVICES_STOPPED x $TIME, instead of just $TIME. And in this case I think any long Amazon outage is orders of magnitude worse than your regular old slow IT company's outage.


Wha? We host everything on our own hardware (which is nothing special) and haven't had any downtime this year (yet). And we're just another run-of-the-mill dev shop, very far from the "superstars" who work on these (supposedly extremely stable) platforms.


When I worked as an IT person for a cheap hotel back in the day, I set up a single Compaq PC as the only server (Samba NT DC, file sharing, IP masquerading, web/mail/DNS server, etc) in a closet. It was not on any battery backup, there was no modem. It ran for years without ever rebooting. In a city where power outages were normal.

Similarly, I've had EC2 instances run for years without ever being rebooted or going down. One of them's still running today after 5 years.

But none of those services were being used 24/7; if the internet went down for an entire weekend, or a hard drive was a little bit corrupt but kept running programs in memory, I probably would never have noticed. I've also had EC2 instances literally just fall off the map and sort of disappear, and had them manually replaced without notice by AWS, had virtual drives fail and corrupt, and had calls to services fail. And I've had my desktop's power supply get fried by a power surge.

Without a lot of experience, running systems seemed easy. But as time went on I learned that it can be easy, and it also can go down if somebody blows on it the wrong way. What we see as being reliable may just be chance. The only way to guarantee reliability is to expect that things are going to go down, and design and build it accordingly.

What AWS makes easy is they give you all the components for reliability, but you have to do the plumbing yourself. I'll bet you the people whose services went down did not properly design for reliability, as they were probably running in one region, in one set of AZs, and relied on distributed system operations that can fail, and didn't properly account for how to deal with those failures. One product I maintain on AWS did not go down, but another did.

Also, the more components a system has, the higher probability there is of failure. Big systems are actually more error prone than small ones.


A few questions I have regarding your in-house hardware:

1. How easily can I access your physical servers?
2. What happens if there is a catastrophic failure, for example a local power outage or major flooding?
3. How secure is your server? Are you regularly patching your operating systems?
4. If I want to run a specific project that requires double the capacity of your current hardware, how long is it going to take to get it spun up?


I'm running about 10 servers myself in production. They just do transcoding, so aren't mission-critical, but...

- Regular patching is automated and took about 30 seconds to configure using a script

Regarding running your own physical servers, that is a different ballgame, but for all of my projects, if I need to:

- I can pretty easily spin up VPSs / bare metal servers anywhere (Netcup, Linode, Hetzner, etc.) and provision there while I wait for new hardware to come in
- If you want to double the capacity of your current hardware, you'll have to order it and wait, but it's cheap (vs. the major cloud providers) to way over-provision if you're running your own physical hardware, so you can pretty easily have 2-4x extra capacity and still come out with extra money in your pocket

I host in the cloud, but I think people vastly overestimate how much it saves 90% of cloud customers.


There are many approaches which don't depend on AWS, and not all of them mean hosting your own physical servers, and they certainly don't mean you don't have an off-site backup policy.

There are well-understood answers to all your questions; they are not too difficult, they just cost money. Some businesses choose not to spend that money, some weigh cost-benefit and go for AWS, some decide to go for in-house servers, some go for hosted virtual servers, some go for serverless.

Why does everything have to be built one way?


> they just cost money. Some businesses choose not to spend that money, some weigh cost-benefit and go for AWS, some decide to go for in-house servers, some go for hosted virtual servers, some go for serverless.

I'd also say some choose not to spend the money, but fail to consider the cost of that choice.

For example: Doing old-school manual deployments that require herculean efforts to update at off-hours on the weekends burns people out, and makes it hard to attract new talent. In other words, you've made the decision to spend more money on finding and retaining people. But it's definitely way cheaper to pay for colocating a single Dell server you bought 3 years ago than what you'd spend in the same time on AWS.

And if your hardware never dies, paying for the redundancy might seem silly. A lot like paying for fire insurance despite the fact your house has never even burned down.


Your choice of straw man here is interesting.

Of the options I mentioned, you seem to have picked only self-purchased hardware with no redundancy and no backups (your addition) to compare with, and also thrown in manual deployment (why?).

Most businesses are somewhere between a forgotten old dell server in the closet and fully hosted multi-region auto-scaling fully bought in to AWS, and that's OK!


I am not saying everyone needs to go for AWS, and there are many good reasons for self-hosting. However, if the only reason for self-hosting is higher uptime than AWS or GCP, it is generally a false economy, and you are likely taking shortcuts that you are not aware of.


This is a false dichotomy (I listed many options, not just self-hosting; not many businesses fully self-host these days), and uptime is provably significantly better away from the big hosted services, which are constantly churning features, config and hardware, resulting in outages for everyone hosted on them.


Not the person you're replying to, but...

1: Access to the data center requires either an access card plus biometric verification (fingerprint in one DC in my case, retina scan in the other), or an ID-verified appointment. Then, you still need to know where my server is, plus the access code for the rack (or, you need to be an average lockpicker, but beware, there's cameras...).

2: Each data center has dual-provider AC feeds, plus generators, and provides A/B feeds to my rack. I've not had a dual-feed outage in the last 20 years or so.

3: No cloud provider that I'm aware of guarantees server security or does automated patching (for servers, not services). So, keeping your server up-to-date seems equally important for both cloud and non-cloud scenarios?

4: At least two weeks, I guess? I have sufficient VM host capacity to accommodate 30% unplanned growth, but 100% would require new hardware. So: ordering two servers, installing these in two data centers. But if the new project also requires significant bandwidth, getting new Internet connections in might take longer.

Look, I'm definitely not denying that "the cloud" makes it easier to scale fast, but scaling fast is not an overriding concern for most businesses. Cost is, and self-hosting, even with a pretty redundant infrastructure, is still much cheaper than AWS.


> Even if you can manage more uptime on your own than through the cloud (which I doubt)

Getting higher uptime is super easy for smallish inhouse deployments.

You just don't install any updates and let the server run, trusting your VPN to shield you from possible security issues.

The maintenance burden is the reason people often prefer cloud services, not the uptime. Keeping an instance updated, reading all the patch notes and migration steps, keeping every health metric monitored, and responding quickly to issues without getting stuck googling for possible causes is quite a bit of work and quickly forces you to employ n+1 people.
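
As a small illustration of just the "monitor every health metric" piece (thresholds and the alert hook are made up for the sketch; os.getloadavg is Unix-only), something you might run from cron:

  import os
  import shutil

  DISK_PATH = "/"
  DISK_WARN_FRACTION = 0.90  # warn when the filesystem is 90% full (arbitrary threshold)
  LOAD_WARN = 4.0            # warn when the 1-minute load average exceeds this

  def check_health():
      """Return a list of warning strings for the few metrics this sketch covers."""
      warnings = []
      usage = shutil.disk_usage(DISK_PATH)
      if usage.used / usage.total > DISK_WARN_FRACTION:
          warnings.append(f"disk {DISK_PATH} is {usage.used / usage.total:.0%} full")
      load1, _, _ = os.getloadavg()
      if load1 > LOAD_WARN:
          warnings.append(f"1-minute load average is {load1:.2f}")
      return warnings

  if __name__ == "__main__":
      for warning in check_health():
          # In practice this would go to email/chat/pager; print stands in here.
          print("ALERT:", warning)

The script itself is trivial; the n+1 people come from everything around it: reading the patch notes, deciding what an alert actually means, and being the one who responds at 3am.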


This is particularly true with modern fully solid state hardware. Uptimes of decades are likely possible if you don't mess with anything.

We run some bare metal servers. They just never go down. Solid continuous pings for years as monitored from elsewhere on the Internet. That's because they're just boxes on a rack somewhere running an OS and some steady-state services (ZeroTier roots). Simplicity is more robust than complexity.

SaaS is definitely about the pain of managing and upgrading software, but it's also about OPEX vs CAPEX. Many companies will pay more for things to put them in the OPEX column for various entirely synthetic accounting, investor relations, and tax reasons.

I do wonder if the pendulum there will swing back though since if you price out cloud vs. physical hardware the market has become extremely distorted. Many companies spend enough on AWS to buy an entire rack of hardware at a different data center every month and pay 2-3 employees to manage it. That hardware would be up to 100X as fast and powerful as what they rent at AWS and bandwidth would be almost free. That's a really distorted market. The amortized costs should not be this different.


>Many companies spend enough on AWS to buy an entire rack of hardware at a different data center every month and pay 2-3 employees to manage it. That hardware would be up to 100X as fast and powerful as what they rent at AWS and bandwidth would be almost free. That's a really distorted market. The amortized costs should not be this different.

I think the issue here is that it's not zero-cost to switch. Your processes will adapt to some implicit assumptions that aren't true outside AWS, Azure, or whatever vendor you locked yourself into.

If we somehow managed to have a completely standardized interface here, the market would be more competitive.


Things like OpenStack are promising if they can become easier to install and run.

https://www.openstack.org


It's been a decade. OpenStack is still a disorganized, unstable mess. It's not gonna happen if it hasn't by now.


That is true. But hope a drive-by DDoS doesn't pick you that day!


How do attackers DDoS an in-house deployment that cannot be accessed from the internet without joining a VPN?

They can try to saturate the VPN host, maybe, but that's going to be challenging considering it's going to be limited to connection requests without valid credentials.

And those are likely set to be ignored after multiple failed attempts, through fail2ban or similar tooling.


You have a connection to the internet with an IP address, do you not? That connection can easily be saturated with traffic, and fail2ban would not have a chance to do anything.


Zero Day VPN vulnerabilities


> being on the cloud means downtime is correlated with downtime of other services. That's usually a good thing.

A long time ago, I used to circulate snarky little emails at work.

One of them was responding to this very concept.

My managers were throwing out our working and mature UNIX servers (implementing DNS, Mail, and other services) in favour of NT.

The new system crashed a lot, we had some security breaches, but at least it was 'industry standard.'

Management's response was that with the UNIX stuff we had no one to pin our outages on. Now we could blame Microsoft and call their support line.

I circulated an email making fun of this justification, promoting a fictional product called 'Blame Studio' that would help you map out the blame path for any of your products or services.

It would help to make sure that none of the blame ever landed on you, but rather was always redirected onto some other company.


My desktop PC has better uptime than the cloud right now. So does my datacenter. In fact, I am not sure I have anything electronic that doesn't have better uptime than AWS or Azure.


I'd rather have control over my circumstances than the ability to assign blame for them.

"my outage means the customer is probably also down so they maybe don't care" is not a viable way to run a business in my mind.


One of our on-premise Linux servers recently reached an uptime of 1000 days. Yes, days.


Do you not patch your servers? Or do you not need to reboot Linux machines?


I don't know; we received a happy email from our IT team celebrating 1000 days of uptime just a few days ago.


Can't speak to their particular config, but Canonical Livepatch for Ubuntu can at least apply kernel security patches without the need to restart, and most non-kernel package updates don't require a reboot anyway.


Why would I want to be down when everyone else is down? That's a period where I can get customers from everyone else.


> Your customers will be more understanding if your outage is part of a wider outage that makes national news.

Yes they definitely are.


Or they could see their entire product line break at the same time, in a way they cannot mitigate.


If you happen not to be on the major cloud platform where everybody is enjoying the downtimeshare, we provide you with downtime monkeys who will pull the lever EXACTLY when your platform should go down!

No more will your integrations with systems in the cloud frustrate your customers when they are unreachable -- let the news explain the downtime. JOIN the Downtime Umbrella NOW and receive 5 downtime lever pulls for FREE! /s


One problem facing on-prem orgs today is the sad state of commodity server hardware: poor quality control, buggy firmware (and vendors who won't help), BMCs with massive attack surfaces (which have already claimed one VPN provider), and commodity network switches that can push lots of packets but aren't terribly flexible for people pushing more challenging payloads around their DC (video, etc).

"Hyperscalers" like Amazon, Microsoft, Facebook and Google build their own hardware and are able to avoid many of these problems. Unfortunately, none of this stuff is available off the shelf to mere mortals.

There’s a startup trying to fix this problem (Oxide) which I think is launching their racks next year. Will be interesting to see what happens.


The learned helplessness in the cloud is stupefying: so many outages and so much downtime that could have been avoided by a competent admin.


Also, not all uptime is equal or worth the price.

What would be more valuable to you:

1. 99.5% uptime where unscheduled downtime is max 10 minutes vs.

2. 99.5% uptime where unscheduled downtime comes in 2-5 hour chunks.
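
Either way the yearly downtime budget is identical; rough numbers:

  hours_per_year = 365 * 24                         # 8760 hours
  downtime_budget = hours_per_year * 0.005          # 99.5% uptime allows ~43.8 hours/year
  ten_minute_incidents = downtime_budget * 60 / 10  # ~263 short blips
  long_outages = downtime_budget / 3.5              # ~12 outages, assuming 3.5h (midpoint of 2-5h)
  print(downtime_budget, ten_minute_incidents, long_outages)

Same 99.5%, very different experience for whoever is on the other end of the outage.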


Another key point of this: Selecting your downtime. If you own your own stuff, your major changes and maintenance are done at times favorable for your business. AWS or Azure configuration fat-fingers happen when best for their business, not yours.


> Maybe the cloud has a higher uptime than your on-premise infrastructure (see the AWS, Azure outages).

Not even so sure about that. I've had a ton more downtime ("degraded" in AWS-speak) with AWS than with any self-hosted systems. And that's with more than half my career on self-hosted.

If a major disaster strikes, like the whole rack catching fire and melting everything, then it's true that AWS could recover quicker than self-hosted. But most problems are not of that sort.


I don't see any momentum in that direction at all. If anything, it's quite the opposite.

These outages have almost zero impact on anyone deciding to use a cloud provider.


Agreed. In fact, I've heard of government decision makers actually being swayed to go to the cloud due to outages, with logic along the lines of, "If Amazon can't keep AWS running, what chance does our 30-person, chronically underfunded team have?" There is also safety in having the outage be somebody else's fault, and having that outage affect a huge swath of the economy: "Sorry customer, but it's not just us." Reminds me a bit of the "nobody gets fired for picking IBM" mentality.


I've been getting the "We're sorry!" error for at least 15 minutes. No idea about regions or anything; I was just about to look up their docs about moving accounts between orgs, etc. Anyone else affected?

Edit: Seems to be flapping between the 'sorry' error and a blank page. Thoughts and prayers with the SREs, if they call 'em that over there.


Same here. And as always, the AWS status page [0] is completely green and useless.

[0] https://status.aws.amazon.com/


The management console seems accessible to me; maybe it's just the aws.amazon.com website that's down.


Odd that a major landing page like aws.amazon.com isn't super HA. At a previous $work, the "marketing" page (i.e. the main dot-com site) was horribly mismanaged and had countless outages due to various embarrassing reasons. Every time it was down, customers complained that the product was down (it wasn't, but people used the dot-com page to get to the login page).


Which means people suggesting that others compare their downtime with AWS's should stop suggesting it, as you cannot rely on AWS's data at all.


The only thing that was down was the marketing home page at aws.amazon.com, which is not a "service" and is not on the SHD (Service Health Dashboard). All of the actual services are fine, hence the green.


Because everything works fine? So if a landing page doesn't work, you need to go to DEFCON 3?


How on earth is putting a little red X or whatever next to the service that isn't working, when it isn't working, DEFCON 3?


The landing page is not on the AWS status page.


And if it were, how is marking it as down a DEFCON 3 situation? My point was more that your original comment is a gross exaggeration.


The whole thing reminds me of the early nineties.

There was a whole mature infrastructure built around UNIX and open standards at the time.

We were told to scrap that and replace it with Windows NT and its descendants.

While UNIX wasn't perfect¹, Windows was an order of magnitude worse.

Compared to our UNIX machines, those systems got the internet protocols wrong, were full of security holes and crashed all the time. Objectively worse performance and security was tolerated for a decade or more, just because we all convinced each other that this was the way things were going.

It took Microsoft maybe ten to fifteen years to work their problems out. And now there's a Linux environment built into Windows.

This cloud thing isn't the end of computing history, just like Windows on the server wasn't the end of history.

Cloud isn't better than on prem. It's just popular right now.

1: https://en.wikipedia.org/wiki/Morris_worm


The worst part of AWS outages is that their status page is always green. It's OK to fail, but it's not OK to hide the failure and expect all the customers to stumble upon it themselves, especially for a company as big as AWS. Not OK!


I wonder why the error message is also in … French? (I'm not located anywhere French-speaking.)

> Désolés!

> Une erreur s'est produite lorsque nous avons tenté de traiter votre requête. Soyez assuré que nous travaillons déjà à la résolution du problème que nous pensons trouver très rapidement.

(Roughly: "An error occurred when we tried to process your request. Rest assured that we are already working on resolving the problem, which we expect to fix very quickly.")


Mine is in Japanese:

> 申し訳ありません。

> リクエストの処理中に問題が発生しました。 現在問題を調査しておりますので、解決するまでもう少々お待ちください。

(Roughly: "A problem occurred while processing your request. We are currently investigating the issue, so please wait a little longer until it is resolved.")


[flagged]


I don't follow. It seems like an almost exact translation of what's shown in all the other languages for this error page.


You've never pondered the beauty and provenance of phrases like moushiwake arimasen or omachi kudasai, I take it?

I mean, you could say they mean we're sorry and please wait, but that's also just wrong in a whole other way.


Japanese has multiple politeness levels; in this particular case the message is using keigo, which is the most polite. At a guess I'd say that the machine translation is identical all the way down.


Thanks. I know nothing of Japanese so this was confusing to me.

I'm thinking about it, and I guess there are "politeness levels" in English too, but probably not as much so, and usually in the form of passive aggressiveness.

That's fascinating. So the "style" of language they use has a huge impact on the message, possibly more so than the actual words? Very McLuhanesque :)


You even conjugate the verbs differently for your customers.

Richard Feynman talked about this point really frustrating him and turning him off Japanese entirely. He just wanted to use words to get things done...granted for informational purposes it's nice to not have to conjugate the verb for download based on who downloaded it.


I kind of get that sentiment. Fundamentally I think it's beautiful and I'm delighted by such variety in human languages. But I spent 10 years learning French and it drove me nuts how many different ways you conjugate verbs.


I am in India and my message is in English and Japanese. Never seen a Japanese message on aws before.


I'm in Brazil and also read the message in both English and French.


I'm in Canada, so it sort of makes sense to show a bilingual page.


Unless you're Quebecois, in which case the province would prefer French appear first. _smirk_


Same in the Netherlands.


${jndi:ldap://AWS Down}?


For those out of the loop: this is a reference to the Log4j RCE story [0] that's also currently on the HN frontpage.

Amazon uses a lot of Java, so the two stories might actually be connected somehow.

[0] https://news.ycombinator.com/item?id=29504755


Spoiler: they are. Both AWS and Amazon used Log4j, exploitable from the homepage (/) of both sites through headers.

Source: me


I'm still able to log in to the console. Seems to just be the AWS landing page that is down.


Me too. Though it does seem to be any page for that specific domain. They all return an HTTP 500 (Internal Server Error).

Google "site: aws.amazon.com" and try any of the links.


Only the most informed and rational companies will see how and why most of this "in the cloud" thing is a bad idea.

Most will accept it as a fact of life and continue to pay for it both directly and indirectly, as long as there's cheap money going around the cloud business can't do wrong.

But it's hilarious to see people indulging in byzantine "World scale" resilient systems that depend on a single vendor.


On the contrary, downtime is inevitable and I think I'd rather have downtime when all my competitors do too.


How about having uptime while your competitors have downtime?


If your system integrates with any external systems or APIs, it's likely they have downtime when one of the cloud giants are down, so sometimes being up is not so relevant. If your system is up but everything you depend on is down, how useful is that?

For (almost) entirely self-contained systems it can still be useful, of course. But everything wants to be interconnected to everything these days...


That's why we proudly don't integrate with any third-parties at all, and all of our stuff is developed and hosted in-house. Even "trivial" things like email or billing.


Which, given that downtime is inevitable, leads to you having downtime while all your competitors have uptime.

Have fun with that…


If your non-cloud uptime is even 50% better than the uptime the cloud offers, then over the course of the year you're up longer than the others.


If there are 3 competitors in your market and you lose x orders while you are down, but gain 2*x orders when the other 2 are down, you come out ahead (assuming the same probability of going down as the others).

Real world is not so easy though...


Would you rather have to fix your own data center, or wait 4 hours?

AWS works 99% of the time, plus it's someone else's problem.


I absolutely prefer having the option to go into a datacenter in a hurry and actually fix stuff and be in charge, rather than being stuck waiting an indefinite amount of time, twiddling my thumbs, apologizing to customers and hoping for the best.

While I considered myself a decent Windows NT admin 20 years ago, the reason I went all in on Linux and FLOSS software at the turn of the century was that I dreaded the powerlessness these proprietary solutions left me with when they failed. You'd call the vendors, pore through logs, find obscure, undocumented error codes, etc. With FLOSS and self-hosting you've got all the information at your fingertips. And if you encounter bugs, you can dig into the sources, patch them, re-compile and fix things - and share them with others and feel that you're contributing to our profession.

When I got the chance to do cloud projects over the last 5 - 10 years, I always took those opportunities, hoping to keep up with the tech. At first I was hopeful we could offload the boring ops tasks, take our config management to the next level and automate even more.

With every platform I got to work with - AWS, Azure and GCP so far - we kept finding bugs in their APIs, outdated or otherwise incorrect documentation and very unpredictable performance, unless you can actually run stacks at a scale that averages it out (as in more than just 10 - 20 instances, or a larger clustered SaaS offering of the cloud vendor). Many times we also encountered undocumented limits that required a support request and waiting for approval from the cloud vendor just to get even their mid-sized resources allocated. It all works very nicely on the free-tier-eligible, smallest instances and services, if as slow and high-latency as is to be expected, but as soon as you actually need some decently sized storage, compute or bandwidth, it quickly becomes more expensive than what you can put in two or three datacenters for redundancy yourself, if you look at the yearly costs.

So far none of the PoCs I was involved with ever got approved long term. They mostly end up as reference implementations for our customers or showcase material for the corporate blog. :-(

No thanks, I don't want to go back to feel that powerless as I did on closed systems ever again. Luckily, although many seem to think that cloud is the only option to run at a global scale, you can still provide lots of valuable services on the internet using robust hardware, housed in well connected datacenters.


You may be correct, for some very specific use cases. No matter how much corporate clients want to holler, a few hours of downtime isn't going to hurt anyone.

I know I'd rather someone else do it than have to drive 2 hours to a data center at 4:00 in the morning. But I don't know exactly what you're working on; I can definitely imagine some use cases where a few hours of downtime is just unacceptable. I know I wouldn't want to run a logistics firm with servers that go down all the time.


In my industry, five nines is the starting level. You're proposing something 1,000 times worse.


If you have to have 5 9's, starting with any cloud service is a pretty tough sell at this point. You can build that type of system on top of them, but it takes a willingness to also build your own and use competing services, adding even more complexity. The BYO or competing-services bit is so you can keep the lights on even when the cloud eats it - not if, when. Part of 5 9's is planning for catastrophic failure. Things like: what if the load balancer/router/backhaul/DNS burns out - do you have a spare on hand? Then what happens to services where you had them sticky? And what if both the primary and secondary are out? What then? Lots of planning and making sure you know what to do when each of those cases happens.

The nice bit, though, is that many services can go down. Yeah, it stings a bit (money, reputation, time, etc.), but overall it is not that big of a deal.

But for the places where you cannot go down: tons of planning, tons of backup plans for the backup plans, and a different style of producing code.


Besides the fact that they used "99 percent" in a colloquial, informal way... is that really how calculations like that work, or should work? If you have 99 percent availability versus 99.999 percent availability, maybe it's valid to say the former is 1,000 times worse than the latter. But by the same logic, 99.999 percent availability is 10,000 times worse than 99.9999999 percent availability (where a service only goes down for about a third of a second over a decade), and 99.9999999 percent availability is infinitely worse than something that hasn't had an outage within whatever monitored period (since 100 percent availability is the same as ninety-nine-point-nine-repeating availability).
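
For a concrete sense of scale, here's a quick back-of-the-envelope sketch (plain arithmetic, not anyone's official SLA math) of what those availability figures translate to in downtime per year:

  # Rough downtime-per-year arithmetic for a few availability levels.
  # Illustrative only; real SLAs define measurement windows and exclusions.
  SECONDS_PER_YEAR = 365 * 24 * 3600

  for availability in (0.99, 0.999, 0.99999, 0.999999999):
      downtime_s = (1 - availability) * SECONDS_PER_YEAR
      print(f"{availability * 100:.7f}% -> {downtime_s:12,.3f} s/year "
            f"(~{downtime_s / 3600:.4f} h)")

  # 99%     -> ~87.6 hours/year; 99.999% -> ~5.3 minutes/year (1,000x less);
  # 9 nines -> ~0.03 s/year, i.e. about a third of a second over a decade.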


Then AWS isn't really targeting you as a customer though. Most of the web really isn't that critical and can survive a few hours outage occasionally.


But for those who bought into AWS as a way of improving their performance and stability, "a few hours outage occasionally" can be very expensive. If my company experienced a multi-hour outage "occasionally", it would cost $100K/hour in lost revenue. Your remark reminds me of people who say "Well, Tesla really doesn't mean `autopilot` in the way you think."


You don’t need to build your own data center to host outside AWS.

You can fully manage your hardware and software fleet while still paying a host to provide data center service, and networking too.


Do you think people who manage their own data-centers never have outages?


Running your own data center is on the far other end of the spectrum. There are numerous options between solely using AWS and building a data center.


And people have outages in all of them.


When they do, they at least have the ability to do something about it.


My last outage was a power outage on a sunny day that lasted slightly longer than our battery backups. Without economies of scale, installing/maintaining extra generators for a small self-managed data center is prohibitively expensive.

There's always a service provider somewhere in the chain who can drop the ball.


Like shout at your IT and vent?


Which may not be good for your uptime: https://youtu.be/tDacjrSCeq4


Some of us remember playing Chinese fire drill, and it wasn't fun.


No, not necessarily.


>But it's hilarious to see people indulging in byzantine "World scale" resilient systems that depend on a single vendor.

As far as I can tell, there really hasn't been an AWS outage where you couldn't have avoided issues with multi-region and some careful selection of which products you're using. Which isn't much different from what you would have to do with your own infrastructure.


Try calling "your ops guy" during Thanksgiving or when he is sloshed in a bar when your colocation servers stop responding. No one wants this job anymore. Current technology jobs have already enough unnecessary complexity to deal with uptime as well.


On the other hand try calling "AWS" in the same scenario. My money is on sloshed guy coming back with the faster response.


Have you heard about the "No True Scotsman" logical fallacy?


I don't use Cloud for their uptime, it's all about their managed services for me.


Are you arguing that purchasing and maintaining your own servers is cheaper than using AWS? Let's hear the rationale, then.


Yes, on-prem is generally cheaper in the long-run.

For example, my company recently purchased $400k worth of hardware that would cost $80k/month if it were all EC2. There are amortized costs of the supporting infra (cooling, power) and ongoing maintenance/depreciation, but paying for the hardware in 5 months can't be beat.


You forgot the salaries of the network technicians you wouldn't have otherwise. You now have to manage attacks on your infrastructure yourself. You left out bandwidth costs as well. Your servers will need to be replaced entirely every 5 years, and partially as hardware fails.

It sounds like you're at sufficient scale that it makes sense. For most people, it doesn't make sense.
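
For what it's worth, a tiny break-even sketch using the figures from the comment above. The monthly overhead number is a pure assumption, thrown in to stand for the colo/power/staff/bandwidth costs mentioned here, not anyone's real budget:

  # Hypothetical break-even: $400k of hardware vs $80k/month of EC2.
  # hardware_capex and ec2_monthly come from the comment above;
  # onprem_monthly_overhead is an invented placeholder for colo space,
  # power, bandwidth and extra staff time.
  hardware_capex = 400_000
  ec2_monthly = 80_000
  onprem_monthly_overhead = 15_000

  months = 0
  onprem_total = hardware_capex
  cloud_total = 0
  while onprem_total >= cloud_total:
      months += 1
      onprem_total += onprem_monthly_overhead
      cloud_total += ec2_monthly

  print(f"With these assumptions, on-prem is cheaper from month {months} onward")
  # 400k + 15k*m < 80k*m  =>  m > ~6.2, so roughly month 7 instead of the
  # naive 5-month payback - still well inside the hardware's useful life.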


I wonder if it's the log4j thing... AWS is built on Java, and I'd be very surprised if log4j isn't used heavily there.


I think it is them frantically patching it. There are some Amazon staffers posting in the RCE reddit thread: https://www.reddit.com/r/programming/comments/rcxehp/rce_0da...

> Our slack group for this issue is at 3,400 people, haha. It'd be funny if I wasn't one of them.

> Where do y'all work that has 5000 employees on a single issue??

> One that has an arrow under it's name.


This still seems to work for me, replace region with your region:

https://<region>.console.aws.amazon.com/
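
If you want to quickly check which region-scoped console endpoints are answering at all, here's a rough reachability probe (the region list is just a sample; an HTTP answer, even a redirect to the sign-in page, only means the endpoint is serving something):

  # Rough reachability probe for region-scoped console endpoints.
  # Sample regions only - substitute your own.
  import urllib.request

  REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

  for region in REGIONS:
      url = f"https://{region}.console.aws.amazon.com/"
      try:
          with urllib.request.urlopen(url, timeout=5) as resp:
              print(f"{region}: HTTP {resp.status}")
      except OSError as err:
          print(f"{region}: unreachable ({err})")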


Testes this and it works fine. Thanks!


I don't think you said what you think you said


*tested


This worked for me too, thanks for the tip!


Ah, the debate between managing your own infrastructure or renting it from someone else. Always a fun debate, often centering around uptime far more so than cost.

I'll go ahead and make what will seem like a weird comparison. I sometimes think about this as I do about Microsoft Office.

What?

Over the years the MS Office suite has reached an asymptotic limit with regards to functionality users would actually be willing to pay for. I'll say that somewhere between 2013 and 2016 the suite had everything most users would need to do the usual stuff one does with Word, Excel, PowerPoint, etc. In other words, it became good enough and maybe even better than good enough.

This is the parallel I see with Linux servers and the Internet infrastructure needed to run a reliable site or service. Not just the OS, but the system as a whole has, over time, approached an asymptotic limit on functionality, reliability, uptime, ease of use, capacity, bandwidth, management, etc. Not sure we are at that limit yet. It sure feels like we must be close. As many have echoed, today one can deploy a range of servers in multiple configurations and achieve extremely good uptimes with as much functionality as needed without necessarily having to go to cloud service providers.

Is it the same? Not sure. I haven't operated at the highest scales in building web services, so I don't know. My gut feeling is that, short of things like seriously large DDoS attacks, a self-owned infrastructure could do just as well as something hosted by a cloud provider. With knowledgeable engineers running the show there should not be any issues in setting up, managing, maintaining and supporting such an infrastructure.

Note that I am not hating on cloud services. No. What I am saying is that because technology tends to get better over time, it is only natural that the alternative approach has gotten better and better to the point that it would be perfectly sensible for a business to consider rolling their own rather than automatically resorting to cloud providers.


Yup, my traders are getting the "sorry!" error when trying to log into QuickSight, so looks like another great day at AWS.


Still no post mortem from Monday's outage, right? We're all still in the dark?


When I've worked with AWS and other providers before, they've typically given early indications about potential root causes under NDA. If you're a small customer, you probably get to wait with the rest of us but the big folks probably have some idea of the root cause.


Can confirm this


> Still

It usually takes a month or more after a large outage to see the full breakdown of what happened. People expecting to see it the same week are kinda not living in reality.


Fair, I guess. I would have expected more in the way of official comms though. Most of what I've seen is based on deduction and speculation rather than the horse's mouth.


Status page is all green. All must be ok! /s



(from France)

at first:

  We're sorry!
  An error occurred when we tried to process your request. Rest assured, we're already working on the problem and expect to resolve it shortly.
and now:

  This page isn’t working
  aws.amazon.com is currently unable to handle this request.
  HTTP ERROR 503


Guess whose status page is all green?


I know it is not recommended, but for small(-to-medium) websites you can use DNS failover. I run non-critical services, but I still like to have above-average uptime, so I have them running in two separate physical locations with two different providers. From the point of view of visitors, they have had zero downtime in the last 7 years, in spite of several hardware failures (disks, motherboard, power supply - all resolved within an hour). I know it because I tested it multiple times from various locations, checking what happens if one of them dies.

Of course if you have millions of visitors a day, you'll do better with a more robust setup, but mine costs me €80/month.
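
To make the DNS-failover idea concrete, here's a minimal health-check sketch (the IPs and the /healthz path are made-up examples; a real setup would lean on the DNS provider's API or built-in health checks, plus a low TTL on the record):

  # Decide which origin the A record should point at, based on a health check.
  # Example IPs and the /healthz endpoint are assumptions for illustration.
  import urllib.request

  ORIGINS = ["203.0.113.10", "198.51.100.20"]   # two servers at two providers

  def healthy(ip, timeout=3.0):
      try:
          with urllib.request.urlopen(f"http://{ip}/healthz", timeout=timeout) as resp:
              return resp.status == 200
      except OSError:
          return False

  target = next((ip for ip in ORIGINS if healthy(ip)), None)
  print(f"www.example.com should resolve to: {target or 'no healthy origin!'}")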


It seems like AWS and Google have had more outages than normal recently. I guess this is to do with COVID disruptions? (remote working and such).


This morning I’d bet more on a rushed log4j patch. They use it heavily.


It's kind of terrible that a logging library can lead to such downtime.


When you need to force-update all of prod ASAP? Yes, it is terrible. The fact that log4j even does remote network calls is crazy.


why would remote workers matter? AWS is basically all remote.


Because change can be disruptive.


I just logged in via standard console and things seem to be fine.

That direct URL aws.amazon.com gives me a weird error, but everything else seems to be fine.


Have a couple of Lightsail instances that all went offline around the time the homepage did, and I still cannot connect. Everything in the dashboard says they are healthy and online, but all attempts to access them (either just loading the webpage or SSH) time out.

Edit: SSH access has been restored if using their web client, everything else still times out


Amazon’s going to have to start paying out on some SLAs if this goes on for much longer…


There’s no SLA for the console, IIRC.


Every time I see a situation like this one, I tell myself: "What a wonderful opportunity to go outside and smell the roses".

P.S. A comment from an exhausted engineer to another.

P.P.S. Seriously, I need to get out of the house more.


Remind me why this is better than a $10/year VPS.


You can't possibly hope to match the reliability of AWS with a tiny VPS


Mine has exceeded AWS' uptime in the last 5 years. I think. I wish I had the actual numbers. If not, it's only because I actually didn't care very much about uptime, and did an in-place system upgrade on the live server late in the evening, and when it broke I waited till the next morning to fix it.

Also, the rather rare and brief maintenance window from the provider is always middle of the night for all my customers.


$10… per year? Where?


Yeah, lowest DO is $5 per month. Not even Vultr is that low.


Learn to use search engines and translators, if you stop at Vultr you're BARELY scratching the surface.


I think there is also a level of trust you need with your VPS provider. Even granting that there are cheaper options out there, the better-known ones do have their uses.


IONOS has a $2/mo (or is it $2.50) option, so that's getting close to $10/yr. But that's not very capable.


I see no impact on my infrastructure and the dashboard is still accessible on my side. So I'm not sure about exactly what is down.


Guess it's just the homepage then.


When will people stop paying for multi-AZ ops? You basically pay double for it, and when a whole region goes down, AWS is hosed anyway.


Is web.archive.org itself using AWS? I tried archiving the error page, but I only get a "Sorry / Job failed" page...


I get the "We're sorry!" page on this link, but when I go to console.aws.amazon.com it works.


Yes. I am trying to access the console and there is an error message. I am from India.


Perhaps AWS just hasn't paid their networking egress fees to Amazon, you know?


I'm not seeing any issues with my services in us-east-1 or us-west-2.


Just went up for me, but it was indeed down earlier.


The "We're sorry" page is all I got.


btw status page doesn't show anything


OFC not. They'll deny there's a problem until it goes away.


log4j? :-)


Anyone feel like Big Tech outages are happening more frequently recently? In recent months, we’ve seen Amazon Web Services, Facebook, Gmail, and Twitter go down.

Are we at the point where the people who maintain the infrastructure are now completely different than the ones who built it, and are struggling to keep it running because they don’t understand it as well?


I was thinking something similar. The FAANG interview has been optimized to punish candidates with actual experience in these problems in favor of those that can solve leetcode problems and regurgitate design architectures from youtube. This filters out a lot of engineers with grit that can actually work through tough real world (IT) problems.


I don't even work at FAANG but even we have our own in-house grown compute platforms with layers and layers of abstractions and enormous complexity which were built over several years by hundreds of engineers. These don't resemble public cloud or open source solutions at all, ideas are similar at best. You can have a lot of real world small company knowledge but the moment you walk through the door this experience is worthless, all that counts is how fast you can adapt and how good your understanding of the basics is.

Look at some of the code that was open sourced by Yahoo years ago. Other large engineering organizations today have technology which far exceeds the complexity of what was made public by Yahoo a decade ago. Unfortunately Yahoo is pretty much the only example of a large company open sourcing big portions of their platform code.


Those people, the gritty ones who can't make it through the interview hoops are busy integrating companies into the cloud providers. Now they're taking it easy whilst the new kids are getting a first class lesson in what grit is.


https://en.wikipedia.org/wiki/Availability_bias

Twitter used to go down so frequently the "fail whale" was a whole cultural thing.

Big AWS outages are rare, but hardly new.

https://www.theregister.com/2017/03/01/aws_s3_outage/

https://arstechnica.com/information-technology/2012/10/amazo...


Yep I remember the same issues with Reddit. They’re fairly stable now, but up until about 2015 they had outages maybe once a month or so.


I think also just finally getting to the point where they have a tech debt burden comparable to everyone else. Starting fresh has its advantages...for a while.


Even on HN there are some former AWS employees who talk about how it's all stitched together and flying on a wing and a prayer. Apparently the on-call is just a traumatizing experience. It will take real damage to revenue for management to pay that debt off.


Is there a case of an organization ever paying off "tech debt" (I refuse the term, I call it incomplete software)? I've only ever seen it snowball until the product falls into the sea and they start again fresh.


A well-maintained car can drive for a very long time, but for a poorly maintained car, restoring it to like-new condition rarely makes financial sense. It's usually cheaper to just buy a new car, and/or do the bare-minimum maintenance until it dies completely (then buy a new car).

I'm not sure tech debt can ever be fully paid down. You can pay off a bit, and you can stop the debt from accumulating further, but the only way to realistically unburden yourself of a really horrible, big pile of tech debt is to just rebuild the thing from scratch. Or don't, and just keep making money while you wait for the business to fail. (From an investor's perspective, this is the equivalent of riding the car into the ground.)


I loved a quote someone else wrote here some time ago: "Hackers are just tech debt collectors" [1]. If ever there has been a true quote, it is this.

[1] https://news.ycombinator.com/item?id=29039611


I've never seen it. It's kind of like credit card debt. The people that have it take on progressively more of it to pay off the previous debt, until they reach bankruptcy.

So I do think tech debt is accurate, in that the average human deludes themselves into thinking they will pay it off at some point.


You never pay it off. It's like "the war on crime" or "the war on drugs". Don't call it a war if there is no clear win. You can never win the war, but you MUST win the battles in order to not lose it.


I think the term is fine. I don't know if the "ever accumulating" version is correct, as at some point it levels off, in my experience. At some point things break and you have to take care of it.


It seems like they could at least cordon some of it off into a corral of "legacy products". Or maybe a concept of "next gen" regions that have only a subset of products.


As the layers of the stack increase in depth the probability that any one piece breaks the system approaches one. There will be a Singularity but it will be a horizontal asymptote instead of a vertical one.


I'd say the scale is the problem, not the people.

Every service you mentioned (maybe sans Gmail, the only additions I remember in the last twelve years are a new UI and Hangouts) has bolted on so many features over the last years: AWS was virtual machines + SDNs in the beginning as Amazon only intended to sell spare capacity on their own servers, now it's a global one-stop-shop for everything that can be done on the Internet. Facebook was a social media feed, now it's event coordination, groups, chat, image and media hosting at global scale.

And apparently, no one at these organizations ever thought about re-working their infrastructure with "lessons learned over the last decade" in mind. Every new feature was simply bolted on, on top of an infrastructure that was hardly ever envisioned to reach the scale it runs at today. That sort of refactoring costs serious amounts of money and developer time, not to mention that it doesn't make sense to develop new features on a code base that's going to be shut down in a year. So management doesn't approve it, out of fear they will be "out-featured" by a competitor and cannot react (= copy, like Instagram's Stories, which were a clear rip-off of Snapchat) in time, or that their own KPI/OKR goals - and with them their bonus payments - won't get hit.

That mindset/scale issue is also why IBM mainframes are still so common, why travel PIRs seem to be stuck in formats over half a century old or why "put CSV files on an FTP server" is the standard on bank transfers... big corporate/government clients pay a shitload of money for virtualized mainframes on new hardware that still can run the 70s-era code and even more money for people able to speak COBOL, because that is still cheaper than the alternative - reworking everything from scratch on a modern foundation, testing data integrity and edge cases, revise interfaces to hundreds or thousands of clients. Hell, even Internet standards have the same problem... we are still using protocols like BGP that have been around since before I was born, and tacked on security only a few years ago after a couple of fat-finger incidents.

Modernization in such entities only tends to happen when laws or regulatory frameworks change, and then it can become a real shitshow for those at the lowest rungs of the IT ladder that have to implement them - simply take same-sex marriages and try to shoehorn them into a database that was labeled for "husband and wife", or trans/inter people with gender data represented by a boolean field.


I wonder if we're hearing the full story. Could they be being attacked by foreign hacker farms in Russia and China just to make the US infrastructure look bad?


Maybe we actually have a case for decentralized "web3" networks here. At least for things which really need reliability but can't use whatever e.g. airplanes use (telehealth? online exams?).

Of course you need to make sure the network is actually decentralized (see Solana). Or maybe you can just rely on Amazon and GCP and Azure, because surely they won’t all fail at the same time.


The whole of Amazon isn’t falling, just some services in one region, it is easy enough to build region redundant services but most people don’t because the few hours of downtime per year don’t seem worth it.

You usually just don’t need 100% uptime. “Sorry were closed, come back later” is fine.


Turning my phone off right fucking now!



