Had a meeting where developers were discussing the infrastructure for an application. A crucial part of the whole flow was completely dependent on a single AWS service. I asked if it was a single point of failure. The whole room laughed. I rest my case.
Similar experience here. People laughed, and some said something like "well, if something like AWS falls then we have bigger problems". They laugh because, honestly, it feels too far-fetched to imagine the whole AWS infra going down. Too big to fail, as they say in the US. Nothing short of a nuclear war would fuck up the entire AWS network, so they're kinda right.
Until something like this happens: a single region in a cascade failure, and your SaaS is single-region.
They’re not wrong though. If AWS goes down, EVERYTHING goes down to some degree. Your app, your competitor’s apps, your clients’ chat apps. You’re kinda off the hook.
Why would your competitors go down? AWS has at best 30-35% market share. And that's ignoring the huge mass of companies who still run their infrastructure on bare metal.
Everyone using Recall for meeting recordings is down.
In some domains a single SaaS dominates, and if that SaaS sits on AWS, AWS's 35% market share doesn't matter: the SaaS that covers 80% of the domain is on AWS, so the effect is wider than AWS's market share alone.
We're on GCP, but we have various SaaS vendors on AWS so any of the services that rely on AWS are gone.
Many chat/meeting services also run on AWS Chime so even if you're not on AWS, if a vendor uses Chime, that service is down.
Part of the company I work at does infrastructure consulting. We are in fact seeing companies move to bare metal, with the rise of turnkey container systems from Nutanix, Purestorage, Redhat, and others. At this point in time, a few remotely managed boxes in a rack can offer a really good container experience for very little effort.
And this comes at a time when regulations like DORA and BaFin are tightening things - managing these boxes becomes less effort than maintaining compliance across vendors.
There have been plenty of solutions for a while: Pivotal Cloud Foundry, OpenShift, etc. None of these were "turnkey" though. If you're doing consulting, is it more about newer, easier-to-install-and-manage tech, or is it cost?
I'm not in our infrastructure consulting group, but based on their feedback and some conference talks a while back: vendors have been standardizing on Kubernetes components and mechanics and a few other protocols, which simplifies configuration and greatly reduces how much infrastructure-level configuration you have to do.
Note, I'm not affiliated with any of these companies.
For example, Purestorage has put a lot of work into their solution, and for a decent chunk of cash you get a system that slots right into VMware, offers iSCSI for other infrastructure providers, offers a CSI plugin for containers, and speaks S3. Integration with a few systems like OpenShift has been simplified as well.
This continues: you can get ingress/egress/network monitoring compliance from Calico slotting in as a CNI plugin, other systems manage supply chain security, and so on. Something like Nutanix is an entirely integrated solution: you rack it and you have container orchestration with storage and all of the cool things.
Cost is not really that much of a factor in this market. Outsourcing regulatory requirements and liability to vendors is great.
Because your competitor probably depends on a service which uses AWS.
They may host all their stuff in Azure, but use CloudFront as a cache, which runs on AWS and goes down.
>> People laughed and some said something like "well, if something like AWS falls then we have bigger problems".
> They’re not wrong though. If AWS goes down, EVERYTHING goes down to some degree. Your app, your competitor’s apps, your clients’ chat apps. You’re kinda off the hook.
They made their own bigger problems by all crowding into the same single region.
Imagine a beach with ice cream vendors. You'd think it would be optimal for the two vendors to split it, one half north, one half south. However, in trying to steal some of the other vendor's customers, you end up with two ice cream stands in the center.
So too with outages. Safety / loss of blame in numbers.
The question really becomes: did you lose money that you wouldn't have made back once services came back up? As in, will people just shift their purchase to tomorrow when you are back online? Sure, some percentage is completely lost, but you have to weigh that lost amount against the ongoing costs of being multi-cloud (or multi-provider), and the development time on top of those costs. For most people I think it's cheaper to just be down for a few hours. Yes, this outage is longer than any I can remember, but most people will shrug it off and move on once everything comes back up.
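To put completely made-up numbers on it: if a once-a-year, four-hour outage truly costs you, say, a couple of thousand in sales that never come back, but a credible multi-cloud setup costs an engineer-month up front plus ongoing operational overhead, rolling the dice wins by a wide margin. The math only flips when downtime losses are genuinely large and unrecoverable.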
At the end of the day most of us aren't working on super critical things. No one is dying because they can't purchase X item online or use Y SaaS. And, more importantly, customers are _not_ willing to pay the extra for you to host your backend in multiple regions/providers.
In my contracts (for my personal company) I call out the single point of failure very clearly, and I've never had anyone balk. If they did, I'd offer them resiliency (for a price), and I have no doubt they would opt to "roll the dice" instead of paying.
Lastly, it's near-impossible to verify what all your vendors are using, so even if you manage to get everything resilient it only takes one chink in the armor to bring it all down (see: us-east-1 and the various AWS services that rely on it even if you don't host anything in us-east-1 directly).
I'm not trying to downplay this, pretend it doesn't matter, or anything like that. Just trying to point out that most people don't care because no one seems to care (or want to pay for it). I wish that was different (I wish a lot of things were different) but wishing doesn't pay my bills and so if customers don't want to pay for resiliency then this is what they get and I'm at peace with that.
I don't really like AWS myself and prefer a self-hosted VPS, or even Google Cloud / Cloudflare, so I agree with what you are trying to say. But let me play devil's advocate.
I mean, I agree, but where else are you going to host it? If you host it yourself and it turns out to be an issue and you go down, then that's entirely on you, and 99% of the internet still works.
But if AWS goes down, let's say 50% of the internet goes down with it.
So, in essence, nobody blames a particular team or person, just as the parent comment said: nobody gets fired for picking IBM.
Still, what worries me is such massive centralization of servers that we have a single switch which can turn half the internet off. So I am a bit worried about the centralization side of things.
It was pretty frustrating to me when multiple services weren't working during the outage. I'm an end user; I don't use AWS, nor do I work on anything using AWS.
From my perspective, multiple unrelated websites quit working at the same time. I would rather have had one website down, and the rest working, than for me to be completely hamstrung because so many services are down simultaneously.
If you were dependent upon a single distribution (region) of that service, then yes, it would be a massive single point of failure in this case. If you weren't dependent upon a particular region, you'd be fine.
Of course lots of AWS services have hidden dependencies on us-east-1. During a previous outage we needed to update a Route53 (DNS) record in us-west-2, but couldn't because of the outage in us-east-1.
So AWS's redundant availability goes something like: "Don't worry, if nothing is working in us-east-1, it will trigger failover to another region" ... "Okay, where's that trigger located?" ... "Also in the us-east-1 region" ... "Doesn't that seem like a problem to you?" ... "You'd think it might be! But our logs say it's never been used."
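For what it's worth, the partial mitigation we looked at afterwards was pre-provisioning a Route53 failover record pair, so the health-check-driven switch happens in the (globally distributed) data plane and nobody has to touch the us-east-1 control plane mid-incident. A rough sketch with the AWS SDK for JavaScript v3 - the zone ID, hostnames and health check ID are all made up:

    // Sketch only: zone ID, hostnames and health check ID are invented.
    // Idea: keep a PRIMARY/SECONDARY failover pair in place ahead of time, so
    // Route53's data plane flips traffic on a failed health check without
    // anyone needing the control plane during an outage.
    import {
      Route53Client,
      ChangeResourceRecordSetsCommand,
    } from "@aws-sdk/client-route-53";

    const client = new Route53Client({});

    // (assumes an ES module context for top-level await)
    await client.send(
      new ChangeResourceRecordSetsCommand({
        HostedZoneId: "Z0000000EXAMPLE",
        ChangeBatch: {
          Changes: [
            {
              Action: "UPSERT",
              ResourceRecordSet: {
                Name: "api.example.com",
                Type: "CNAME",
                SetIdentifier: "primary-us-east-1",
                Failover: "PRIMARY",
                TTL: 60,
                ResourceRecords: [{ Value: "api-use1.example.com" }],
                HealthCheckId: "11111111-2222-3333-4444-555555555555",
              },
            },
            {
              Action: "UPSERT",
              ResourceRecordSet: {
                Name: "api.example.com",
                Type: "CNAME",
                SetIdentifier: "secondary-us-west-2",
                Failover: "SECONDARY",
                TTL: 60,
                ResourceRecords: [{ Value: "api-usw2.example.com" }],
              },
            },
          ],
        },
      })
    );

Of course this only helps if whatever you're failing over to isn't itself quietly dependent on us-east-1.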
Relying on AWS is a single point of failure. Not as much as relying on a single AWS region, but it's still a single point.
It's fairly difficult to avoid single points of failure completely, and if you do it's likely your suppliers and customers haven't managed to.
It's about what your risk tolerance is.
AWS us-east-1 fails constantly, it has terrible uptime, and you should expect it to go down. A cyberattack that destroyed AWS's entire infrastructure would be less likely. BGP hijacks across multiple AWS nodes are quite plausible, though that can be mitigated to an extent with direct connects.
Sadly, it seems people in charge of critical infrastructure don't even bother thinking about these things, because next quarter's numbers are more important.
I can avoid London as a single point of failure, but the loss of Docklands would cause so much damage to the UK's infrastructure that I can't confidently predict my servers in Manchester, connected to peering points such as IXman, would be able to reach my customer in Norwich. I'm not even sure how much international connectivity I could rely on. In theory Starlink would continue to work, but in practice I'm not confident.
When we had power issues in Washington DC a couple of months ago, three of our four independent ISPs failed, as they all had undeclared dependencies on active equipment in the area. That wasn't even a major outage, just a local substation failure. The one circuit which survived was clearly just fibre from our (UPS/generator-backed) equipment room to a data centre towards Baltimore (not Ashburn).
Same thoughts. Company is currently migrating from tech A to tech B, and while AI gets us 70-80% of the way, due to the riskier nature of the business, we now spend way more time reviewing the code.
Very interesting thought process, with lots of nitty-gritty details. I recently had an idea for a repetitive process at work, and decided to try building it as a TUI. Oh, what a ride it was!
Even armed with a library like Charm's Bubble Tea in Golang, it's sometimes just amazing how all the internals "click" together to render layouts and widgets.
I started with Jekyll, and tried out Hugo since there was all the hype.
I didn't have the same issues as OP, but the Hugo docs are so confusing that I just gave up and switched back. Jekyll works OK for my simple needs, and still does.
I've tried this before, with a service worker[1] that intercepts TS/X-ish requests and directs them over to sucrase[2] to compile to JS before being loaded by the browser. Unfortunately, sucrase seems to no longer be maintained.
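Roughly the shape of it, from memory (so the paths, extensions and bundling details are approximate; it assumes the worker gets bundled so sucrase can be imported):

    // sw.ts - rough sketch from memory, not the exact code I had.
    // Intercept .ts/.tsx requests, run the source through sucrase, and hand
    // the browser back plain JavaScript it can execute directly.
    import { transform } from "sucrase";

    self.addEventListener("fetch", (event: FetchEvent) => {
      const url = new URL(event.request.url);
      if (!/\.(ts|tsx)$/.test(url.pathname)) return; // everything else passes through untouched

      event.respondWith(
        (async () => {
          const source = await (await fetch(event.request)).text();
          const { code } = transform(source, {
            transforms: ["typescript", "jsx"], // strip types / compile JSX, no type checking
          });
          return new Response(code, {
            headers: { "Content-Type": "text/javascript" },
          });
        })()
      );
    });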
I'm honestly impressed that the Sutro team could write a whole post complaining about Flash, not once mention that Flash was actually two different models, and even go further to compare the price of Flash non-thinking to Flash Thinking. The team is either scarily incompetent or purposely misleading.
Google replaced Flash non-thinking with Flash-Lite. That rebalanced the cost of Flash thinking.
Thanks a lot, friend. But one of the issues with this is that I need to know about requests, and sometimes their names can be different. I had actually created a CLI tool called uvman which was meant to automate that part too.
But my tool was really finicky, and I guess it was built by AI, so, um, yeah. I guess you all can try it; it's on PyPI. I think it has a lot of niche cases where it doesn't work. Maybe someone can modify it to make it better, as I built it 3-4 months ago, if I remember correctly, and I have completely forgotten how things work in uv.