Moving from AWS to Bare-Metal saved us $230k per year (oneuptime.com)
291 points by devneelpatel on Nov 16, 2023 | 335 comments



They were paying on-demand EC2 prices, and reserved instances alone would have saved them ~35%; a savings plan even more, which would apply to egress and storage costs too. Anyway, they're still saving a lot more (~55%), but it's not nearly as egregious a difference.


After that 35% savings, they ended up saving about a US mid-level engineer's salary, sans benefits. Hope the time needed for the migration was worth it.


I broke the rules and read the article first:

> In the context of AWS, the expenses associated with employing AWS administrators often exceed those of Linux on-premises server administrators. This represents an additional cost-saving benefit when shifting to bare metal. With today’s servers being both efficient and reliable, the need for “management” has significantly decreased.

I’ve also never seen an eng org where a substantial part of it didn’t do useless projects that never amounted to anything


I get the point that they tried to make, but this comparison between "AWS administrators" and "Linux on-premises server administrators" is beyond apples-and-oranges and is actually completely meaningless.

A team does not use AWS because it provides compute. AWS, even when using barebones EC2 instances, actually means on-demand provisioning of computational resources with the help of infrastructure-as-code services. A random developer logs into his AWS console, clicks a few buttons, and he's already running a fully instrumented service with logging and metrics a click away. He can click another button and delete/shut down everything. He can click a button again and deploy the same application on multiple continents, with static files served through a global CDN and deployed with a dedicated pipeline. He clicks another button and everything is shut down again.

How do you pull that off with "Linux on-premises server administrators"? You don't.

At most, you can get your Linux server administrators to manage their hardware with something like OpenStack, but they would be playing the role of the AWS engineers that your "AWS administrators" don't even know exist. However, anyone who works with AWS only works on the abstraction layers above that which a "Linux on premises administrator" works on.


This is the voice of someone who has never actually ended up with a big AWS estate.

You don't click to start and stop. You start with someone negotiating credits and reserved instance costs with AWS. Then you have to keep up with spending commitments. Sometimes clicking stop will cost you more than leaving shit running.

It gets to the point where $50k a month is indistinguishable from the noise floor of spending.


> This is the voice of someone who has never actually ended up with a big AWS estate.

I worked on a web application provided by a FANG-like global corporation that is a household name, used by millions of users every day, and which can and did make the news rounds when it experienced issues. It is a high-availability multi-region deployment spread across about a dozen independent AWS accounts and managed around the clock by multiple teams.

Please tell me more how I "never actually ended up with a big AWS estate."

I love how people like you try to shoot down arguments with appeals to authority when you are this clueless about the topic and are this oblivious regarding everyone else's experience.


Hrm. I have worked for a global corporation that you have almost certainly heard of. Though it's not super sexy.

The parent you're replying to resonates with me. There's a lot of politics about how you spend and how you commit; it's almost as bad as the commitment terms for bare-metal providers (3, 6, 12, 24-month commits). Except the base load is more expensive.

It depends a lot on your load, but for my workloads (fragile, dumb, but very vertical compute with wide geographic dispersion), the cost is so high that a few dozen thousand has gone unnoticed numerous times, despite having in-house "fin-ops" folks casting their gaze upon our spend.


Hey me too.


Going to be honest: If your AWS spend is well over 6 figures and you’re still click-ops-ing most things you’re:

1) not as reliable as you think you are
2) probably wasting gobs of money somewhere


From the parent poster's comments, the developers could very well be putting together quick proofs of concept.

I’ve set up an “RnD” account where developers can go wild and click-ops away. I also set up a separate “development” account where they can test their IaC manually and then commit it, and it gets tested through a CI/CD pipeline. Then after that it goes through the standard pull request/review process.


> A random developer logs into his AWS console, clicks a few buttons, and he's already running a fully instrumented service with logging and metrics a click away

In a dream. In the real world of medium-to-large enterprise, a developer opens a ticket or uses some custom-built tool to bootstrap a new service, after writing a design doc and maybe going through a security review. They wait for the necessary approvals while they prepare the internal observability tools, and find out that there is an ongoing migration and their stack is not fully supported yet. In the meantime, they need permissions to edit the Terraform files to update routing rules and actually send traffic to their service. At no point do they, or will they ever, have direct access to the AWS console. The tools mentioned are the full-time job of dozens of other engineers (and PMs, EMs and managers). This process takes days to weeks to complete.


> A random developer logs into his AWS console, clicks a few buttons, and he's already running a fully instrumented service with logging and metrics a click away...

This only works that way for very small spend orgs that haven't implemented SOC 2 or the like. If that's what you're doing then you probably should stay away from the datacenter, sure.


> This only works that way for very small spend orgs that (...)

No, not really. That's how basically all services deployed to AWS work once you get the relevant CloudFormation/CDK bits lined up. I've worked on applications designed with high availability in mind, including multi-region deployments, which I could deploy as sandboxed applications on personal AWS accounts in a matter of a couple of minutes.
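For a rough sense of what I mean, here is a minimal CDK sketch (Python, CDK v2); the construct names and regions are just illustrative, not any particular team's setup:

  import aws_cdk as cdk
  from aws_cdk import Stack, aws_lambda as _lambda, aws_apigateway as apigw
  from constructs import Construct

  class SandboxService(Stack):
      def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
          super().__init__(scope, construct_id, **kwargs)
          handler = _lambda.Function(
              self, "Handler",
              runtime=_lambda.Runtime.PYTHON_3_11,
              handler="index.handler",
              code=_lambda.Code.from_inline(
                  "def handler(event, context):\n"
                  "    return {'statusCode': 200, 'body': 'ok'}"
              ),
          )
          # REST endpoint in front of the function; CloudWatch logs/metrics come with it
          apigw.LambdaRestApi(self, "Api", handler=handler)

  app = cdk.App()
  # the same stack deployed to a second region is one extra line
  SandboxService(app, "SandboxUsEast1", env=cdk.Environment(region="us-east-1"))
  SandboxService(app, "SandboxEuWest1", env=cdk.Environment(region="eu-west-1"))
  app.synth()

cdk deploy --all stands the whole thing up and cdk destroy --all tears it down again, which is the point.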

What exactly are you doing horribly wrong to think that architecting services the right way is something that only "small spend orgs" would know how to do?


Your original comment gives the impression that you like AWS because anyone can click-ops themselves a stack, so that's why you got all these click-ops comments.

How is an army of "devops" implementing your CF/CDK stack any different from an army of (lower paid) sysadmins running proxmox/openstack/k8s/etc on your hw?


> Your original comment gives an impression that you like AWS (...)

My comment is really not about AWS. It's about the apples-to-oranges comparison between the job and value-add of a "Linux on-premises server administrator" managing on-premises servers, and the role of an "AWS administrator". Someone needs to be completely clueless about the realities of both job roles to assume they deliver the same value. They don't.

Someone with access to any of the cloud provider services on the market is able to whip up and scale up whole web applications with far more flexibility and speed than any conceivable on-premises setup managed with the same budget. This is not up for debate.

> How is an army of "devops" implementing your CF/CDK stack any different from an army of (lower paid) sysadmins running proxmox/openstack/k8s/etc on your hw?

Think about it for a second. With the exact same budget, how do you pull off a multi-region deployment with an on-premises setup managed by your on-premises Linux admins? And even if your goal is a single deployment, how much flexibility do you have to stand this up to test a prototype and shut the service down afterwards?


> Someone with access to any of the cloud provider services on the market is able to whip up and scale up whole web applications with far more flexibility and speed than any conceivable on-premises setup managed with the same budget.

Bullshit. I've seen people spin wheels for months/years deploying their cloud native jank and you should read the article - it's not nearly the same budget.

> Think about it for a second. With the exact same budget, how do you pull off a multi-region deployment with an on-premises setup managed by your on-premises linux admins?

You do realize things like site interconnects exist, right? And it will likely be cheaper than paying your cloud inter-region transfer fees. You're going to be testing a multi-regional prototype? Please.

Look, there's a very simple reason why folks have been chasing public clouds, and it has nothing to do with their marketing spiel of elastic compute, increased velocity, etc. That reason is simple: teams get control of their spend without having to ask anyone (like the old-school infra team) for permission.


You just log into the server...

Not everything is warehouse scale. You can serve tens of millions of customers from a single machine.


Not on HN, where everyone uses Rust and yet needs a billion-node web-scale mesh edge blah at minimum, otherwise you are doing it wrong. Better to waste $100k per month on AWS because ‘if the clients come, downtime is expensive’ than to just run a $5 VPS and actually make a profit while there are not many clients. It’s the rotten VC mindset. Good for us anyway; we don’t need to make $10B to make the investors happy. Freedom.


Yeah, that's part of it. The other part is that you can move stuff that is working, and working well, into on-prem (or colo) if it is designed well and portable. If everything is running in containers, and orchestration is already configured, and you aren't using AWS or cloud provider specific features, portability is not super painful (modulo the complexity of your app, and the volume of data you need to migrate). Clearly this team did the assessment, and the savings they achieved by moving to on-prem was worthwhile.

That doesn't preclude continuing to use AWS and other cloud services as a click-ops-driven platform for experimentation, and requiring that anything targeting production be refactored to run in the bare-metal environment. At least two shops I worked at previously used that as a recurring model (one focusing on AWS, the other on GCP) for stuff that was in prototyping or development.


> Yeah, that's part of it. The other part is that you can move stuff that is working, and working well, into on-prem (or colo) if it is designed well and portable.

That's part of the apples-and-oranges problem I mentioned.

It's perfectly fine if a company decides to save up massive amounts of cash by running stable core services on-premises instead of paying small fortunes to a cloud provider for the equivalent service.

Except that that's not the value proposition of a cloud provider.

A team managing on-premises hardware barely covers a fraction of the value or flexibility provided by a cloud service. That team of Linux sysadmins does not, nor will it ever, provide the level of flexibility or cover the range of services that a single person with access to an AWS/GCP/Azure account provides. It's like claiming that buying your own screwdriver is far better than renting a whole workshop. Sure, you have a point if all you plan on doing is tightening that one screw. Except you don't pay for a workshop to tighten up screws; you use it to iterate over designs for your screws before you even know how much load they're expected to take.


Counterpoint: most shops do not need most of the bespoke cloud services they're using. If you actually do, you should know (or have someone on staff who knows) how to operate it, which negates most of the point of renting it from a cloud provider.

If you _actually need_ Kafka, for example – not just any messaging system – then your scale is such that you better know how to monitor it, tune it, and fix it when it breaks. If you can do that, then what's the difference from running it yourself? Build images with Packer, manage configs with Ansible or Puppet.

Cloud lets you iterate a lot faster because you don't have to know how any of this stuff works, but that ends up biting you once you do need to know.


> Counterpoint: most shops do not need most of the bespoke cloud services they're using. If you actually do, you should know (or have someone on staff who knows) how to operate it, which negates most of the point of renting it from a cloud provider.

Well said! At $LASTJOB, new management/leadership had blinders on [0][1] and were surrounded by sycophants & "sales engineers". They didn't listen to the staff that actually held the technical/empirical expertise, and still decided to go all in on cloud. Promises were made and not delivered, lots of downtime that affected _all areas of the organization_ [2] which could have been avoided (even post migration), etc. Long story short, money & time were wasted on cloud endeavors for $STACKS that didn't need to be in the cloud to start, and weren't designed to be cloud-based. The best part is that none of the management/leadership/sycophants/"sales engineers" had any shame at all for the decisions that were made.

Don't get me wrong, cloud does serve a purpose and serves that purpose well. But, a lot of people willfully ignore the simple fact that cloud providers are still staffed with on-prem infrastructure run by teams of staff/administrators/engineers.

[0] Indoctrinated by buzzwords
[1] We need to compete at "global scale"
[2] Higher education


> Yeah, that's part of it. The other part is that you can move stuff that is working, and working well, into on-prem (or colo) if it is designed well and portable. If everything is running in containers

Anyone who says that hasn’t done it at scale.

“Infrastructure has weight.” Dependencies always creep in, and any large-scale migration involves regression testing, security, dealing with the PMO, compliance, dealing with outside vendors who may have whitelisted certain IP addresses, training, vendor negotiations, data migrations, etc.

And then even though you use MySQL, for instance, someone somewhere decided to use a “load data into S3” AWS MySQL extension and now they are going to have to write an ETL job. Someone else decided to store and serve static web assets from S3.


I mean, aside from my current role in Amazon, my last several roles have been at Mozilla, OpenDNS/Cisco, and Fastly; each of those used a combination of cloud, colo and on-prem services, depending on use cases. All of them worked at scale.

I specifically said "if it is designed well", and that phrase does a lot of heavy lifting in that sentence. It's not easy, and you don't always put your A-team on a project when the B or C team can get the job done.

The article outlines a case where a business saw a solid justification for moving to bare metal, and saved approximately 1-3 SDE salaries (depending on market) in doing so.

That amount of money can be hugely meaningful in a bootstrapped business (for example, for one of the businesses my partner owns, saving that much money over COVID shut-downs meant keeping the business afloat rather than shuttering the business permanently).


I didn’t mean to imply that you haven’t worked at scale, just that doing a migration at scale is never easy even if you try to stay “cloud agnostic”.

Source: former AWS Professional Services employee. I just “left” two months ago. I now work for a smaller shop. I mostly specialize in “application modernization”. But I have been involved in hairy migration projects.


Most folks aren't focused on portability. Almost every custom built AWS app I've seen is using AWS-specific managed services, coded to S3, SQS, DynamoDB, etc. It's very convenient and productive to use those services. If you're just hosting VMs on EC2, what's the point?


I worked for a large telco, where we hosted all our servers. Each server ran multiple services bare-metal, with no virtualization, and it was easy to roll out new services without installing new servers. In my next job using AWS, I missed the level of control over network elements and servers, the flexibility, and the ability to debug by taking network traces anywhere in the network.


hrm, did you read the article?

> Our choice was to run a Microk8s cluster in a colocation facility

They go on to describe that they use Helm as well. There's no reason to assume that "a fully instrumented service with logging and metrics" still isn't a click and keypress away.

Your points don't make a whole lot of sense in the context of what they actually migrated to.


Emm… run proxmox?


> "I also never seen an eng org where substantial part of it didn’t do useless projects that never amount to anything"

Bootstrapped companies generally don't do this btw. This is a symptom of venture backed companies.


Absolutely not my experience. I used to work for a Japanese company that was almost entirely self funded. They wouldn't even go to the bank and get business loans.

Your description applies to a substantial number of business units in that company. They also had a "research institute" whose best result in the last decade was an inaccurate linear regression (not a euphemism for ML).


I think if you're at the size of "business units" you're not "bootstrapping" any more.


You've never had friends and colleagues working at big (local and international) established companies sharing their experience of projects being canned, and not just repurposed?


Don't do what? Go cloud? Sure they do; they just generally don't get the cloud credits designed to get them hooked.


Apologies - I edited my comment to clarify context.

> I’ve also never seen an eng org where a substantial part of it didn’t do useless projects that never amounted to anything


There's nothing about being bootstrapped vs. venture-backed that lets anyone know, a priori, whether a given project will be successful or not. Something like 80% of startups fail within the first two years.


That is having your cake and eating it too. AWS administrators don't do the same job as on-prem administrators.


Well yeah, that's why they're more expensive.


> I’ve also never seen an eng org where a substantial part of it didn’t do useless projects that never amounted to anything

Name one business (tech or non-tech) where this is OK/accepted and still competitive in capitalism.

How long will we keep making these inflated salaries while being known for being wasteful, globally speaking?


What is that jockey doing on the horse? We only pay him to race!


It’s not like other industries (or academia) are any better. Please.


They also saved the salaries of the team whose job was doing nothing but chasing misplaced spaces in yaml configuration files. Cloud infrastructure doesn't just appear out of thin air. You have to hire people to describe what you want to do. And with the complexity mess we're in today it's not at all clear which takes more effort.


Sorry, what? They're running on k8s and using Helm... so there are still piles of YAML. It's wild to conflate migrating to bare metal with eliminating YAML-centric configuration.


100% this. Cloud is a hard slog too. A different slog, though. We spend a lot of time chasing Azure deprecations. They are closing down one type of MySQL instance, for example, in favor of one which is more “modern”, but from the end-user point of view it is still a MySQL server!


To manage a large fleet of physical servers, you need similar ops skills. You're not going to configure all those systems by hand, are you?


They spent $150,000 on physical servers. Probably 1 or 2 racks. Not much of a 'fleet'.


I mean, “fleet” semantics aside, surely $150,000 of servers is enough for at least one full-time person to be maintaining them. The point is that there are absolutely maintenance and ops costs associated with these servers.


There's maintenance cost to everything that runs a userland.

What we're missing is tracking how much time is spent managing hardware and firmware (or even network config, if we're being generous) versus how much is spent on OS config.

From personal experience (as a sysadmin before it was entirely unsexy as a term) the overwhelming majority of my ops work was done in userland on the machine, maybe something like 96-97% of my tasks were nothing to do with hardware at all.

Since I got rebranded as an SRE, the tools and the pay sure did get a lot better, but the job is largely similar. Running in VMs does make deployment faster, but once deployed I find the maintenance burden to be the same (or perhaps a little more), as things seem to become deprecated or require changes from our cloud vendor a bit more often.


Depends on the size of the fleet.

If you're using less than a dozen servers, manual configuration is simpler. Depending on what you're doing, that could mean serving a hundred million customers, which is plenty for most businesses.


A dozen servers would be pushing it. It's not the size of the fleet, it's the consistency and repeatability of configuration and deployment. I assume there are other, non-production servers: various dev environments, test/QA. How do you ensure every environment is built to spec without some level of automation? It's the old "pets" vs "cattle" argument.

I've worked at companies with their own data centers and manual configuration. Every system was a pet.


Exactly. Last job I worked at there was always an issue with the YAML… and as a “mere” software engineer, I had to wait for offshore DevOps to fix, but that’s another issue.


If you are waiting on a “DevOps department”, it isn’t DevOps… it’s operations.


My company called them the DevOps Team


Did you try and fix it yourself? Was that not allowed?


Yeah that was not allowed. We started off managing our own DevOps in our own AWS region even, separate from the rest of the company. Eventually the DevOps team mandated we move to the same region as everyone else, and then soon after that there was enough conflict with my (offshore) team and DevOps (onshore) that DevOps demanded and got complete control over our DevOps, though not infrastructure.


CDK exists. I haven’t had to use YAML in ages, and I refuse to.


Getting a bare metal stack has interesting side effects on how they can plan future projects.

One that's not immediately obvious is keeping experienced infra engineers on staff, who bring their expertise to designing future projects.

Another is the option to tackle projects in ways that would be too costly if they were still on AWS (e.g. ML training, stuff with long and heavy CPU load).


A possible middle-ground option is to use a cheaper cloud provider like Digital Ocean. You don't need dedicated infrastructure engineers and you still get a lot of the same benefits as AWS, including some API compatibility (Digital Ocean's S3-alike, and many others', support S3's API).

Perhaps there are some good reasons to not choose such a provider once you reach a certain scale, but they now have their own versions of a lot of different AWS services, and they're more than sufficient for my own relatively small scale.
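As a small illustration of that API compatibility: the same boto3 code you would write for S3 works against DigitalOcean Spaces if you point it at a different endpoint (the region, bucket, and credentials below are placeholders):

  import boto3

  spaces = boto3.client(
      "s3",
      region_name="fra1",
      endpoint_url="https://fra1.digitaloceanspaces.com",
      aws_access_key_id="SPACES_KEY",          # placeholder credentials
      aws_secret_access_key="SPACES_SECRET",
  )

  # the usual S3 calls, unchanged
  spaces.put_object(Bucket="my-space", Key="hello.txt", Body=b"hello")
  print(spaces.list_objects_v2(Bucket="my-space")["KeyCount"])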


That’s the niche DigitalOcean is trying to carve out. I’ve always loved and preferred their UI/UX to that of AWS or Azure. No experience with the CLI but I would guess it’s not any worse than AWS CLI.


How would you compare it to Google Cloud Run? Thanks.


Yep and hardware is only getting cheaper. Better to just buy more drives/chips when you need them.


An m4.xlarge (4 vCPU, 16 GB, no storage) is about $100 per month. The underlying CPU is 7 years old (slow! and power hungry) and has 36 threads, which means it runs 18 of these instances.

The total revenue so far for one CPU is 100 x 18 x 12 x 7 ≈ $150k. At the on-demand rate of ~$144/month, it’s about $200k.

A current-gen i9-14700K has 32 threads, but it can run 12 of these instances (capped at 192 GB max memory). This CPU will cost you about $800. Memory is cheap, so for about $1-2k you’re all set, and you have a machine that’s way faster and cheaper.

Basically, buy a bunch of NUCs and you’re saving yourself around $1,500 per month per NUC. It pays for itself in about a month.
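Rough version of that arithmetic, using the same ballpark figures as above (not measured numbers):

  on_demand = 144               # USD/month for one m4.xlarge on demand
  reserved = 100                # the ~$100/month figure above
  instances_per_old_cpu = 18    # instances packed onto one old 36-thread CPU

  # what AWS has billed per CPU over its ~7-year service life
  print(reserved * instances_per_old_cpu * 12 * 7)     # 151200  -> ~$150k
  print(on_demand * instances_per_old_cpu * 12 * 7)    # 217728  -> ~$200k

  # the replacement box: ~$2k of hardware running ~12 such instances
  instances_on_new_box = 12     # memory-bound at ~192 GB / 16 GB each
  monthly_saving = on_demand * instances_on_new_box    # ~$1,700/month per box
  print(2000 / monthly_saving)                         # ~1.2 -> pays for itself in about a month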

Cloud hosting is —insane—

Not even touching memory ballooning for mostly idle applications.

Lastly, don’t give me reliability as an argument. These were all ephemeral instances with no storage, so you’ll have to pay for that slow non-NVMe storage platform anyway.


If you can do it for some much less than Amazon (and all the other cloud vendors), then why don’t you create your own cloud and undercut them?


You wouldn't ask people who cook for themselves why they don't open a restaurant to undercut the competition.


Hosting for the world is different than hosting for yourself.


I'm sure they are also getting better performance as well.

Not sure how to factor that $ into the equation.


Also, I'd imagine most companies can fill unused compute with long-running batch jobs so you're getting way more bang for your buck. It's really egregious what these clouds are charging.


To get real savings with a complex enough project you will need one or more FTE salaries just to stay on top of AWS spending optimizations


Plus...

2x FTEs to manage the AWS support tickets

3x FTE to understand the differences between the AWS bundled products and open source stuff which you can't get close enough to the config for so that you can actually use it as intended.

3x Security folk to work out how to manage the tangle of multiple accounts, networks, WAF and compliance overheads

3x FTEs to write HCL and YAML to support the cloud.

2x Solution architects to try and rebuild everything cloud native and get stuck in some technicality inside step functions for 9 months and achieve nothing.

1x extra manager to sit in meetings with AWS once a week and bitch about the crap support, the broken OSS bundled stuff and work out weird network issues.

1x cloud janitor to clean up all the dirt left around the cluster burning cash.

---

Footnote: Was this to free us or enslave us?


Our experience hasn’t been THAT bad but we did waste a lot of time in weekly meetings with AWS “solutions architects” who knew next to nothing about AWS aside from a shallow, salesman-like understanding. They make around $150k too, by the way. I tried to apply to be one, but AWS wants someone with more sales experience and they don’t really care about my AWS certs


As an AWS Solution Architect (independent untethered to Bezos) I resent that comment. I know slightly more than next to nothing about AWS and I can Google something and come up with something convincing and sell it to you in a couple of minutes!


How do I make $150k (or more) having high-level conversations about AWS with senior software engineers? Seriously. I’m sure it’s not an “easy” job, but fuck, I make less actually writing the software (median SWE salary is something like $140k in the US; it depends on who you ask, but it’s not the $250k+ that Levels.fyi would lead you to believe).


I can guarantee an SA working for AWS makes more than $150k. A returning L4 intern makes that much (former AWS Professional Services employee)

And no one cares about AWS certifications. They are proof of nothing and disregarded by anyone with a modicum of a clue.

I’m speaking as someone who once had nine active certifications and I believe I still have six active ones. I only got them as a guided learning path. I knew going in they were meaningless.


I’m sure everyone here is aware of the colloquial hate that certs get outside of IT (“A+”, etc.), even I am.

What I don’t know is why AWS would rather pay a salesman $150k (or more… I looked up salaries a few months ago, but either way…) to sell the wrong things to customers, rather than have a software engineer who has actually used these products, sell the right thing to customers. I should hope that all AWS Solutions Architects need to pass the cloud fundamentals exam before interacting with customers, but maybe not?

Deming is rolling in his grave.


There are different types of SAs at AWS. There are the generalist SAs who I never worked with and the specialist SAs who have deep experience in a specific area of the industry - not just AWS.

And even they aren’t to be confused with “consultants”. SAs are free to customers and give general guidance and are not allowed to give the customers any code.

Consultants are full time employees at AWS who get paid by the customer to do hands on keyboard work. But even we couldn’t work in production environments. We did initial work and taught the customer how to maintain and enhance the work.

If you don’t know the cloud fundamentals, it’s easy enough to learn what you need to pass a few multiple-choice questions.

As an anecdote, I passed the first one - the Solution Architect Associate - before I ever opened the AWS console.


Thanks for the detailed info.

I’m aware the bar is low when it comes to the entry-level certs, and that’s why I’d hope AWS SAs (the free kind) have to pass one or two.


This cracked me up. I was "asked" to get some AWS certs since I joined a company that was an AWS Partner. We have a new VP that is forcing other people to get them. Big waste of time for all practical purposes.


So to be clearer.

Having an AWS certification is not a requirement or even that important to get a job at AWS in the Professional Services department. Depending on your job position you are required to have certain certifications once you get there.

I now work for a partner and you are required to have a certain number of “certified individuals” to maintain partnership status. But even then, certifications never came up in my three interviews after getting “Amazoned” a couple of months ago.

But then again, after having AWS ProServe on my resume and having been a major contributor to a popular open source project in my niche, doors opened for me automatically.


I didn't mind getting the certifications to "help out" the company, I just find it such a racket: paying for courses, buying books, $200 tests. Some people take months preparing for that stuff! I didn't buy any courses and only spent a few days preparing, but others spend tons of time and money on it.

And based on my own personal interactions with other "certified" individuals, it doesn't actually mean anything.


> Footnote: Was this to free us or enslave us?

I assume whichever provides more margin to Jeff Bezos.


Where I work (hint: very large satellite radio company) this is very much a thing.


I was thinking the same thing. If the migration took more than one man-year then they lost money.

Also what happens at hardware end-of-life?

Also what happens if they encounter an explosive growth or burst usage event?

And did their current staffing include enough headcount to maintain the physical machines or did they have to hire for that?

Etc etc. Cloud is not cheap but if you are honest about TCO then the savings likely are WAY less than they imply in the article.


> If the migration took more than one man-year then they lost money.

Your math is incorrect. The savings are per year. The job gets done once.

> Also what happens at hardware end-of-life?

You buy more hardware. A drive should last a few years on average at least.

> Also what happens if they encounter an explosive growth or burst usage event?

Short term, clouds are always available to handle extra compute. It's not a bad idea to use a cloud load-balancing system anyway to handle spam or caching.

But also, you can buy hardware from amazon and get it the next day with Prime.

> And did their current staffing include enough headcount to maintain the physical machines or did they have to hire for that?

I'm sure any team capable of building complex software at scale is capable of running a few servers on prem. I'm sure there's more than a few programmers on most teams that have homelabs they muck around with.

> Etc etc.

I'd love to hear more arguments.


The job is never done once. Not in hardware.


> Also what happens if they encounter an explosive growth or burst usage event?

TFA states that they maintain their AWS account, and can spin up additional compute in ~10 minutes.


Freedom from vendor lock-in is hard to put a value on, but for me it’s definitely worth a mid-level engineer’s salary in any context.


Yes, as long as you ignore the other literally 100 or so SaaS products that the average enterprise uses. You are always locked in to your infrastructure at any decent scale


If we saved 35% that could hire 20 FTEs.

Not that we'd need them as we wouldn't have to write as much HCL.


Outside of the US you could likely pay for two mid-level or three junior engineers with this, though. E.g. in France, a junior engineer in an average city would likely cost around $46-48k USD in total employer cost, and France is already expensive compared to a lot of other countries with talented engineers.


And the servers are brand new, so maintenance is not a factor yet.


They probably need to now hire 24/7 security to watch the bare metal if they're serious about it, so not sure about that engineer


Onsite security is offered by the colo provider. You can also pay for locked cabinets with cameras and anti-tampering, or even completely caged-off space, depending on your security requirements.


Also, for an uptime site, I'm surprised they didn't use the m7g instance type.

Would've saved another ~30% for minimal difference in performance.

For me this doesn't look like a sensible move especially since with AWS EKS you have a managed, highly-available, multi-AZ control plane.


I’d be so excited to run my company’s observability platform on a single self-managed rack.


Right. Now what's the developer man-hour cost of the move?

Unless their product is pretty static and not seeing much development, they're probably in the negative.


It's not a very good question: they still have AWS compatibility as their failover/backup (it should be live, but that's another matter...).

What's capex vs. opex now? That's $150k of depreciable assets, probably ones that will be available for use long after all the current staff depart.

Everyone forgets what WhatsApp did with few engineers and less hardware. There's probably more than enough room for them to grow, and they have space to increase capacity.

The cloud has a place, but candidly so does a Datacenter and ownership.


> they still have aws compatibility as their fail over/backup

Now you're maintaining two tiers. That's more work, not less.


when we don't optimize for cloud and look at it from this angle and squint, it looks like we're saving money!


> Right. Now what's the developer man-hour cost of the move?

I think the real question is what's the cost to buy an equal number of training hours so you can pretend your resources are that competent.

Imagine a military that never fought, never did significant exercises, and probably doesn't even run cleaning drills anymore on half its inventory. That's basically how I view a company that has had organic IT growth over a few years and hasn't done a major transition in anyone's recent memory.


Going from AWS to something like Hetzner would probably get you most of the way there.


Hetzner in particular is a disaster waiting to happen, but yes I agree with the sentiment. OVH doesn't arbitrarily shut off your servers or close your account without warning.


We’ve been on both Hetzner and OVH for years and have never had this happen.

The move does cost money, once. Then the savings over years add up to a lot. We made this change more than 10 years ago and it was one of the best decisions we ever made.


Hetzner randomly shuts down one of my servers every 2-3 months.


I'm sorry, what? I have been with Hetzner over 10 years hosting multiple servers without issue. To my knowledge there has never been a shutdown without notice on bare metal servers, and it does not happen often. Like once every 2 years.


Hetzner suspended the account of a non-profit org I voluntarily supported, without explaining the reason or giving us the possibility to take our data out. The issue was resolved only after bringing it into the public space. Even there, they at first tried to pretend we were not actually their customers.


Right -- regrettable. And also a very rare anecdote.


I've been using Hetzner for years and what happens every 3-4 years is that a disk dies. So I inform them, they usually replace it within an hour and I rebuild the array, that's all.

Recently I've been moving most projects to Hetzner Cloud, it's a pleasure to work with and pleasantly inexpensive. It's a pity they didn't start it 10 years earlier.


Nice of them to test your failover for you.


I had the same issue, sent them a ticket, they swapped the server, and it has worked fine since then.


Yeah you do have to have redundancy built in, but we don't get random shutdowns.


I think OVH might let your server burn and the backup which was stored next to it (of course) with it ;)


Definitely don't keep all your data in one region only. For object storage, I prefer Backblaze unless you need high throughput.


I've been using both Hetzner and OVH for years and not had a single problem, be it technical or administrative. That does not mean it never happens, but this is just my experience.


> Hetzner in particular is a disaster waiting to happen

Why would you spread FUD? They have several datacenters in different locations, and even if they were as incompetent as OVH (they are not)[0], the destruction of one datacenter doesn't mean you will lose data stored in the remaining ones.

[0] I bet OVH is also way smarter than they were before the fire.


With Hetzner presumably they mean them randomly closing accounts as sibling comments mention, or marking your account as suspicious and requiring you to send them a scan of your passport, and even then there are some comments[1] saying that they were still denied.

[1]: https://news.ycombinator.com/item?id=37108072


I don't believe in "randomly closing accounts". As for sending a scan of my passport/ID I did this when I opened the account with them which was back in 2010 or so.

During that time, one of my servers was hacked once (I was stupid enough to start mining Monero on the same system where I had some other services installed), and another time one of my users had a weak password and his account was sending spam. In both cases they notified me and gave me time to fix the problem. I also appreciate the human contact and quick replies.


If that's the case, probably just going for Spot machines would save them more than that move.


Now you've got to muck up your app with logic to detect spot-market termination, pausing, and recovery. It's turtles all the way down.


Not really, just a sane ASG policy.

Also, there are companies that manage spot price allocation for you, so you should essentially always pay spot+small_x% and never actually get terminated.
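To make "a sane ASG policy" concrete, it is mostly just a mixed-instances group with capacity rebalancing; a boto3 sketch (the group name, subnets, launch template, and instance types are placeholders):

  import boto3

  autoscaling = boto3.client("autoscaling")

  autoscaling.create_auto_scaling_group(
      AutoScalingGroupName="web-spot-asg",            # placeholder name
      MinSize=3,
      MaxSize=12,
      VPCZoneIdentifier="subnet-aaa,subnet-bbb",      # placeholder subnets
      MixedInstancesPolicy={
          "LaunchTemplate": {
              "LaunchTemplateSpecification": {
                  "LaunchTemplateName": "web-lt",     # placeholder launch template
                  "Version": "$Latest",
              },
              # several interchangeable types, so one spot pool drying up doesn't hurt
              "Overrides": [
                  {"InstanceType": "m7a.xlarge"},
                  {"InstanceType": "m6a.xlarge"},
                  {"InstanceType": "m6i.xlarge"},
              ],
          },
          "InstancesDistribution": {
              "OnDemandBaseCapacity": 1,
              "OnDemandPercentageAboveBaseCapacity": 0,
              "SpotAllocationStrategy": "capacity-optimized",
          },
      },
      CapacityRebalance=True,   # proactively replace spot capacity that AWS flags as at risk
  )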


Bandwidth would need to be compared and considered between EC2 and what they were able to negotiate for bare metal co-location.


Bandwidth is about 2 orders of magnitude cheaper off-cloud, even without any negotiation or commitment. How much do you have to commit for, e.g., CloudFront to pay 2 orders of magnitude less than their list price of $0.02 per GB?


To give you a ballpark: I colo metal servers at a few different data centers. A gig up is a few hundred dollars a month.

Let's say you're really pushing the connection and your p95 is 900 megabits up. That is ~$200 at the colo vs ~$8,200 at Amazon.
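Ballpark math behind the Amazon number, assuming the month's actual transfer works out to roughly 100 TB (average well below the 900 megabit p95) and roughly the published tiered egress rates:

  # AWS internet egress, roughly the published tiers, as (GB in tier, USD per GB)
  tiers = [(10_000, 0.09), (40_000, 0.085), (100_000, 0.07), (350_000, 0.05)]

  def egress_cost(gb):
      cost, remaining = 0.0, gb
      for tier_size, price in tiers:
          used = min(remaining, tier_size)
          cost += used * price
          remaining -= used
          if remaining <= 0:
              break
      return cost

  print(egress_cost(100_000))   # ~$7,800 for ~100 TB, the same ballpark as above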


It's not clear if they could have used Spot for most of the workloads; then they would be looking at a 70 to 80% price reduction over On-Demand. They don't clarify their requirements for RTO, because they are now at a single data center...


> savings plan even more which would apply to egress and storage

Wait, is this accurate?

If so, I need to sign our company up for a savings plan... now. We use RIs, but I thought savings plans only applied to instance cost and not bandwidth (and definitely not S3).


You got it right. They do not include traffic or S3.


Also, EKS (i.e. a managed service) is more expensive than renting EC2s and doing everything yourself, which is not that hard.


EKS's control plane consists of decent EC2 instances, ELB, ENIs distributed across multiple availability zones.

You're not saving anything doing it yourself.

And you've just given yourself the massive inconvenience of running a HA Kubernetes control plane.


Is $150 really that much when you are paying hundreds of thousands for nodes?


A lot of comments here seem to be along the lines of "you can hire one more engineer," but given the current economic situation remember that might be "keep one more engineer." Would you lay off someone on your team to keep AWS?

Keeping a few racks of servers happily humming along isn't the massive undertaking that most people here seem to think it is. I think lots of "cloud native" engineers are just intimidated by having to learn lower levels to keep things running.


> Keeping a few racks of servers happily humming along isn't the massive undertaking that most people here seem to think it is

Keeping them humming along redundantly, with adequate power and cooling, and protection against cooling- and power failures is more of an undertaking, though. Now you are maintaining generators, UPSs and multiple HVAC systems in addition to your 'few racks of servers'.

You also need to maintain full network redundancy (including ingress/egress) and all the cost that entails.

All the above hardware needs maintenance and replacement when it becomes obsolete.

Now you are good in one DC, but not protected against tornadoes, fire and flood like you would be if you used AWS with multiple availability zones.

So, you have to build another DC far enough away, staff it, and buy tons of replication software, plus several FTEs to manage cross-site backups and deal with sync issues.


You don't need to build your own datacenter unless your workload requires a datacenter's worth of hardware. Colocation is a feasible and popular option for handling all of the hands-on stuff you mention. Ship the racks to a colo center, they'll install them for you. Ship them replacement peripherals, and the operators will hot-swap them for you. If you need redundancy, that's just a matter of sending your hardware to multiple places instead of one. Slightly more involved, but it's hardly rocket science.


That all takes time. With cloud, you can have a system up and running in literal seconds, which is very nice when you find out that you severely underestimated how much traffic your web app will get.

But yeah, in the long run, colocation becomes significantly cheaper than cloud. You use AWS, and you'll find yourself paying $200/month for hardware you could buy once for $2,000.

Sometimes I think people forgot colocation is an option that exists.


Purely in hardware costs. More like $200/mo payment can be replaced by $100-130 in hardware.

For $600 I can get hardware that outperforms a $1200/mo ec2. Easy.


Do you realize how few pennies $200 a month is compared to $2,000 for any decent-sized company?


No, I'm afraid we can't do basic math. Nobody will ever know how long it takes for $200 a month to end up costing more than a single outlay of $2000.


I’m saying that little savings is a rounding error to any business of any size


He is giving an example for a single server. You can multiply both numbers by 100 if you want.


> Now you are maintaining generators, UPSs and multiple HVAC systems in addition to your 'few racks of servers'.

I don't mean to call you out specifically, but I believe your comment is a perfect example of how the majority of developers have no clue what being on bare metal actually involves. You literally need to do none of these things. Cloud vendors make it sound overly complex and the myth just kinda self perpetuates because nobody knows better.

If anyone in the Bay Area is considering the move out of the cloud and wants to see in person what is really involved, I might consider putting together a group tour of one of my rack locations.


What do your racks do if they lose site power?

What happens when your own HVAC dies and your DC has about 4 hours until it overheats?

(I'm a software engineer who previously built and maintained racks of bare metal. Never again).


Unless you are a double digit billion dollar company, you don't build your own datacenters. You lease space from a colocation provider like Equinix, CoreSite, or DRT. Even AWS and GCP themselves are partially hosted in buildings operated by these companies.

These datacenters are fed by multiple power substations, have onsite battery and generators, and contracts for delivery of fuel in the event of a disaster. But none of these things are any more your problem than if a power plant explodes knocking out an AWS region.


I was replying with my interpretation of what you said above. Did I misunderstand you when you said you don't need generators, UPS, and multiple HVAC systems? If I read it correctly, you're saying that developers have no clue what being on bare metal means, and that you don't need generators, UPS, or multiple HVAC. But then you said the colo facilities have those (along with other disaster plans).

> > Now you are maintaining generators, UPSs and multiple HVAC systems in addition to your 'few racks of servers'.

> I don't mean to call you out specifically, but I believe your comment is a perfect example of how the majority of developers have no clue what being on bare metal actually involves. You literally need to do none of these things.

By the way, I worked for a double digit billion dollar company that built its own datacenters as well as placing resources in colos. They started out purely in colos, and put several colos out of business over HVAC and power costs (back when rack space was billed by area, not cooling). Even after that, they stayed in colos, and when I worked there, we constantly had to deal with the unreliability of colos- not just that they were smaller, with less cooling, and inadequate power, but also because they often didn't actually fulfill their contractual requirements. ATL was a great example.

If colos work for you, that's great. I just don't think they are prepared to handle disasters nearly as well as the megascale cloud providers.


> Did I misunderstand you when you said you don't need generators, UPS, and multiple HVAC systems?

No, you seem to misunderstand how leased space works. If you rent a floor of an office building, the toilets flush without you having to own a water plant or redundant water pipes.

> when I worked there, we constantly had to deal with the unreliability of colos- not just that they were smaller,

It sounds like whoever was in charge of picking datacenters was shit at their job. The colocation market isn't what it used to be, and it isn't just a dude with some warehouse space and swamp coolers. Colos are publicly traded companies or REITs and have good SLAs.

> I just don't thinnk they are prepared to handle disasters nearly as well as the megascale cloud providers.

I worked for a megascale cloud provider. I'm intimately familiar with the nuts and bolts of a few others. Some of it is the big owned and operated campuses you see in the glossy brochure photos, but a substantial part is also in the same colos you can lease yourself. They don't bring in any additional cooling or power over and above what the datacenter provides.


You let the colo handle it.


Most of those requirements cease to exist if you decide to colo. It's not cloud or "run your own DC".


This is why you check out decent datacenters. A good DC is already BUILT above the 500 year or 1000 year flood plain, has N+1 generators, and tornadoes are not present in all locations.


You just come across as someone who is scared because you don't understand.

Seriously, a facility with multiple Internet connections, adequate power and cooling, passive cooling in case of outage, protection against power failures, adequate UPSes, backup generator power, and so on could be my Mom's house. There's nothing special about any of those things that makes them somehow frightening.

"buy tons of replication software"? What industry do you work in? Seriously, nobody who isn't in some clueless "enterprise" would pay good money for things that're widely available in open source.

I'm dismissive because these things aren't difficult if you've actually done them, so I can only assume you've never done them.


> I think lots of "cloud native" engineers are just intimidated by having to learn lower levels to keep things running.

Rightly so, because they're cloud native engineers, not system administrators. They're intimidated by the things they don't know. It'll be a very individual calculation whether or not it's worth it for your enterprise to organize and maintain hardware yourself.


There's certainly no shortage of sysadmins comfortable with their on-prem skillsets that have their heads in the sand about the cloud.

And there are plenty of us who've spent time managing hardware and physical networks, transitioned to cloud, and are very happy to not be looking back.


And in other places around the world those would be closer to 3 or 4 good engineers for the same money. And while each engineer costs some money, they probably bring in close to double of what they are being paid.


Especially given the low unemployment rate, laying somebody off seems quite risky; if it doesn't work out, you'll have trouble hiring a replacement, I guess.


The current hiring market in tech is the easiest (for employers) it has been in a really long time. It used to take 3-4 months to fill a role. In the current market it's more like 2-4 weeks.


Not for SRE and DBRE lol. 4-6 months easy.

I made the correct career choice. Downturn? What downturn?


The savings of one salary can probably be achieved by profiling and improving some software: moving to Graviton, finding and deleting a few expensive queries, adding a database index. Switching from the cloud to on-prem is a dramatic and expensive migration that changes a lot of constraints for your infra that you might not have needed to change in the first place.


Eh, to a degree; having to deal with failed hardware, and worse, buggy hardware, is just a pain and really time-consuming.


>Keeping a few racks of servers happily humming along isn't the massive undertaking that most people here seem to think it is.

It isn't until hardware failures happen, and that requires different skills to deal with effectively. Like the time a core networking switch presented with a dead PSU and no backups, so you Frankenstein it back to temporarily working with another switch to pull the config off of it.

With bare metal you may have to have more generalized, Jack of all trades staff to account for the unaccountable.


If you are not using cloud services, just virtual machines, the cloud is rarely worth it. And for a lot of people renting dedicated servers is the best business decision. It can be so much cheaper than EC2 that you solve your normal scalability problem by simply having a large amount of excess capacity. As you grow, some dedicated providers will give you an API to spin up a server in <120 seconds, so even adding capacity quickly becomes possible. If your load is extremely spiky then this is not going to work, but most people don't deal with loads like that -- or they have other mitigation strategies.

Always keep an eye on your business goals and not on the hype. For OneUptime, downtime is obviously a huge problem, but you'd be surprised for how many businesses it's much cheaper to be down for a few minutes here and there than to engineer a complex HA mechanism. The aforementioned spiking problem can often be solved cheaply by degrading the hot pages to static and serving them from a CDN (if you have a mechanism for doing this, of course). And so forth.
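For instance, a sketch of the "degrade hot pages to static" idea with Flask; the CDN in front only has to honor s-maxage:

  from flask import Flask, make_response

  app = Flask(__name__)

  @app.route("/popular-page")
  def popular_page():
      # render once, then let the CDN serve it for a minute instead of hitting the app on every request
      resp = make_response("<h1>hot content</h1>")
      resp.headers["Cache-Control"] = "public, s-maxage=60, stale-while-revalidate=300"
      return resp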

Remember KISS.


Moving from buying Ferraris to Toyota Camrys would save a lot of money too. These stories are always bs blog spam by companies trying to pretend they pulled off some amazing new hack. In reality they were burning cash because they hadn't the faintest idea how to control their spend.

  When we were utilizing AWS, our setup consisted of a 28-node managed Kubernetes cluster.
  Each of these nodes was an m7a EC2 instance. With block storage and network fees included,
  our monthly bills amounted to $38,000+
The hell were you doing with 28 nodes to run an uptime tracking app? Did you try just running it on like, 3 nodes, without K8s?

  When compared to our previous AWS costs, we’re saving over $230,000 roughly per year
  if you amortize the cap-ex costs of the server over 5 years.
Compared to a 5-year AWS savings plan? Probably not.

On top of this, they somehow advertise using K8s as a simplification? Let's rein in our spend, not only by abandoning the convenience of VMs and having to do more maintenance, but let's require customers to use a minimum of 3 nodes and a dozen services to run a dinky uptime tracking app.

This meme must be repeating itself due to ignorance. The CIOs/CTOs have no clue how to control spend in the cloud, so they rack up huge bills and ignore them "because we're trying to grow quickly!" Then maybe they hire someone who knows the cloud, but they tell them to ignore the cost too. Finally they run out of cash because they weren't watching the billing, so they do the only thing they are technically competent enough to do: set up some computers, install Linux, and write off the cost as cap-ex. Finally they write a blog post in order to try to gain political cover for why they burned through several headcounts' worth of funding on nothing.


> The hell were you doing with 28 nodes to run an uptime tracking app? Did you try just running it on like, 3 nodes, without K8s?

I've seen this before.

It happens when the people who manage the infrastructure are completely decoupled from the developers. The developers determine how much CPU / memory is "needed" for a given workload, and the infra team just adds it. Developers aren't responsible for infra costs. Infra team has limited ability to control costs.

Add a few years and teams with their own container-based applications, Kafka, Elasticsearch, multi-instance Postgres, etc., and soon enough, it's a 28 node cluster and costs are out of control. Infra team can only do so much, and devs aren't incentivized to help, either. Everyone's shrugging their shoulders because now it'll take significant refactoring and cross-silo work to actually fix.

If they told us what was running on those 28 nodes, we'd point that out immediately. But it's not just happening at this company. I've seen this pattern many times.


That's nutty spend for a company that tracks uptime. I ran a landing page hosting product with marketing automation with 10B daily events and 50B page views/day on 20 m3.larges. Our monthly AWS spend was like 20k.

All deployed using capistrano.


> The hell were you doing with 28 nodes to run an uptime tracking app?

To be fair, considering the pocket-calculator-grade performance you get from AWS (along with terrible IO performance compared to direct-attach NVME) I can totally understand they’d need 28 nodes to run something that would run on a handful of real, uncontended bare-metal hosts.


AWS IO is very good considering EBS is a networked drive. NVMe is available if you need it too.


What are you talking about. That's crazy spend for that app. AWS isn't that bad.


Weren't you searching for a colo provider just yesterday? That was a quick $230k! https://news.ycombinator.com/item?id=38275614


Apparently the move in this article is for their EU operations, now they're looking for a US colo provider to do the same there.

https://news.ycombinator.com/item?id=38280506


They probably are looking for one legitimately. It appears they now use Scaleway dedicated bare metal servers


Oof, this should be higher up. Things didn't quite add up before, and now it's even more fishy.


That's asking for the US, specifically.

Looks to me like they did this in Europe previously, and they are looking to do the same in the US now.


So, they essentially saved the cost of one good engineer. Question is, are they spending 1 man-year of effort to maintain this setup themselves? If not, they made the right choice. Otherwise, it’s not as clear cut.


This fiction remains that AWS requires no specialist expertise.

And your own computers require expertise so expensive and frightening that no sane company would host their own computers.

How Amazon created this alternate reality should be studied in business schools for the next 50 years. Amazon made the IT industry doubt its own technical capabilities so much that the entire industry essentially gave up on the idea that it can run computer systems, and instead bought into the fabulously complex and expensive and technically challenging cloud systems, whilst still believing they were doing the simplest and cheapest thing.


Amazon didn't create it. I was there for the mass cloud migrations of the last 15 years. It isn't that AWS requires no specialist expertise, it's that it's a certain kind of expertise that's easier to plan for and manage. Managing physical premises, hardware upgrade costs, etc. are all skills your typical devops jockey doesn't need anymore. Unless you're fine with hosting your company's servers under your desk, it's the hidden costs of metal that make businesses move to cloud.


Fortunately there are companies like Deft, OVH, Hetzner, Equinix, etc that handle all of that for you for a flat fee and while achieving economies of scale.

Colocation is rarely worth it unless you have non-standard requirements. If you just need a general-purpose machine, any of the aforementioned providers will sort you out just fine.


I agree. If you're doing general-purpose things, using a company like the ones you mentioned is just cloud but with extra steps


This is a strawman that keeps getting brought up, but nobody's claiming that. The difference remains, though, and the scale depends on what exactly you consider as an alternative. Renting a couple of servers will cost you in availability/resilience and extra hardware management time. Renting a managed rack will cost you in the above plus a premium on management. Doing things yourself will cost you in extra contracts / power / network planning, remote hands, and time to source your components.

Almost everything that the AWS specialist needs to know comes in after that and has some equivalent in bare metal world, so those costs don't disappear either.

In practice there are extra costs which may or may not make sense in each case. And there are companies that don't reassess their spending as well as they should. But there's no alternate reality really. (As in, the usually discussed complications of bare metal are not extremely overplayed)


AWS does require some expertise to master, considering the sheer number of products and options. Tick the wrong box and cost increases by 50%, etc.

Different solutions work best for different companies.


That's what he's saying.


>"This fiction remains that AWS requires no specialist expertise. And your own computers require expertise so expensive and frightening that no sane company would host their own computers."

Each of these statements is utter BS

PS. Oopsy I just read their third paragraph ;)


Read their third paragraph. They completely agree with you


LOL Sorry. I was shooting from the hip. Thanks.


That depends on where they are located. Good engineers aren't $230k/year everywhere.


I was curious about this too, but this company lists a range of $200-250k for remote.

https://github.com/OneUptime/interview/blob/master/software-...

Side note: I'm in slight disbelief at how high that salary range is compared to how minimal the job requirements are.


As someone that works somewhere that goes AWS first and is 100% cloud native, you'd have to give me a massive increase in salary to deal with hardware.

Been there, done that, don't care to return to that life.


That interview process seems … problematic. The whole thing gives really creepy vibes. It doesn’t sound like a super healthy place to work.


Americans get paid so much, got dayum.

Half that and half it again and I'd still be looking at a decent raise lmao


It includes an asterisk. Those salaries generally come with the reality of living in locales like the Bay Area or Seattle, with all the exorbitant costs of living in those areas.

A lot of companies (like Amazon) will gleefully slash your salary if you try to move somewhere cheaper, because why should we pay you more if you don't just need that money to fork over to a landlord every month?

There are also all the things Americans go without, like socialized healthcare. Even with their lauded insurance plans, they still pay significantly more for worse health outcomes than any other wealthy country.


Nah even with bay area costs Americans get paid much much more than elsewhere. I could easily double my salary by moving from the UK to San Francisco. House prices are maybe double too, but since they are only a part of your outgoings, overall you come out waaaay ahead.

Of course then I would have to send my kids to schools with metal detectors and school shooting drills... It's not all about the money.


The employer is basically subsidizing the mortgage, incentivizing workers to move to the most expensive locations, which makes these locations even more expensive.


90%+ of Americans have some form of health insurance, especially tech workers. And there are issues with socialized health care as well


Yet surprisingly the US has by far the highest government expenditure on healthcare globally.


Don't forget, some of the worst outcomes per dollar spent


But best outcome if you do have the money. I've been waiting 38 months for an elective surgery in Canada.


Canada is all kinds of mixed up on this front. On paper healthcare is pretty much perfect but if you dig in a bit you find a whole raft of things that simply don't work. Canadians don't go bankrupt because of a medical mishap, but they're much more likely to die while waiting for something they need.


That's about senior-level compensation even among most companies in the Bay Area. Only the extreme outliers with good performance on the stock market can be said to be significantly higher in TC.

Edit: even then, TC is tied to how the stock market is doing, and not paid out by the company directly, so it only makes sense to compare with base wage plus benefits.


TC for senior at big tech companies is over $300k. Over $400k for Facebook I hear.

It doesn't take an extreme outlier to get significantly above $250k.

> so it only makes sense to compare with base wage plus benefits.

Not really, when stock approaches 40%+ of compensation, and is in RSU with fairly fast vesting schedule.


Big tech companies are extreme outliers. There's only a small fraction of big tech compared to other companies.


That's fair, though they do collectively employ hundreds of thousands of engineers


1-2 weeks vacation if lucky, no overtime pay, 24/7 on-call (spread through the team), 10-12h days at many startups (and Amazon), minus healthcare co-pays, etc.


That (and the other replies, I don't want to spam up the joint replying to all) is a fair point

Just to mention, I hope I didn't come across as saying they're paid too much. I'm always astounded at how much cash is thrown about over the pond is all. Go get that bag, hell yeah! :)


Indeed. And it does of course need to be offset against the cost of living.


Likely including benefits in this figure.


Right, like AWS is set and forget.


Depending on what parts of AWS you use it is.

Fargate, S3, Aurora etc. These are managed services and are incredibly reliable.

A lot of people here seem to think these cloud providers are just a bunch of managed servers. It's far more than that.


Even the "easy" services like that have at least _some_ barrier to entry. IAM alone is a pretty big beast and I doubt someone whose never used AWS would grasp it their very first time logging into the web interface - and every service uses it extensively.

And then there's the question of whether you're going to use Terraform, Ansible, CloudFormation, etc or click through the GUI to manage things.

My point is, nothing in AWS is 100% turnkey like a lot of folks pretend it is. Most of the time, it's leadership that thinks since AWS is "Cloud" that it's as simple as put in your credit card and you're done.


IAM and IaC is only needed once you get to a certain size.

For smaller projects you can absolutely get away with just the UI.


IAM is absolutely NOT something you can just ignore unless you have a huge pile of cash to burn when your shit gets compromised.


I worked at a startup, hosted on AWS, that was deployed before EC2 IAM roles were a thing. We had the same AWS access key credentials deployed on every machine. Whenever an employee left, we had to rotate them all. Fun times.
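
These days the fix is roughly an instance role instead of static keys. A minimal sketch (the role/profile names, instance ID, and policies are just placeholders, not anyone's real setup):

    # create a role EC2 can assume, wrap it in an instance profile, attach it to the box
    aws iam create-role --role-name app-ec2-role \
      --assume-role-policy-document file://ec2-trust.json   # trust policy allowing ec2.amazonaws.com
    aws iam attach-role-policy --role-name app-ec2-role \
      --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess   # example permission only
    aws iam create-instance-profile --instance-profile-name app-ec2-profile
    aws iam add-role-to-instance-profile --instance-profile-name app-ec2-profile --role-name app-ec2-role
    aws ec2 associate-iam-instance-profile --instance-id i-0123456789abcdef0 \
      --iam-instance-profile Name=app-ec2-profile

The SDK/CLI on the instance then picks up short-lived credentials from the instance metadata service automatically, so there's nothing to rotate when someone leaves.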


You absolutely need IAM immediately, if you have any services talking to any other services.

You _should_ use IaC immediately as well, because the longer you delay, the more it's going to hurt when you finally do need it.


There are companies earning money by showing other companies how to reduce their AWS bill.


Exactly. A more accurate figure would be the difference between the work hours spent maintaining bare metal minus the work hours spent maintaining AWS. Impossible to know without internals but at least a point in favor of bare metal


Set and forget until you wake up to an astronomical bill one morning.


It has been for us. Then again, we only use predictable, sane services.


1 man-year effort is probably less than the effort of AWS, though. So a double win!

A bit in jest, but the places I've worked that moved to the cloud ended up with more people managing k8s and building a platform and tooling than when we had a simple in-house scp upload to some servers.


Even if they do spend 1 person-year of effort in maintenance they still may have made the correct choice. Having a good engineer on staff may have additional side benefits as well especially if they could manage to hire locally and that person's wages then contribute to the local economy. As you said though it's definitely not clear cut especially from a spectator's point of view.


The beauty of decisions like these is that it looks good on a bean counter's spreadsheet. The hours of human time they end up spending on its maintenance simply don't appear in that spreadsheet, but is gladly pushed onto everyone else's plates.


With opportunity cost it's multiples more. We don't hire people to break even on them, right; we hire to make a profit.


Is this not addressed in section 'Server Admins' of the article?


The post is super light on details, it's hard to visualize if it's worth it or not.

For examples:

- How much data are they working with? What's the traffic shape? Using NFS makes me think that they don't have a lot of data.

- What happens when their customers accidentally send too many events? Will they simply drop the payload? On bare metal they lose the ability to auto-scale quickly.

- Are they using S3 or not, if they are, did they move that as well to their own Ceph cluster?

- What's the RDBMS setup? Are they running their own DB proxy that can handle live switch-over and seamless upgrade?

- What's the details on the bare metal setup? Is everything redundant? How quickly can they add several racks in one go? What's included as a service from their co-lo provider?


It is not unlikely that an AWS-to-GCP migration would have saved them significant money too, in the sense that they likely reviewed and right-sized different systems.

I also would love to see a comparison done by a financial planning analyst to ensure no cost centres are missed. On prem is cheaper but only by 30 to 50%. That is the premium you pay for flexibility, which you can partly mitigate by purchasing reserved instances for multiple years.


> On prem is cheaper but only by 30 to 50%

Depending on use case.

If you have traffic which isn't consistent 24/7, then AWS Spot instances with Graviton CPUs will be cheaper than on-premises.

Because you have the ability to scale your infrastructure up/down in real time.


It's 2 separate issues:

Fluctuation in traffic is handled by auto-scaling.

Saving money on stateless (or quick-to-start) services is done with spot instances.


We're in the business of making money. Our product makes the money. Not our skills and abilities to manage hardware and IT infrastructure.

They have a decent business case, but I don't feel like they executed well to meet the real objective. They don't want to be in AWS since they're an uptime monitor and they want to alert on downtime on AWS. But they have a single rack of servers in a single location. A cold standby in AWS doesn't mean a ton unless they're testing their failovers there... which comes at quite the cost.

I've worked on-prem before. Now I work somewhere that's 100% AWS and cloud native. You'd have to pay me quite a bit more to go back to on-prem. You'd have to pay me quite a bit more to go somewhere not using one of the 3 major clouds with all their vendor specific technologies.

The speed is invaluable to a business. It's better to have elevated spend while trying to find good product market fit. I didn't understand this until I worked somewhere with a product the market wanted. > 50% growth for half a decade wouldn't have been possible on-prem. > 25% growth for a full decade wouldn't have been possible on-prem.

I've been with my current company from $80MM ARR to $550MM ARR. We've never breached more than 1.5% of income on total cloud spend. We've been told that's the lowest they've seen by everyone from AWS TAMs to VC/PE people. It's because we're cloud native, we've always been cloud native, and we're always going to be cloud native.

You've gotta get over "vendor lock-in". With our agility it's not really a thing. We could move to another cloud or on-prem if we really wanted. Wouldn't be a huge problem moving things service by service over... though we've had a few more recent changes that would be troublesome, we'd be able to work around them because we're architected well.


I’d say $500k per year on AWS is kind of within a dead man’s zone where if you’re not expecting that spend to grow significantly and your infra is relatively simple, migrating off may actually make sense.

On the other hand maintaining $100K a year of spend on AWS is unlikely to be worth the effort of optimizing and maintaining $1M+ on AWS probably means the usage patterns are such that the cloud is cheaper and easier to maintain.


In my experience amounts are meaningless; what counts is what kinds of services you need most. In my current org we use all 3 major public clouds + on-prem services, carefully planning what should go where and why.


How are such savings not obvious after putting the amounts in an Excel sheet, and spending an hour over it (and most importantly doing this before spending half a million/year on AWS)?


> and most importantly doing this before spending half a million/year on AWS

AWS is... incentivizing scope creep, to put it mildly. In ye olde days, you had your ESXi blades, and if you were lucky some decent storage attached to them, and you had to make do with what you had - if you needed more resources, you'd have to go through the entire usual corporate bullshit. Get quotes from at least three comparable vendors, line up contract details, POs, get approval from multiple levels...

Now? Who cares if you spin up entire servers worth of instances for feature branch environments, and look, isn't that new AI chatbot something we could use... you get the idea. The reason why cloud (not just AWS) is so popular in corporate hellscapes is because it eliminates a lot of the busybody impeders. Shadow IT as a Service.


Those busybodies are also there to keep rogue engineers from burning money on useless features (like AI chat bots) that only serve to bolster their promo packet...


> keep rogue engineers from burning money on useless features (like AI chat bots)

As someone who has worked on an AI chat bot I can assure you it does not come from engineers.

It's coming from the CFO who is salivating at the thought of downsizing their customer support team.


The other reply in this thread might beg to differ: https://news.ycombinator.com/item?id=38295849

I think the reality is no discipline, be it Engineering, Product or Finance, is immune to flights of fancy.


I literally started laughing at this. I worked at a bare-metal shop fairly recently and a guy on my team used a corporate credit card to set up an AWS account and create an AI chatbot.

The dude nearly got fired, but your comment hit the spot. You made my night, thank you.


That depends on how the incentive structures for your corporate purchase department are set up - and there's really a ton of variance there, with results ranging from everyone being happy in the best case to frustrated employees quitting in droves or the company getting burned at employer rating portals.


> That depends on how the incentive structures for your corporate purchase department are set up

Sure, but that seems orthogonal to the pros and cons of having more layers of oversight (busybodies, to use your term) on infra spend. Badly run companies are badly run, and I don't think having the increased flexibility that comes from cloud providers changes that.


So instead of burning $500k of engineering time whenever you need more resources, you're now saving $50k of AWS overspend.


I would be surprised if people didn't know that colocating was cheaper. I certainly evangelize it for workloads that are particularly expensive on AWS.

It's not entirely without downsides, though, and I think many shops are willing to pay more for a different set of them. It is incredibly rewarding work, though. You get to do magic.

* You do need more experienced people, there's no way around it and the skills are hard to come by sometimes. We spent probably 3 years looking to hire a senior dba before we found one. Networking people are also unicorns.

* Having to deal with the full, full stack is a lot more work, and needing to manage IRL hardware is a PITA. I hated driving 50 miles to swap some hard drives. Rather than using those nice cloud APIs, you are on the other side implementing them. And all the VM management software sucks in its own unique ways.

* Storage will make you lose sleep. Ceph is a wonder of the technological world but it will also follow you in a dark alleyway and ruin your sleep.

* Building true redundancy is harder than you think it should be. "What if your ceph cluster dies?" "What if your ESXi shits the bed?" "What if Consul?" Setting things up so that you don't accidentally have single points of failure is tedious work.

* You have to constantly be looking at your horizons. We made a stupid little doomsday clock web app where we put all the "in the next x days/weeks/months we have to do x or we'll have an outage" items, because it will take more time than you think it should to buy equipment.


It's great when you don't need instant elasticity and traffic is very predictable.

I think it's very useful for batch processing, especially owning a GPU cluster could be great for ML startups.

Hybrid cloud + bare metal is probably the way to go (though that does incur the complexity of dealing with both, which is also hard).


Cloud is putting spend decisions into individual EMs' or even devs' hands. With bare metal, one team ("infra" or whatever) will own all compute, and thus spend decisions need to be justified by EMs, which they usually don't like ;)


Bare-metal solutions save money, but are costly in terms of development time and lost agility. Basically, they have much more friction.


Huh, say wut?

I guess before Amazon invented "the cloud" there weren't any software companies...


AWS isn't just IaaS; it's also PaaS.

So it's a fact that for most use cases it will be significantly easier to manage than bare metal.

Because much of it is being managed for you, e.g. object store, databases, etc.


Setting up k3s: 2 hours

Setting up Garage for obj store: 1 hour.

Setting up Longhorn for storage: .25 hour.

Setting up db: 30 minutes.

Setting up Cilium with a pool of ips to use as a lb: 45 mins.

All in: ~5 hours and I'm ready to deploy and spending 300 bucks a month, just renting bare metal servers.

AWS, for far less compute and same capabilities: approximately 800-1000 bucks a month, and takes about 3 hours -- we aren't even counting egress costs yet.

So, for two extra hours on your initial setup, you can save a ridiculous amount of money. Maintenance is actually less work than AWS too.

(source: I'm working on a youtube video)
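
For the curious, the k3s piece really is about that small. A rough sketch using the documented quick-start script (hostnames and the token are placeholders; the object store, storage, db, and LB steps are their own installs and not shown):

    # on the server node
    curl -sfL https://get.k3s.io | sh -
    # print the join token the agents will need
    sudo cat /var/lib/rancher/k3s/server/node-token
    # on each agent node (server address and token are placeholders)
    curl -sfL https://get.k3s.io | K3S_URL=https://k3s-server-1:6443 K3S_TOKEN=<paste-token-here> sh -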


You should stick to making YouTube videos then.

Because there is a world of difference between installing some software and making it robust enough to support a multi-million-dollar business. I would be surprised if you can set up and test a proper highly available database with automated backups in < 30 mins.


When you are making multiple millions of dollars, that's when you spend the money on the cloud. There is a spot somewhere between "survive on cloud credits" and "survive in the cloud" where "survive on bare metal" makes far more sense. The transitions are hard, but it is worth it.


What do you mean? You install postgres on 2+ machines and configure them.

Installing software is exactly what multi-million dollar companies do.

Backups are not hard either. There are many open source setups out there, and building your own is not that complex.
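
For the "2+ machines" part, a minimal streaming-replication sketch (PostgreSQL 12+; the IPs, role name, and data directory are placeholders, and this doesn't cover automated failover, which is where tools like Patroni or repmgr come in):

    # on the primary: create a replication role and allow the standby in pg_hba.conf
    sudo -u postgres psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'change-me';"
    #   pg_hba.conf:  host  replication  replicator  192.0.2.20/32  scram-sha-256

    # on the standby: clone the primary; -R writes standby.signal and primary_conninfo for you
    sudo -u postgres pg_basebackup -h 192.0.2.10 -U replicator \
      -D /var/lib/postgresql/data -R -X stream -P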


Doing this right is easier said than done. I worked at one company that ran their own Postgres instances on EC2 instead of using RDS. Big mistake. The configuration was so screwed up they couldn't even take a backup that wasn't corrupted. They had "experts" working on this for months.


>Backups are not hard either.

I guess you never set one up, because I've seen numerous attempts at this and none took less than a month to do.


Backups are not hard. I've set them up, and restored from them, numerous times in my career. Even PITR is not hard.

It takes about 10-15 mins to apply the configs and install the stuff, then maybe another 10-15 minutes to run a simple test and verify things.
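
For what it's worth, the PITR part is mostly three pieces: WAL archiving, periodic base backups, and a restore procedure. A minimal sketch (paths and the timestamp are placeholders; in practice most people reach for pgBackRest or WAL-G rather than hand-rolling this):

    # postgresql.conf on the primary: continuous WAL archiving
    #   archive_mode = on
    #   archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'

    # periodic base backup (tar format, compressed)
    sudo -u postgres pg_basebackup -D /backup/base/$(date +%F) -Ft -z -X fetch

    # to restore to a point in time: unpack a base backup into a fresh data dir, then set
    #   restore_command = 'cp /backup/wal/%f %p'
    #   recovery_target_time = '2023-11-16 12:00:00'
    # create recovery.signal and start postgres; it replays WAL up to that point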


> You install postgres on 2+ machines and configure them.

Oh sweet summer child... let me know how that goes for you.


It's easy to stand this stuff up initially, but the real work is in scaling, automating, testing and documenting all of that, and many places don't have the people, skills or both to do all of that easily.

Also, with EKS, you get literally all of this (except Cilium and Longhorn, if you need that, which you don't if you use vpc-cni and eks-csi), in ~8 minutes, and it comes with node autoscaling, tie-ins into IAM and a bunch of other stuff for free. This is perfect for a typical lean engineering team that doesn't really do platform stuff but need to out of necessity and/or a platform team that's just getting ramped up on k8s.

You also don't need to test your automation for k8s upgrades or maintain etcd with EKS, which can be big time-savers.

(FWIW I love Kubernetes and have made courses/workshops of exactly this work)
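
For comparison, the EKS route is roughly one eksctl command. A sketch (cluster name, region, and sizes are arbitrary):

    # provisions the EKS control plane, a managed node group, and the supporting VPC/IAM pieces
    eksctl create cluster \
      --name demo \
      --region us-east-1 \
      --nodes 3 \
      --node-type m6i.large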


Except that Kubernetes, along with all the additional components that are needed (you mentioned Longhorn, a database such as PostgreSQL, Cilium, and others), has a multitude of ways to fail in unpredictable manners during updates, and there can be many hidden, minor bugs. The list of issues on GitHub for these projects is very lengthy.

I'm not suggesting that this isn't a viable solution, but I would prefer to be well-prepared and have a team of experts in their respective fields who are willing to have an on-call duty. This team would include a specialist for Longhorn or Ceph, since storage is extremely important, one to set up and maintain a high-availability PostgreSQL database with an operator and automated, thoroughly tested backups, and another for eBPF/Cilium networking complexities, which is also crucial because if your cluster network fails, it results in an immediate major outage.

Certainly, you can claim that you have sufficient experience to manage all these systems independently, but when do you plan to sleep if you're on call 24/7/365? Therefore, you either need a highly competent team of domain experts, which also incurs a significant cost, or you opt for cloud services where all of this management is taken care of for you. Of course, this service is already included in the price, hence it's more expensive than bare-metal.


I feel like this is a pretty short-sighted solution.

> you either need a highly competent team of domain experts, which also incurs a significant cost

You can build this team and domain knowledge from the ground up. Internal documentation goes a long, long way too. From working at bare-metal shops off and on over the last decade, I've found that documentation capturing the why and the how is extremely important.

> you opt for cloud services where all of this management is taken care of for you

Is it though? One of the biggest differences between a bare-metal shop and a cloud shop is when shit hits the fan. When the cloud goes down, everyone is sitting around twiddling their thumbs while customer money flies away. There isn't anything anyone can do. Maybe there will be discussions about how we should make the staging/test envs in more regions ... but that's expensive. When the cloud does come back, you'll be spending more hours possibly rebooting things to get everything back into a good state, maybe even shipping code.

When bare-metal goes down, a team of highly competent people, who know it in-and-out are giving you minute-by-minute status updates, predictions on when it will be back, etc. They can even bring certain systems back online in whatever order they want instead of a random order. Thus you can get core services back online pretty quickly, while everything runs degraded.

Like all things in software, there are tradeoffs.


Based on my experience, uptime for the Generally Available (GA) services in the cloud typically ranges between 99.9% and 99.95% (a maximum of roughly 8 hours and 46 minutes of downtime per year in the 99.9% case) for a single region. This aligns with my long-term experience as a Google Cloud Platform (GCP) user. If you use preview or beta versions of services, the reliability may be lower, but then you're taking on that risk yourself.

Should you require greater than 99.95% uptime for particularly critical operations, then opting for a multi-region approach, such as using multi-region storage buckets, is advisable. It's also worth mentioning that I have never experienced a full 8 hours of downtime at once in a given year. It has usually been a case of increased error rates or heightened latency due to the inherent redundancy provided by availability zones within each region. Just make sure your network calls have a retry mechanism and you should be fine in almost all cases.


On my first day ever in production on AWS, we had an entire 9 hours of downtime[1]... so maybe I'm a bit biased, especially because we had another one not too long after that one[2] on Christmas freaking Eve. Prior to moving to AWS, resolution timelines were able to be passed to stakeholders within minutes of discovering the issue. After moving to AWS, we played darts and looked like fools because there was nothing we could do while the company hemorrhaged money.

The cloud is much more mature as is dev-ops in general, these days, but major outages still happen. If you run your own cloud, you'll still have major outages. You can't really escape them.

If you have the expertise, or can get the expertise, to do it in-house, you should do it. Just look to the US and its inability to build an inexpensive rocket, or even just manufacture goods. They outsourced everything (basically) to the point where they are reliant on the rest of the world for basic necessities.

You gotta think long-term, not short-term.

[1]: https://aws.amazon.com/message/680342/

[2]: https://aws.amazon.com/message/680587/


Setting up a db with backups, PITR, and periodic backup tests in 30 minutes? You made my day :) Also, how about controlling who has access to servers, or tamper protected activity logs? That's just scratching the surface.


Yeah, it is pretty straightforward to set it up in about 30-45 minutes. I have a cookbook I've been building since 2004, and I keep it updated. Maybe one day I'll publish it, but mostly, it is for me.

> how about controlling who has access to servers

It depends on what you want to control access to, and how. If you are talking about SSH, you can bind accounts to a GitHub team with about 20 lines of bash deployed to all the servers. Actually, I have a DaemonSet that keeps it updated.

Thus only people in my org, on a specific team, can access a server. If I kick them out of the team, they lose access to the servers.

I have something similar for k8s, but it is a bit more complicated and took quite a bit of time to get right.
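
Roughly the kind of script I mean, as a simplified sketch (org/team/user names are placeholders, it ignores API pagination, the token needs read:org, and it's meant to run as root from cron or a daemon):

    #!/usr/bin/env bash
    # sync authorized_keys for a shared deploy user from a GitHub team
    set -euo pipefail
    ORG="my-org"; TEAM="ops"; DEPLOY_USER="deploy"   # placeholders
    TMP=$(mktemp)
    # list team members, then pull each member's public keys from github.com/<user>.keys
    curl -fsS -H "Authorization: Bearer $GITHUB_TOKEN" \
      "https://api.github.com/orgs/$ORG/teams/$TEAM/members" |
      jq -r '.[].login' |
      while read -r login; do
        curl -fsS "https://github.com/$login.keys"
      done > "$TMP"
    # only swap the file in if we actually fetched something
    if [ -s "$TMP" ]; then
      install -m 600 -o "$DEPLOY_USER" -g "$DEPLOY_USER" "$TMP" "/home/$DEPLOY_USER/.ssh/authorized_keys"
    fi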

> tamper protected activity logs?

    sudo chattr +au /path/to/file

will force Linux to allow the file to be read, but written only in append mode, and deletions are effectively soft-deletes (assuming you are using a supported filesystem). Things like shell history files, logs, etc. make a ton of sense to set that way. There are probably a couple of edge cases I'm forgetting, but that will get you at least 80% there.
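
Illustrative example (needs root and an ext* filesystem; the path is just an example, not a recommendation):

    sudo chattr +a /home/deploy/.bash_history   # append-only: can be read and appended to, not truncated or rewritten
    lsattr /home/deploy/.bash_history           # verify the 'a' flag is set
    # clearing the flag again needs CAP_LINUX_IMMUTABLE (effectively root),
    # so a compromised unprivileged account can't easily scrub its own history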


Debugging longhorn outages on a monthly basis: priceless.


I haven’t had any issues with longhorn in a long time. Most issues I run into are network related, in fact, I tend to find a lot of cilium bugs (still waiting on one to be fixed to upgrade, actually).


I hope you like slogging through the depths of Kubernetes api-resources and trudging through a trillion lines of logs across a million different pods


Because the amount of time your engineers will spend maintaining your little herd of pet servers, and the opportunity cost of not being able to spin up managed service X to try an experiment, are not measurable.


> maintaining your little herd of pet servers

You know bare metal can be an automated fleet of cattle too, right?


Have you ever heard of PXE boot? You should check out Harvester, made by Rancher (IIRC). Basically, manage bare metal machines using standard k8s tooling.


In my opinion the story here is that AWS allowed them to quickly build and prove out a business idea. This in turn afforded them the luxury of making this kind of switch. Cloud did its job.


It's a good take, and one that's important to keep in mind when you're engineering your stack to be fully dependent on your chosen cloud vendor.


this is a very good take on the situation.


These real world stories are always fun to hear because they trigger such defensiveness from those who are invested in "cloud computing". This so-called cloud computing seems to have a strong need for a cheering section (hype) and a jeering section (sniping anyone who questions its value). Bring on the jeers.


I look at this differently. On-prem proponents always fail to account for hard problems they absolutely have to solve, like tamper-protected access logs, centralized server access controls, replacing failed hardware, backups, and secure networking between servers in a colocation facility (are you sure a neighbor cannot do ARP poisoning and hijack the traffic?). All this will bite them sooner or later, especially if they operate $250k worth of hardware. But because it does not bother them at the moment, they can claim they don't need this or that it is not a problem, thus rendering cloud as expensive.


Curious how much was spent on the migration? I skimmed but didn't see that number.

> Server Admins: When planning a transition to bare metal, many believe that hiring server administrators is a necessity. While their role is undeniably important, it’s worth noting that a substantial part of hardware maintenance is actually managed by the colocation facility. In the context of AWS, the expenses associated with employing AWS administrators often exceed those of Linux on-premises server administrators. This represents an additional cost-saving benefit when shifting to bare metal. With today’s servers being both efficient and reliable, the need for “management” has significantly decreased.

This feels like a "famous last words" moment. Next year there'll be 400k in "emergency Server Admin hire" budget allocated.


And now their blog won’t load


I've said this before: unless you are using specific AWS services, I think it is a fool's errand to use it.

Compute, storage, database, networking - you would be better off using DigitalOcean, Linode, Vultr, etc. They're so much cheaper than AWS, with lots of bandwidth included rather than the extortionate $0.08/GB egress.

Compute is the same story. A 2 vCPU, 4 GB VPS is ~$24. The equivalent instance (after navigating the obscured pricing and naming scheme), the c6g.large, is double the price at ~$50.

This is the happy middle ground between bare metal and AWS.


Can anyone comment on the server lifetime of 5 years? I would think it's on the order of 8-10 years these days? Cores don't get that much faster; you just get more of them, etc.


Server hardware is pricey, and if it's not pricey it fails often. 5 years would be a very good lifetime for a heavy use server, especially HDs.


OK, that is just FUD, and I was asking about the servers themselves - disks can be swapped.

> 5 years would be a very good lifetime for a heavy use server, especially HDs.

Read the Backblaze report - a lot of their HDs are over 8 years old and the AFR is less than 2%. SSDs will actually fail faster under heavy write load, around 5 years, yes.


Yeah, it's easy to swap out. I'm just pointing out that it's not the same as having a gaming PC at home.


Disclaimer: Former AWS consultant from banking/Wall St.

It's been a big "duh" for 20+ years. Large-scale, consistent loads aren't suited to cloud infrastructure. Mostly its shops that don't care about costs, don't know any better, or lack technical capabilities outsource most of their infrastructure.

The use-cases for *aaS are:

- Early startups

- Beginning projects

- Prototyping

- Peaky loads, on-demand or one-off

- Evade corporate IT department

AWS and VPSes can also be ill-suited for personal use if you live in a major city with bottom-tier, cheap datacenters that will rent you a 1 GbE uplink, a PDU plug, and 4U. For anything substantial, it's not hard to lease some dark fiber and run (E)BGP.

https://www.ripe.net/manage-ips-and-asns/as-numbers/request-...


The math I have always seen is cloud is around 2.5x more expensive than on-prem UNLESS you can completely re-architect your infra to be cloud native.

Lift and shift is brutal and doesn't make a lot of sense.


> The math I have always seen is cloud is around 2.5x more expensive than on-prem UNLESS you can completely re-architect your infra to be cloud native.

And at this point you are completely locked in.


Now get rid of k8s, put the system back together into a monolith, optimize it, get 3 Mac minis with gigabit Ethernet, and save yourself another $65k. It's an uptime monitoring tool...


The key point is per year - ongoing saving every year.


I would go with managed bare metal; it's a step up cost-wise from unmanaged bare metal, but it saves you headaches from memory, storage, network, etc. issues.


Very nice! (Hard time finding the author/contact info, though...)

At "Storage and LoadBalancers" the NFS link points to https://microk8s.io/docs/nfs - it should be https://microk8s.io/docs/addon-nfs


If you're using AWS primarily as a place to run virtual machines, you probably should be somewhere else. The largest benefits appear when you primarily use cloud-native things: lambdas, sqs, dynamodb, etc. Of course these will constitute a vendor lock-in, but that's usually an acceptable compromise.


I'm in disbelief at how much people pay for AWS. At a startup the costs were close to $500 per dev account, just to keep a few Lambda functions and Dynamo tables.

Rolling a droplet, load balancer and database costs like $30.


Cheaper price, lower redundancy:

>single rack configuration at our co-location partner

I've got symmetrical gigabit with a static IPv4 at home... so I can murder commercial offerings out there on bang/buck for many things. Right up until you factor in reliability and redundancy.


Me too. I have an entire /24 routed to my home data center!


Can anyone explain or shed more light on how the "remote hands" part of the support would work in a bare-metal server scenario? Would they run commands you give, follow your runbook, something else, or all of the above?


There are two types of remote hands: dumb and smart (under various names)

Most colocation facilities include dumb hands for free. Push a button, tell me what lights are on, plug in a monitor and tell me what it says, move the network cable from port 14 to 15, replace the drive with the spare sitting in the rack, etc.

Smart hands are billed around $100-$300/hr or a flat rate per task from a menu. Write an image to a USB stick and reinstall the OS on a server. Unrack and replace a switch. Figure out which drive has failed in the server and replace it. etc.

I've run computers in datacenters for 20+ years and maybe used smart hands 1 or 2 times.


Are they accounting for all of the labor costs of managing stuff in their stack that they previously deferred to AWS, like OS imaging/patching, backups, and building your own external services instead of whatever AWS had available?


I'm sure if the stack is simple enough, it's not too difficult for most senior-plus engineers to figure out the infrastructure.

I've definitely seen a lot of over-engineered solutions built in the chase of ideals or promotions.


It actually saves less if you spread the cost of developing the transition, plus ongoing op costs, over 5 years. Hidden costs.


Oh look at that. Another company who doesn’t know how to use cloud complaining about paying too much.


What is the market distortion that allows AWS margins to remain so high? There are two major competitors (Azure, GCP) and dozens of minor ones.

It seems crazy to me that the two options are AWS vs bare metal to save that much money. Why not a moderate solution?


AWS, Azure, and GCP are already competing in terms of pricing and customer acquisition. If you mention to Google that you are considering moving your multi-million dollar project to AWS due to a better price point, you will almost certainly receive a counteroffer. However, it's not just the big players in the game; there are also many smaller cloud providers like DigitalOcean, which are indeed a bit cheaper, but not as affordable as bare-metal solutions (e.g., Hetzner).

I came to the conclusion that when you factor everything in—the time it takes to maintain such a massive infrastructure operating smoothly across the globe, investment in the further development of cloud services, employing security experts, paying developers competitive salaries, and of course aiming for a profit margin for the company—you end up with the prices that are evident among the major cloud providers. It simply isn't feasible to offer these services for much less (there are a few exceptions people rightly complain about, such as stupidly high egress costs). It has reached a point where Google, for example, has attempted to undercut prices to such a degree that their cloud operations have been running at a loss in the past[^1].

[^1]: https://www.ciodive.com/news/google-cloud-revenue-Q2-2022/62... (read last paragraph)


Then you need to train people on Azure, GCP, etc.

It's probably easier to optimize your stack TBH. I can't wait until I get a chance to use ARM on AWS.


ServeTheHome has also written[0] a bit about the relative costs of AWS vs. colocation. They compare various reserved instance scenarios and include the labor of colocation. TL;DR: it's still far cheaper to colo.

[0] https://www.servethehome.com/falling-from-the-sky-2020-self-...


Would anyone be interested in an immutable OS with built-in configuration management that works the same (as in the same image) in the cloud and on-premises (bare metal, PXE, or virtual)? Basically, using this image you could almost guarantee everything runs the same way.


Yes - I would be interested :) my issue is that there is a mixed workload of centralised cloud compute and physical hardware in strange locations. I want something like Headscale as a global mesh control plane and some mechanism for deploying immutable flatcar images that hooks into IAM (for cloud) and TPM (for BM) as a system auth mechanism.


My email is in my profile if you want to discuss this more...!



I thought CoreOS was dead


Fedora CoreOS and Flatcar are Fedora- and Gentoo-based descendants and are actively maintained.

The last FCOS release was a week ago - https://fedoraproject.org/coreos/release-notes?arch=x86_64&s...


Who would have thought that vendor lock-in results in exorbitant spend!


What vendor lock-in? They built their stack from the beginning to allow for hosting on bare metal, and then they made that transition.


Looks like they moved to Scaleway bare metal


Sorry, but if you're mainly using AWS for compute (ie EC2) and scale minimally you're doing it wrong and probably burning a whole lot of money.


Now a set of Linux machines is considered bare metal?

I was under the impression that bare metal means "no OS".


The term bare metal in this context means installing and managing Linux (or your OS of choice) directly on hardware, without a hypervisor. It has never meant no operating system.


In this context it's running without the use of virtualisation


Yes, you should do your own cost optimisation, which also has its own cost.


So the salary of 1 dev?


bare metal is cheaper than aws?! wild!

edit: when you selectively choose your data points, and ignore human and migration costs.


Nines of uptime?


I don't think they care, since they moved to a single-rack, single-colocation setup. No idea who saw that deployment plan and said "yeah, this is sensible, I'll approve it."

They're selling an observability solution...


It feels like every commenter on this article didn't read past the first paragraph. Every comment I see is talking about how they likely barely made any money on the transition once all costs are factored in, but they explicitly stated a critical business rationale behind the move that remains true regardless of how much money it cost them to transition. Since they needed to function even when AWS is down, it made sense for them to transition even if it cost them more. This may increase the cost of running their service (though probably not), but it could make it more reliable, and therefore a better solution, making them more money down the line.


> Since they needed to function even when AWS is down

AWS as a whole has never been down.

It's Cloud 101 to architect your platform to operate across multiple availability zones (data centres). Not only to insulate against data centre specific issues (e.g. fire, power), but also against AWS backplane software update issues or cascading faults.

If you read what they did it's actually worse than AWS because their Kubernetes control plane isn't highly-available.


People often learn lessons the hard way: they will keep saving $230k/yr until one day their non-HA bare metal goes down and major customers leave.


> We have a ready to go backup cluster on AWS that can spin up in under 10 minutes if something were to happen to our co-location facility.

Sounds like they already have their bases covered.


They still need to synchronise data, update DNS records, and wait for TTLs to expire.

HA architectures exist for a reason: that last step is a massive headache.


They need to do fire drills and practice this, maybe daily or at least weekly? Failover should be a normal case. Can't you do failovers in DNS?


Yes, you can do it in DNS. Update the record with your new ingress, then wait for the timeout on the old record to assert itself and the new connections move over.

Not all DNS servers properly observe caching timeouts, so some customers may experience longer delays before they see it working again.
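
A quick way to see the exposure is just checking the TTL being served (the name, address, and output below are purely illustrative):

    $ dig +noall +answer status.example.com A
    status.example.com.  300  IN  A  203.0.113.10
    # 300 = seconds a well-behaved resolver may keep serving the old answer after
    # you change the record; lowering TTLs well before a planned cutover shrinks that window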


A significant percentage of users will still have their DNS resolver chain caching the old host.

Because TTLs are a guide, not mandatory, and many companies/ISPs ignore them for cost reasons.


>It's Cloud 101 to architect your platform to operate across multiple availability zones (data centres)

A huge multi-billion-dollar company with "cloud" in its name recently had a big outage because it did not follow "cloud 101".


Some AWS outages have affected all AZs in a given region, so they aren't always all that isolated. For this reason many orgs are investing in multi-cloud architectures (in addition to multi region)


I'm not convinced of the critical business rationale. Your single data center is much more likely to go down than a multi-AZ AWS deployment. The correct business rationale would be to go multi-cloud.


For what it's worth I'm not either and absolutely agree with your point, it just felt like that was the more important argument but not the one people were engaging with.


You can use multiple availability zones and, if needed, even multi-cloud. If you own the hardware, you do regularly need to test the UPS to ensure there is a graceful failover in case of a power outage. Unless, of course, you buy the hardware already hosted in a data centre.


Cue the inevitable litany of reasons why it is wrong to move out of “the cloud” in 3… 2… 1…


And these savings will be passed down to your customers too? Or....?


why should they be?


Only if they have a competitor with a better deal.


Cool, so you can hire one additional engineer. Are you sure your bare metal setup will occupy less than a single engineer’s time?


I'm not so sure it's a zero-sum kind of thing. Yes, it seems likely that they are paying at least one full-time employee to maintain their production environment. At the same time, AWS isn't without its complications; there are people who are employed specifically to babysit it.


Sure people need to babysit AWS resources, but those resources still exist and need babysitting. The bare metal approach is purely additive. E.g. if you need someone to babysit EKS, then you also need someone to babysit your bare metal k8s setup, but likely the bare metal k8s setup is even more work.


AWS was operated by holy ghost I assume?


Ha, while I'd work for that salary, even half of it would almost double what I make as a sysadmin. I guess I work here for the mission. Plus I don't have to deal with cloud services except for when the external services go down. Our stuff keeps running though.


I’m including overhead in that number. FWIW I know many ICs earning over double that number, not even including overhead.


>FWIW I know many ICs earning over double that number, not even including overhead.

Keep rubbing salt lol I live in a low cost area though. It's even pleasant some times of the year.


They can fire the "AWS engineer".


I had an out-of-touch cofounder a few years back. He asked me why the coworking space's hours were the way they were, before interjecting that companies were probably managing their servers up there at those later hours.

like, talk about decades removed! no, nobody has their servers in the coworking space anymore sir.

nice to see people attempting a holistic solution to hosting though. with containerization redeploying anywhere on anything shouldn't be hard.



