
The original answer to "why does FastMail use their own hardware" is that when I started the company in 1999 there weren't many options. I actually originally used a single bare metal server at Rackspace, which at that time was a small scrappy startup. IIRC it cost $70/month. There weren't really practical VPS or SaaS alternatives back then for what I needed.

Rob (the author of the linked article) joined a few months later, and when we got too big for our Rackspace server, we looked at the cost of buying something and doing colo instead. The biggest challenge was trying to convince a vendor to let me use my Australian credit card but ship the server to a US address (we decided to use NYI for colo, based in NY). It turned out that IBM were able to do that, so they got our business. Both IBM and NYI were great for handling remote hands and hardware issues, which obviously we couldn't do from Australia.

A little bit later Bron joined us, and he automated absolutely everything, so that we were able to just have NYI plug in a new machine and it would set itself up from scratch. This all just used regular Linux capabilities and simple open source tools, plus of course a whole lot of Perl.
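
For the curious, "sets itself up from scratch" on bare metal usually boils down to network-booting an installer and letting a bootstrap script converge the box. A minimal sketch (the subnet, paths, URL, and package list here are hypothetical, and this is plain dnsmasq plus bash rather than FastMail's actual Perl tooling):

  # On the provisioning host: dnsmasq serves DHCP + TFTP so a freshly racked box
  # network-boots the installer (hypothetical subnet and paths).
  printf '%s\n' \
    'dhcp-range=10.0.0.100,10.0.0.200,12h' \
    'dhcp-boot=pxelinux.0' \
    'enable-tftp' \
    'tftp-root=/srv/tftp' > /etc/dnsmasq.d/pxe.conf
  systemctl restart dnsmasq

  # In the installer's post-install hook: pull the host's role config from an
  # internal server (hypothetical URL) and enable SSH so the automation can take over.
  apt-get update && apt-get install -y openssh-server rsync perl
  rsync -a rsync://config.internal/roles/"$(hostname -s)"/ /etc/
  systemctl enable --now ssh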

As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing and features and kept wondering what I was missing. They seemed orders of magnitude more expensive for something that was more complex to manage and would have locked us into a specific vendor's tooling. But everyone seemed to be flocking to them.

To this day I still use bare metal servers for pretty much everything, and still love having the ability to use simple universally-applicable tools like plain Linux, Bash, Perl, Python, and SSH, to handle everything cheaply and reliably.

I've been doing some planning over the last couple of years on teaching a course on how to do all this, although I was worried that folks are too locked in to SaaS stuff -- but perhaps things are changing and there might be interest in that after all?...



>As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing and features and kept wondering what I was missing. They seemed orders of magnitude more expensive for something that was more complex to manage and would have locked us into a specific vendor's tooling. But everyone seemed to be flocking to them.

In 2006, when the first AWS instances showed up, it would take you two years of on-demand bills to match the cost of buying the hardware from a retail store and running it continuously.

Today it's anywhere from two weeks for ML workloads to three months for mid-sized instances.

AWS made sense in big corps when it would take you six months to get approval to buy the hardware and another six for the software. Today I'd only use it for a prototype that I'd move on-prem the second it looks like it will make it past one quarter.
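
As a back-of-the-envelope check (the numbers below are placeholders, not quotes; plug in the actual instance rate and hardware price you're comparing):

  # Days until cumulative on-demand rent equals the one-off hardware price.
  # Placeholder numbers for illustration only.
  hw_cost=15000        # hardware purchase price, USD
  hourly_rate=8.00     # comparable on-demand instance, USD/hour
  echo "scale=1; $hw_cost / ($hourly_rate * 24)" | bc    # => ~78 days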


AWS is useful if you have uneven loads. Why pay all year for the number of servers you need at Christmas? But if your load is more even, it doesn't make as much sense.


The business case I give is a website which has a predictable spike in traffic which tails off.

In the UK we have a huge charity fundraising event called Red Nose Day and the public can donate online (or telephone if they want to speak to a volunteer).

The website probably sees 90% of its traffic on the day itself - millions of users - with the remaining 10% tailing off over the next few days. Then nothing.

The elasticity of the cloud allows the charity to massively scale their compute power for ONE day, then reduce it for a few days, and drop back down to a skeleton infrastructure until the next event - in a few years' time.

(FWIW I have no clue if Red Nose Day ever uses the cloud but it's a great example of a business case requiring temporary high capacity compute to minimise costs)
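
For what it's worth, that one-day spike maps almost directly onto scheduled scaling actions. A rough sketch with made-up group names, dates, and sizes (a real deployment would also need warm-up, database capacity, payment providers, etc.):

  # Scale the web tier up for the event day, then back down a few days later
  # (hypothetical Auto Scaling group name, dates, and capacities).
  aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name donations-web \
    --scheduled-action-name event-day-scale-up \
    --start-time 2025-03-21T06:00:00Z \
    --min-size 50 --max-size 200 --desired-capacity 100

  aws autoscaling put-scheduled-update-group-action \
    --auto-scaling-group-name donations-web \
    --scheduled-action-name post-event-scale-down \
    --start-time 2025-03-24T06:00:00Z \
    --min-size 2 --max-size 10 --desired-capacity 2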


But how does it look from aws point of view?

Everyone scales up around Christmas then scales down afterwards. What do THEY do with all the unneeded CPU-seconds for the rest of the year?


Only consumer businesses scale up for the holidays. Most other industries scale down. The more companies they have, the more even the overall demand is for them.

Also, every unused resource goes into the spot market. They just have a bigger spot market during the year.

And lastly, that's why they charge a premium. Because they amortize the cost of spare hardware across all their customers.


We certainly don't scale up around Christmas. Apart from online shops and shipping companies, why would everyone else scale up around Christmas?


Not everyone. Agriculture is in a slow period around then and scales way back. I don't know what other industries are like.


This.

Plus, bidding on spot instances used to be far less gamed, so if you had infrequent batch jobs (just an extreme version of low-duty-cycle loading), there was nothing cheaper or easier.

I've been out of that "game" for a bit, but Google Compute used to have the cheapest bulk-compute instance pricing if all you needed was a big burst of CPU.

It's all changed if you're running ML workloads though.
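
For the infrequent-batch-job case, the pattern is still basically "ask for spot capacity, run, terminate". A minimal sketch (the AMI, instance type, and key name are placeholders):

  # One-off batch job on a spot instance (placeholder AMI/type/key).
  aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type c7i.4xlarge \
    --key-name batch-key \
    --count 1 \
    --instance-market-options '{"MarketType":"spot","SpotOptions":{"SpotInstanceType":"one-time"}}'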


AWS was built on hordes of VC backed startups drowning in heaps of cash and very little operational expertise.


And now we've got a generation of IT professionals coming up the ranks who have no idea how to operate their own infrastructure.


"buying the hardware from a retail store." Never buy wholesale and never develop on immature hardware, I have seen c** with multiple 9 y.o. dev servers. I could shorten the ROI to less than 6 months.


What is c*? (seriously, I am not a native speaker and cannot turn the stars into a word that makes sense)


No worries, I AM a native speaker and I can't figure out the stars OR the specific parsing of that comment


> As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing and features and kept wondering what I was missing.

You are not the only one. There are several factors at play, but I believe one of the strongest today is the generational divide: people have lost the ability to manage their own infra, or don't know it well enough to do it well, so it's true when they say "It's too much hassle". I say this as an AWS guy who occasionally works on on-prem infra.[0]

[0] As a side note, I don't believe the lack of skills is the main reason organizations have problems - skills can be learned, but if you mess up the initial architecture design, fixing that can easily take years.


> I don't believe the lack of skills is the main reason organizations have problems

IDK. More and more I see the argument of “I don’t know, and we are not experts in xxx” as a winning argument for why we should just spend money on 3rd-party services and products.

I have seen people getting paid 700k plus a year spend their entire stay at companies writing papers about how they can’t do something and the obvious solution is to spend 400k plus to have some 3rd party handle it, and getting the budget.

Let’s not get into what the conversation looks like when somebody points out that we might have an issue if we are paying somebody 700k to hire somebody else temporarily for 400k each year, and that we should find these folks who can do it for 400k and just hire Them.

All this to say that being a SWE in many companies today requires no ability to create software that solves business problems, but rather being some sort of quasi-system-administrator-manager who will maybe write a handful of DSL scripts over the course of their career.


It’s also human capital/resource allocation. We thought about spinning up our own servers at my last gig; we had the talent in house but that talent was busy building the product, not managing servers. I suppose it depends on what your need is as well.


I see your point but my perspective on this shifted over the years. Whatever infra you set up, whether it's the public cloud or on prem, there is always the initial cost (starting with a simple account for small orgs, a landing zone for larger ones, etc.) and this applies to every service; it's just registered in the books in a slightly different way. For example, when you look at my Jira tickets, on prem we're patching servers, and in the cloud we're usually updating container images. These two are not that different and you need to set aside some time for that. It's the same with upgrading Postgres on prem and RDS Postgres between major versions - you need to arrange the service window with product teams, do the migration on lower layers first and if all goes well you move on to prod.

Of course, many infra activities take less time in the public cloud. E.g. control plane maintenance and upgrades on EKS are managed by AWS and are mostly painless so you never worry about stuff like etcd. On the other hand, there is a ton of stuff you need to know anyway to operate AWS in a proficient and safe way so I'm not convinced the difference is that huge today.


>As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing and features and kept wondering what I was missing. They seemed orders of magnitude more expensive [...] To this day I still use bare metal servers for pretty much everything, [...] plain Linux, Bash, Perl, Python, and SSH, to handle everything cheaply

Your FastMail use case of (relatively) predictable server workload and product roadmap combined with agile Linux admins who are motivated to use close-to-bare-metal tools isn't an optimal cost fit for AWS. You're not missing anything and FastMail would have been overpaying for cloud.

Where AWS/GCP/Azure shine is organizations that need higher-level PaaS like managed DynamoDB, RedShift, SQS, etc that run on top of bare metal. Most non-tech companies with internal IT departments cannot create/operate "internal cloud services" that's on par with AWS.[1] Some companies like Facebook and Walmart can run internal IT departments with advanced capabilities like AWS but most non-tech companies can't. This means paying AWS' fat profit margins can actually be cheaper than paying internal IT salaries to "reinvent AWS badly" by installing MySQL, Kafka, etc on bare metal Linux. E.g. Netflix had their own datacenters in 2008 but a 3-day database outage that stopped them from shipping DVDs was one of the reasons they quit running their datacenters and migrated to AWS.[2] Their complex workload isn't a good fit for bare-metal Linux and bash scripts; Netflix uses a ton of high-level PaaS managed services from AWS.

If bare metal is the layer of abstraction the IT & dev departments are comfortable working at, then self-host on-premise, or co-lo, or Hetzner are all cheaper than AWS.

[1] https://web.archive.org/web/20160319022029/https://www.compu...

[2] https://media.netflix.com/en/company-blog/completing-the-net...


Right, AWS rarely saves on hardware/hosting costs; it saves developer-hours. Especially if you're a fast-moving organization with rapidly changing hardware needs, something like AWS gives you agility.

That said, most organizations are not nearly so agile as they'd like to believe and would probably be better off paying for something inflexible and cheap.


> although I was worried that folks are too locked in to SaaS stuff

For some people the cloud is straight magic, but for many of us, it just represents work we don't have to do. Let "the cloud" manage the hardware and you can deliver a SaaS product with all the nines you could ask for...

> teaching a course on how to do all this ... there might be interest in that after all?

Idk about a course, but I'd be interested in a blog post or something that addresses the pain points that I conveniently outsource to AWS. We have to maintain SOC 2 compliance, and there's a good chunk of stuff in those compliance requirements around physical security and datacenter hygiene that I get to just point at AWS for.

I've run physical servers for production resources in the past, but they weren't exactly locked up in Fort Knox.

I would find some in-depth details on these aspects interesting, but from a less-clinical viewpoint than the ones presented in the cloud vendors' SOC reports.


I’ve never visited a datacenter that wasn’t SOC 2 compliant. Bahnhof, SAVVIS, Telecity, Equinix, etc.

Of course, their SOC 2 compliance doesn't mean we are absolved of securing our databases and services.

There’s a big gap between throwing some compute in a closet and having someone “run the closet” for you.

There is, however, a significantly larger gap between having someone “run the closet” and building your own datacenter from scratch.


A datacenter being SOC 2 compliant doesn’t mean any of your systems are. Same with PCI. Same with HIPAA. Cloud providers usually have offerings that help meet those requirements as well, but again, you can host bare metal, colo, cloud, or a tower under your bed; their compliance doesn’t do anything to cover your compliance.


Yes, quite right, that’s what I meant with my “I still have to do the work of securing my services”.

Would be the same no matter where I’m hosted.

Going to guess you meant to reply to the parent though?


They do cover your physical security requirements, which is still important.


You're describing stuff the colo provider does. I have no plans to describe how to set up a colo provider. I've never done that, and haven't seen the need. The cost of colo is not that significant.


As someone who lived through that era, I can tell you there are legions of devs and dev-adjacent people who have no idea what it’s like to automate mission-critical hardware. Everyone had to do it in the early 2000s. But it’s been long enough that there are people in the workforce who just have no idea about running your own hardware, since they never had to. I suspect there is a lot of interest, especially since we’re likely approaching the bring-it-back-in-house cycle, as CTOs try to rein in their cloud spend.


I used to help manage a couple of racks' worth of on-premises hardware in the early to mid 2000s.

We had some old Compaq (?) servers; most of the newer stuff was Dell. A mix of Windows and Linux servers.

Even with the Dell boxes, things weren't really standard across different server generations, and every upgrade was bespoke, except in cases when we bought multiple boxes for redundancy/scaling of a particular service.

What I'd like to see is something like Oxide Computer servers that scale way down, at least to a quarter rack. Like some kind of Supermicro meets Backblaze storage pod - but riffing on Joyent's idea of colocating storage and compute. A sort of composable mainframe for small businesses in the 2020s.

I guess maybe that is part of what Triton is all about.

But anyway - somewhere to start, and grow into the future with sensible redundancies and open-source BIOS/firmware/etc.

Not the typical situation today, where you buy two (for redundancy) "big enough" boxes - and then need to reinvent your setup/deployment when you need two bigger boxes in three years.


Yeah, having something like Oxide but smaller would be awesome.


In my 25 years, I've run some really big on-prem workloads and some of the biggest cloud loads (Sendmail.org and its mail servers, and Netflix streaming). Here is why I like the cloud:

Flexibility.

When Netflix wanted to start operating in Europe, we didn't have to negotiate datacenter space, order a bunch of servers, wait for racking and stacking, and all those other things. We just made an API call and had an entire stack built in Europe.

Same thing when we expanded to Asia.

It also saved us a ton of money, because our workload was about 3x peak to trough each day. We would scale up for peak, and scale down for trough.

We used on-prem for the parts where that made sense -- serving the actual video bits. Those were done on custom servers with a very stripped down FreeBSD optimized just for serving video (so optimized that we still used Akamai for images). But the part of the business that needed flexibility (control plane and interface) were all in AWS.

Why would a startup use the cloud? Both flexibility and ease. There aren't a lot of experts around who can configure a Linux box from scratch anymore. And even if you can, you can't go from coded-up idea to production in five minutes like you can with the cloud. It would take you at least a few hours to set up the bare metal the first time.


When you say “cloud”, are you including old school web hosts that will rent you a dedicated server?

Like OVH, Hetzner or Hivelocity?

Because you can get some insane servers for like $300/month (e.g. a brand new 5th-gen EPYC 48-core / 0.5TB RAM / lots of NVMe) and globally available.


Those could count. But you'll still end up having to do some Linux admin, which a lot of people can't do anymore.

The whole point is that the closer you can get to "write code, run code", the faster you can launch and innovate.


Linux admins still exist, except that they are better paid than ever at cloud providers. What you're describing is more payroll flexibility than technical flexibility.


How is it not technical flexibility? No matter what talent you have on payroll, you can't spin up a whole datacenter's worth of machines in Europe in less than a day without a cloud provider.

And I mean less than a day from "I think we should operate in Europe" to "we are operating production workloads in Europe".


It sounds like you’re describing PaaS then.


AWS is only expensive if you intend to run a lot of workloads and have a large, competent technical team.

For businesses with <10 servers and half an IT person, the cost difference is practically irrelevant. EC2+EBS+snapshots is a magic bullet abstraction for most scenarios. Bare metal is nice until parts of it start to fail on you.

I can teach someone from accounting how to restore the entire VM farm in an afternoon using the AWS web console. I've never seen an on-prem setup where a similar feat is possible. There are always some weird, arcane exceptions due to economic compromises that Amazon was not forced to make. When you can afford to build a fleet of data centers, you can provide a degree of standardization in product offering that is extraordinarily hard to beat. If your main goal is to chase customers and build products for them, this kind of stuff goes a long way.
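
For reference, the console flow being described reduces to a couple of API calls. A sketch of restoring one volume from a snapshot (all IDs and the AZ are placeholders):

  # Recreate a volume from a snapshot and attach it to a replacement instance
  # (placeholder IDs; the web console wizard does the same thing underneath).
  aws ec2 create-volume \
    --snapshot-id snap-0123456789abcdef0 \
    --availability-zone us-east-2a \
    --volume-type gp3

  aws ec2 attach-volume \
    --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 \
    --device /dev/sdf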

Long term you should always seek total autonomy over your information technology, but you should be careful to not let that goal ruin the principal business that underlies everything.


> For businesses with <10 servers and half an IT person, the cost difference is practically irrelevant.

If your infrastructure consists of ten t2.micro instances vs ten Raspberry Pis, then sure. In any other case, migrating VM or bare metal workloads from your own hardware straight onto EC2 is one of the most effective ways in the world to incinerate money.

You can do well if you've got a workload well suited to 'native' PaaS services like S3 and Lambda, but EC2 costs a fortune.


I'm confused why you would even need AWS then (what's running on the VMs)?

My impression is the standard compute (as in CPUs+RAM) isn't expensive; it's the storage (1 PB is less than half a rack physically now, compared with the yearly prices listed), so if you don't have much data, the value of on-prem isn't there.


For smaller shops I'd argue storage is the hardest part. I've done several OpenStack and baremetal K8s deployments on prem and the part that always stressed me out the most was storage. I'd happily pay a markup for that vs just about anything else that would be more economical to do on prem for smaller simpler workloads.


Also, encrypted storage on AWS is so simple. Encrypted root file systems on prem are not easy.


How so?

If you're a Windows shop, Bitlocker has been available to you since 2008.

If you're a Red Hat shop, Clevis + Tang has made this a no brainer since 2014.

If you have lots of money and run your root filesystems via FC or iSCSI from NetApp filers, then NSE has been around for close to 20 years now.
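
For anyone who hasn't seen it, the Clevis + Tang route really is short. A sketch, assuming a Tang server is already running at a hypothetical internal URL and the root volume is already LUKS-encrypted:

  # Bind an existing LUKS volume to a Tang server so it auto-unlocks at boot
  # when the management network is reachable (hypothetical device and URL).
  apt-get install -y clevis clevis-luks clevis-initramfs   # RHEL-family: dnf install clevis clevis-luks clevis-dracut
  clevis luks bind -d /dev/nvme0n1p3 tang '{"url":"http://tang.internal"}'
  update-initramfs -u                                      # RHEL-family: dracut -f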


This is it for me too. EBS is a bigger deal than the EC2 instances themselves.




Please do this course. It's still needed and a lot of people would benefit from it. It's just that the loudest voices are all in on Cloud that it seems otherwise.


>But everyone seemed to be flocking to them.

To the point that we have young devs today who don't know what VPS and colo (colocation) mean.

Back to the article, I am surprised it was only "a few years ago" that Fastmail adopted SSDs. Which certainly seems late in the cycle for the benefits SSDs offer.

Price for colo is on the order of $3,000/2U/year. That is $125/U/month.


> Which certainly seems late in the cycle for the benefits SSDs offer.

90% of emails are never read, 9% are read once. What could SSDs offer for this use case except at least 2x the cost?


Don't forget that Fastmail is reached over an internet transport with enough latency to make HDD seek times noise.


We adopted SSD for the current week's email and spinning rust for the deeper storage many years ago. A few years ago we switched to everything on NVMe, so there's no longer two tiers of storage. That's when the pricing switched to make it worthwhile.


Colo is typically sold on power, not space. From your example, you're either getting ripped off if it's for low-power servers or massively undercharged for a 4x A100 machine.


What??

I can get an entire rack at Equinix for ~$1,200/mo with an unlimited 10G internet connection.


HDDs are still the best option for many workloads, including email.


> I've been doing some planning over the last couple of years on teaching a course on how to do all this

Yes! It's surprisingly common to hear that it can't work, or can't scale or run reliably, when in fact all of that is being done. Talking about how you've done it is great from that perspective.

Also, it's worth talking about what you gain, qualitatively! As this post mentions, your high-performance storage options are far better outside the cloud. People often mention egress, too. The appealing idea to me is using your extra flexibility to deploy better stuff, not saving a bit of cost.


You know how to set up a rock-solid remote hands console to all your servers, I take it? Dial-up modem to a serial console server, serial cables to all the servers (or IPMI on a segregated network and management ports). Then you deal with varying hardware implementations, OSes, setting that up in all your racks in all your colos.
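
(Concretely, per box that means something like the following -- hypothetical BMC address and credentials -- multiplied across every vendor's quirks, every rack, every colo:)

  # Out-of-band access via the BMC on a segregated management network
  # (hypothetical address/credentials).
  ipmitool -I lanplus -H 10.10.0.21 -U admin -P 'hunter2' chassis power status
  ipmitool -I lanplus -H 10.10.0.21 -U admin -P 'hunter2' sol activate        # serial-over-LAN console
  ipmitool -I lanplus -H 10.10.0.21 -U admin -P 'hunter2' chassis power cycle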

Compare that to AWS, where there are 6 different kinds of remote hands, that work on all hardware and OSes, with no need for expertise, no time taken. No planning, no purchases, no shipment time, no waiting for remote hands to set it up, no diagnosing failures, etc, etc, etc...

That's just one thing. There's a thousand more things, just for a plain old VM. And the cloud provides way more than VMs.

The number of failures you can have on-prem is insane. Hardware can fail for all kinds of reasons (you must know this), and you have to have hot backup/spares, because otherwise you'll find out your spares don't work. Getting new gear in can take weeks (it "shouldn't" take that long, but there's little things like pandemics and global shortages on chips and disks that you can't predict). Power and cooling can go out. There's so many things that can (and eventually will) go wrong.

Why expose your business to that much risk, and have to build that much expertise? To save a few bucks on a server?


It's really not like that at all. If it was, I expect after 25 years of growth FastMail would probably have noticed. Much of what you're describing assumes a poorly run company that isn't able to make good choices -- if you have such a mix of odd hardware or OSes then that's a pretty bad sign.

Prioritise simplicity.

For remote hands, 2 kinds is sufficient: IP KVM, and an actual person walking over to your machine. Can't say I've had an AWS person talk to me on a cell phone whilst standing at my server to help me sort out an issue.

It's actually really fun, and saving 90% of what can be your largest cost can be a fundamental driver of startup success. You can undercut the competition on price and offer stuff that's just not available otherwise.

Every time this conversation has come up online over the last few decades, there are always a few people who parrot the claim that it's all too hard. I can't imagine these comments come from people who have actually gone and done it.


> Every time this conversation has come up online over the last few decades, there are always a few people who parrot the claim that it's all too hard. I can't imagine these comments come from people who have actually gone and done it.

My experience of this is that people either fall into the camp of having done it under a set of non-ideal constraints (leading them to do it badly), or it's post-rationalising that they just don't want to.


> Hardware can fail for all kinds of reasons

Complex cloud infra can also fail for all kinds of reasons, and they are often harder to troubleshoot than a hardware failure. My experience with server grade hardware in a reliable colo with a good uplink is it's generally an extremely reliable combination.


And my experience is the opposite, on both counts. I guess it's moot because two anecdotes cancel each other out?

Cloud VMs fail from either the instance itself not coming back online, or an EBS failure, or some other AZ-wide or region-wide failure that affects networking or the control plane. It's very rare, but I have seen it happen - twice, across more than a thousand AWS accounts in 10 years. But even when it does happen, you can just spin up a new instance, restoring from a snapshot or backup. It's ridiculously easier to recover from than dealing with an on-prem hardware failure, and actually reliable, as there's always capacity [I guess barring GPU-heavy instances].

"Server grade hardware in a reliable colo with good uplink" literally failed on my company last week, went hard down, couldn't get it back up. Not only that server but the backup server too. 3 day outage for one of the company's biggest products. But I'm sure you'll claim my real world issue is somehow invalid. If we had just been "more perfect", used "better hardware", "a better colo", or had "better people", nothing bad would have happened.


There is a lot of statistical and empirical data on this topic - MTBF estimates from vendors (typically 100k - 1m+ hours), Backblaze and Google drive failure data (~1-2% annual failure rate), IEEE and others. With N+1 redundancy (backup servers/RAID + spare drives) and proper design and change control processes, operational failures should be very rare.
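
To put a very rough number on it (back-of-envelope, assuming a ~2% annual failure rate, independent failures, and a one-week window to swap a dead unit):

  # P(primary fails this year) * P(its spare also fails within the one-week swap window)
  echo "scale=8; 0.02 * (0.02 * 7/365)" | bc   # => ~0.0000077, roughly 1 in 130,000 per year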

With cloud, hardware issues are just the start - yes, you MUST "plan for failure", leveraging load balancers, auto scaling, CloudWatch, and dozens of other proprietary dials and knobs. However, you must also consider control plane, quotas, capacity, IAM, spend, and other non-hardware breaking points.

Your autoscaling isn't working - is the AZ out of capacity, did you hit a quota limit, run out of IPv4s, or was an AMI inadvertently removed? Your instance is unable to write to S3 - is the metadata service being flaky (for your IAM role), or is it due to an IAM role / S3 policy change? Your Lambda function is failing - did it hit a timeout, or exhaust the (512MB) temp storage? Need help diagnosing an issue - what is your paid support tier? Submit a ticket and we'll get back to you sometime in the next 24 hours.


> The number of failures you can have on-prem is insane. Hardware can fail for all kinds of reasons (you must know this)

Cloud vendors are not immune from hardware failure. What do you think their underlying infrastructure runs on, some magical contraption made from Lego bricks, Swiss chocolate, and positive vibes?

It's the same hardware, prone to the same failures. You've just outsourced worrying about it.


The hardware is prone to the same failures, but the customers rarely experience them, because they handle it for you. EBS means never worrying about disks. S3 means never worrying about objects. EC2 ASG means never worrying about failed machines/VMs. Multi-AZ means never worrying about an entire datacenter going down.

Yes, you pay someone else to worry about it. That's kinda the whole idea.


ok...?

But, it comes at a cost. And that cost is significant. Like magnitudes significant.

At what point does it become cheaper to hire an infra engineer? Let's see.

In the US a good infra engineer might cost you $150K/yr all in. That's not taking into account freelancers/contractors who can do it for less.

That's ~$12K/mo.

That's a lot of compute on AWS...but that's not the end of the story. Ever try getting data OUT of AWS? Yeah, those egress costs are not chump change. But that's not even the end of it.

The more important question is: what's the ratio of hosting/cloud costs to overall revenue? If a colo/owned DC will yield better financials over a few quarters, you'd be bananas as a CTO to recommend the cloud.


The bigger cost is what will happen to your business when you're hard-down for a week because all your SQL servers are down, and you don't have spares, and it will take a week to ship new servers and get them racked. Even if you think you could do that very fast, there is no guarantee. I've seen Murphy's Law laugh in the face of assumptions and expectations too many times.

But let's not just make vague claims. Everybody keeps saying AWS is more expensive, right? So let's look at one random example: the cost of a server in AWS vs buying your own server in a colo.

  AWS:
    1x c6g.8xlarge (32-vCPU, 64GB RAM, us-east-2, Reserved Instance plan @ 3yrs)
       Cost up front: $5,719
       Cost over 3 years: $11,437 ($158.85/month + $5,719 upfront)

  On-prem:
    1x Supermicro 1U WIO A+ Server (AS -1115SV-WTNRT), 1x AMD EPYC™ 8324P Processor 32-Core 2.65GHz 128MB Cache (180W), 2x 32GB DDR5 5600MHz ECC RDIMM Server Memory, 2x 240GB 2.5" PM893 SATA 6Gb/s Solid State Drive (1 x DWPD), 3 Years Parts and Labor + 2 Years of Cross Shipment, MCP-290-00063-0N - Supermicro 1U Rail Kit (Included), 2 10GbE RJ45 Ports : $4,953.40
    1x Colo shared rack 1U 2-PS @ 120VAC: $120/month (100Mbps only)
      Cost up front: $4,953.40 (before shipping & tax)
      Cost over 3 years: $9,273 (minimum)
So, yes, the AWS server is double the cost (not an order of magnitude) of the Supermicro (& this varies depending on configuration). But with colocation fees, remote hands fees, faster internet speeds, taxes, shipping, and all the rest of the nickel-and-diming, the cost of a single server in a colo is almost the same as AWS. Switch to a full rack, buy the networking gear, remote hands gear, APCs, etc. that you'll probably want, and it's way, way more expensive to colo. In this one example.

Obviously, it all depends on a huge number of factors. Which is why it's better not to just take the copious number of "we do on-prem and everything is easy and cheap" stories at face value. Instead one should do a TCO analysis based on business risk, computing requirements, and the non-monetary costs of running your own micro-datacenter.


> The bigger cost is what will happen to your business when you're hard-down for a week because all your SQL servers are down, and you don't have spares, and it will take a week to ship new servers and get them racked. Even if you think you could do that very fast, there is no guarantee. I've seen Murphy's Law laugh in the face of assumptions and expectations too many times.

Let's ignore the loaded, cherry-picked situation of no redundancy, no spares, and no warranty service. Because this is all apparently magically hard now that cloud providers exist, even though many of us did this, and have done this, for years....

There is nothing stopping an on-prem user from renting a replacement from a cloud provider while waiting for hardware to show up. That's a good logical use case for the cloud we can all agree upon.

Next, your cost comparison isn't very accurate. One is isolated dedicated hardware, the other is shared. Junk fees such as egress, IPs, charges for bare-metal instances, IOPS provisioning for a database, etc. will infest the AWS side. The performance of SAN vs local SSD is night and day for a database.

Finally, I can acquire that level of hardware performance much cheaper if I wanted to; an order of magnitude is plausible and depends more on where it's located, colo costs, etc.


These servers are kinda tiny, and ignore the cost of storage. From the article, $252,000/y for 1 PB is crazy, and that's just storing it. There's also the CapEx vs OpEx aspect.


Yeah, if you don't have levels of redundancy, then you're pretty screwed. We could theoretically lose 2/3 of our systems and still have sufficient capacity, because our metric is 2N primary plus N secondary, and we can run with half the racks switched off in the primary, or with the secondary entirely switched off, or (in theory, there are still some kinks with failover) with just the secondary.


This. All of this and more. I've got friends who worked for hosting providers who over the years have echoed this comment. It's endless.


> As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing and features and kept wondering what I was missing.

How do the availability/fault tolerance compare? If one of your geographical locations gets knocked out (fire, flood, network cutoff, war, whatever) what will the user experience look like, vs. what can cloud providers provide?


As a customer of Fastmail and a fan of your work at FastAI and FastHTML I feel a bit stupid now for not knowing you started Fastmail.

Now I'm wondering how much you'd look like tiangolo if you wore a moustache.


Now I wonder what he'd look like without the moustache :)


Jeremy is all the Fast things!


" teaching a course on how to do all this..." Can you provide some notice of this so I can schedule my vacation time to fully participate? Let me know when registration is open.


What is the software side of things like? Is your team managing these servers directly — or is it "cloud like" with containers (Kubernetes?), IaC tools, etc.


I would gladly take your course if you offered it.



