[flagged] Did the cloud make us over-engineer some systems that could have been simpler? (twitter.com/dominicstpierre)
42 points by alexzeitler on Dec 19, 2023 | 67 comments



The cloud has eaten all the hardware improvements of the last decade. It's insane that my base-model MacBook routinely outperforms the (expensive!) AWS crap that most companies use as their application "servers", and yet that's the reality in many places.

So to a certain extent, you are forced to over-engineer for distributed workloads because any single node has the performance of a pocket calculator. This obviously benefits the cloud provider because not only do you rent more resources from them but you now rely on their management platform to orchestrate all those workloads.


The cloud was principally designed in the 2005-2013 era, when it was pretty reasonable for an average website to need to go multi-system to support traffic spikes.

CPU hardware kind of stagnated from 2014 to 2018, while SSDs took off. But over the last 5 years we've seen some pretty massive improvements everywhere.

How many sites need more than 500 threads, 6TB of RAM, and 100Gbit egress? All of which you can get in a single system now.

Some do, certainly, but then we're talking about the tiny narrow edge on the upper end of the curve. Almost every new web system built today can fit on a single box, and can become massively profitable on that single box unless you unicorn out and need the new architecture which you can now afford. And... that should be the new default.

But... with the cloud, we're stuck with complex massively parallel services or people starting up on tiny rented instances that are, as you say, pocket calculator equivalents.


Yep, at most you might want 2 boxes for HA, but you definitely don't have to start with that, and it also depends on what you're doing.

I know some are going to argue that it's a slippery slope from that simple HA to DR and all the other complexities, and there's some truth in that, but that complexity can be controlled with explicit decision making. It's not obvious that a simple HA/DR story has to end up with modern day cloud architectures.


My r7g.16xl absolutely smokes my maxed-out M2 Max 16". I had to spin it up because I was sending my M2 into thermal shutdown. A task that takes 2.5 hours on the M2 Max takes 8 minutes on the r7g. I'm crunching about 16TB of log files streamed from S3 and collating them to Parquet. I even ran it from the office on Ethernet on Amazon's corp net and it was still much faster on CPU and net. I even tested it with everything locally on disk. Not even close. The r7g is cranking along at 5.2GHz, all cores, all day.


> My r7g.16xl absolutely smokes my maxed-out M2 Max 16"

Yes, because it's apples (50-80 watt TDP) to oranges (200-600 watts). But that's also not the point. All those layers of microservices, or pushing transactions through S3, slow everything down.

Instead of a beefy machine that takes a request and does virtually all of it locally, it needs to jump out to either a local sidecar or another service in the cluster. At best it'll be some message bus; at worst it'll be something over k8s's batshit networking scheme.

S3 is fine for streaming medium-sized files, but as soon as you start modifying them, boom, massive overhead: any change and you are re-writing the entire file. That might be OK for your needs, but for others it's a pain in the balls. Lots of small files means you are waiting on S3 IO. S3 has improved significantly, and it can stream reasonably fast, but it's not a _fast_ object store.

Now, of course, it's got 30Gbit/s of network. Great, but if you're wasting that on writing out whole files for every single change, you'll soon eat it up.
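To make that concrete, a minimal Go sketch (assuming aws-sdk-go-v2; the bucket, key, and file names are made up): a local POSIX file can be patched in place, while S3 only accepts whole-object PUTs.

    package main

    import (
        "context"
        "os"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
    )

    func main() {
        patch := []byte("a few changed bytes")

        // Local POSIX file: write the patch in place at offset 1MiB.
        f, err := os.OpenFile("big.parquet", os.O_WRONLY, 0)
        if err != nil { panic(err) }
        f.WriteAt(patch, 1<<20)
        f.Close()

        // S3: objects are immutable, so the same small change means
        // re-uploading the entire object, however large it is.
        ctx := context.Background()
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil { panic(err) }
        whole, err := os.Open("big.parquet") // the whole file, again
        if err != nil { panic(err) }
        defer whole.Close()
        s3.NewFromConfig(cfg).PutObject(ctx, &s3.PutObjectInput{
            Bucket: aws.String("my-bucket"), // hypothetical bucket/key
            Key:    aws.String("big.parquet"),
            Body:   whole,
        })
    }

(Multipart-copy tricks can avoid re-sending unchanged parts in some cases, but the default story is the full re-upload.)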


The GP claimed the Mac was more powerful than the cloud instance, which is what I responded to with an apples-to-apples equivalent. And yes, in wattage it's apples to oranges, which I'm agreeing with. Your point at the top is exactly what I'm getting at: a top-end Mac really can't compare to a much more powerful ARM box with much more RAM, a much better network, and much better cooling, running a lot hotter.

I'm not sure why you're jumping to some rant about k8s, which I also dislike and don't use.

The M2 Max is really fast compared to the Intel Macs, and I love it because it rarely runs into cooling issues; it's just a lot better. But you can make it heat up, and when you do, it tanks in speed by a similar percentage as the Intel boxes do. Especially compared to a machine that's literally got a leaf blower cooling it.


I don't doubt you can find an instance that smokes your average workstation. The question is how much more it costs over just buying the equivalent machine outright.

Is the markup reasonable, is it crazy, or is it an outright scam such as charging for egress data? Knowing AWS, it definitely won't be the first option, because why would it be if people are willing to pay more?


Ran the numbers for you.

On-Demand: $3.43/hr ($2503.90 monthly)

Reserved: $2.27/hr ($1657.10 monthly)

M3 Max 16" (all upgrades; to be generous): $7199.00

That means about 3 months on-demand or ~4.5 months reserved to break even.
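As a quick Go sanity check on those numbers (prices as above):

    package main

    import "fmt"

    func main() {
        const (
            onDemand = 2503.90 // USD/month, on-demand
            reserved = 1657.10 // USD/month, reserved
            mac      = 7199.00 // USD, M3 Max 16" with all upgrades
        )
        fmt.Printf("on-demand break-even: %.1f months\n", mac/onDemand) // ~2.9
        fmt.Printf("reserved break-even:  %.1f months\n", mac/reserved) // ~4.3
    }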

Not nearly as bad as I was expecting, to be perfectly honest with you.

I did similar calculations recently and it ended up being 3 months for equivalent server-grade hardware (i.e. 256GiB ECC RAM, 64 physical threads, 8TiB NVMe) -- so I bought the hardware and have since saved more than 3x the cost.

That seems not to be the case with Graviton instances, I guess. :D


I work for an Amazon subsidiary; I get hella discounts as well.


@dijit what server hardware did you get?



don’t forget spot pricing!


My Mac cost like 7k with taxes.

The r7g costs a lot less, as I can turn it off and return it when I'm not using it. To do that processing job I'm paying less than a few bucks... less than that, as I work for a company with AWS discounts. At the rates I get for those machines, it'd take several years of processing to equal the 7k or so my Mac cost. And it's, again, absurdly faster. When you take actual streaming into account, which is unfair since the data sits in the same data center as the cloud instance, the job on the Mac took 4 hours just to stream the data in and map it, and around 4 hours to reduce it. Both stages on the r7g were around 16 minutes. Again unfair, since even though I have gigabit at home, I'm still pulling 2-4x the throughput from S3 on the cloud instance.

My workloads aren't generally egressing tons of data from AWS; I'm not running a file-sharing service... that's not the majority of our costs. I'm not just bulk-serving data from S3, I'm doing a lot of processing on it.

Anyway, I provided numbers as close as I can. Take them or leave them, depending on whether you believe me.


That machine costs around US$2.5k/month. It gets expensive very quickly.

In the early days of the cloud, part of the markup was justified because you wouldn't have to do maintenance, sell old to buy new, etc. At these rates though, I fail to see the appeal - at least from the point of view of cost per compute power.


Sys Admins and OS Image teams get expensive quickly. So does software licensing. When we moved to AWS we saved a fortune - and no, it wasn't in server costs.


Are you saying you don't need those because your compute is in the cloud? Who's operating those VMs? Who's hardening the OS images?

What about software licensing? Whatever license is required to run software on-prem is probably exactly the same to run it on an EC2 instance. Running in the cloud doesn't suddenly absolve you from having to pay licenses for the software. That would be piracy.

This particular EC2 instance costs 2.5k/month. The MacBook in question costs what, 2x that? 3x? Let's exaggerate and say it's 5x, so US$12.5k once. That machine is likely to last years in operation. Are you saying all those years at 2.5k/month for each instance are not enough to pay a team to operate them?

Note that I'm not against doing things in the cloud. Many times, perhaps even most of the time, it's the right decision. That doesn't mean it's always the right one.


Every company I've been involved in had some full-time DevOps people wrangling YAML files. I'm not buying the argument that cloud providers reduce manpower demands. PaaS sure, but not cloud.


Aye. I love the following three-step solution for compute workloads:

1) Make something right. 2) Reduce RAM requirements. 3) Apply GNU Parallel.

It's magical just how much batch work can be done on one manycore system when one keeps it hot.
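GNU Parallel is a shell tool, but if the batch job is Go anyway, the same keep-every-core-hot idea is a few lines. A sketch, where process() is a stand-in for the real work:

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    // process stands in for the real per-item batch work.
    func process(job string) { fmt.Println("done:", job) }

    func main() {
        jobs := make(chan string)
        var wg sync.WaitGroup
        for i := 0; i < runtime.NumCPU(); i++ { // one worker per core
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := range jobs {
                    process(j)
                }
            }()
        }
        for _, j := range []string{"a.log", "b.log", "c.log"} {
            jobs <- j
        }
        close(jobs)
        wg.Wait()
    }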


Even crazier that a single AWS instance class has a 30% difference in performance between instances, depending on the task performed.


I wish I had a time machine, so I could go 15 years into the past and watch the expressions on people's faces when I tell them: "You think J2EE with EJB has gone too far? Let me show you this microservice application architecture and the accompanying tracing, observation, instrumentation, service mesh and sidecars from the future! But wait, there is more! We have a thing we call....DevOps!"


... And yet most of it still uses Jenkins in there somewhere!


"It became so complex that we had to invent a use a new word: 'orchestration'"


Oh yes, the great times when root was a god. /s


   ssh root@master.lan
   
   vi /var/www/index.php
Bug fixed, boss!


I know this is a joke and an objectively terrible practice in many ways - but I still do this on many personal projects and the productivity boost feels amazing.

Sure, it doesn't scale and there are a host of problems with it. But I still long for those days.


Ok, let's talk about why fixing it directly on production is a bad idea. Again... deep breath


Ok, let's talk about why fixing it directly on production is a bad idea. Again... sigh


Yes. I’ve been freelancing for a company making a couple of billion in revenue here in Denmark, where basically the most critical infrastructure is running on maybe 2-4 Windows boxes with a big MSSQL database backing it. They serve a lot of daily transactions because, among other things, they primarily sell gas.


On-premises simplifies everything, from backups to response time. The only big advantages of the cloud are on-demand scaling and managed applications like dedicated WordPress instances. For stable, general-purpose computing workloads, on-premises rules.


This case is also extremely simplified by the fact that the customer only serves users nationally in Denmark. They even got away with running their own fully redundant DC in the basement, and I suspect even with really great latency to customers, compared with running servers in Germany, for instance.


Yes of course. If we could have run our own custom data centers, our software could have been simpler.

What you’re not factoring in WRT the cloud is how complex maintaining an air conditioning unit is, or servicing diesel generators. (The list continues for several hundred man-hours a week of work.)

Your company has saved tremendous complexity. Your software has eaten some of that complexity.

Yes, keep it simple when you can. No, major applications that serve millions of users globally can’t run on your laptop “because SQLite is fast!”.


Hosting providers have been managing uninterrupted power and air conditioning for decades at reasonable cost. There are options for every budget, from the cheapest (Hetzner, OVH) to more expensive (Equinix Metal, Deft, etc).

I'm not sure why this trope keeps getting peddled again and again. Is it obtuse career-driven thinking (someone's salary depends on pretending not to understand?), or am I really getting that old that there's an entire new generation of engineers not aware how hosting was done before 1000%-markup "cloud" providers entered the industry?


I mean, I don’t believe my resume should be required to back up my point, but I literally maintained AC units and refueled diesel generators in a datacenter as a kid. Obviously there’s more to it than A/C, and yes, Equinix does a lot, but it’s the same trade-off: slightly more complex software to eliminate non-company-specific specialities. Yes, it’s worth it; my evidence is that the infrastructure engineers I’ve met aren’t stupid.


I don't understand what, in your mind, makes overpriced cloud good while non-overpriced cloud (or "hosting", as we used to call it) is bad.

Both manage air conditioning, power, fire suppression, etc., and ultimately live and die by their uptime, so they have an incentive to do it right.


That wasn't the comparison. It was cloud or no cloud. Running your own hosting infrastructure from the ground up is a considerable complexity. AWS is not valuable because people are too stupid or lazy to run their own systems.


There's a lot of nuance in cloud or no cloud. You jumped right to AC and diesel.

You can have a pretty sizable server capability on modern hardware before you need to touch that.

I can serve what... 1M users? Perhaps 10M? before my server equipment needs its own AC unit, depending on the system's needs, architecture, complexity, and perhaps language. I can get 100Mb-1Gb of egress bandwidth pretty easily at most residences, let alone business locations.

I can buy battery backup capacity for cheaper than ever, and can probably skip the generator. If you go multi-region, maybe even skip the deep backup power in favor of just enough to get a clean shutdown, and just fail over early.

It's never been easier to do hosting than it is now. If you get your head out of the clouds anyway.


All of this. None of the physical or network stuff requires aws. Not even close.


Still, if your software is data-center scale, it's better to build your own data center and hire the needed expertise. You have more control of what happens there (desired and undesired), and it's probably going to be cheaper, because you still need most of that expertise to herd your numerous cloud instances.


Even Cloudflare isn't building their own datacenters


The problem isn’t over-engineering, it’s not engineering at all.

How many teams design for scale and then forget the number? The assumption that the entire system needs to scale to population-of-the-planet-in-2050 workloads? Most developers won’t even napkin-math their third-party (3pp) API calls. There are no upper or lower bounds on things. There are few constraints other than “use AWS”. It’s the engineering equivalent of saying “use McMaster-Carr”, then shrugging and walking off.
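For illustration, here is the kind of napkin math that's usually missing; every number below is invented:

    package main

    import "fmt"

    func main() {
        const (
            users        = 1000.0 // assumed concurrent users
            viewsPerMin  = 2.0    // page views per user per minute (guess)
            callsPerView = 3.0    // 3pp API calls per page view (guess)
            vendorLimit  = 100.0  // calls/sec the vendor allows (made up)
        )
        callsPerSec := users * viewsPerMin * callsPerView / 60.0
        fmt.Printf("3pp calls/sec: %.0f (vendor limit: %.0f)\n", callsPerSec, vendorLimit)
        // 100 calls/sec against a 100/sec limit: saturated before any spike.
    }

Ten minutes of this puts an upper bound on the system before anyone reaches for an architecture diagram.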


FCGI and Go connected to Postgres/MySQL can handle quite a bit of load in a very simple environment. You can run those on shared hosting very cheaply. Do you really need more?
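For what it's worth, the Go standard library covers the FCGI side out of the box. A minimal sketch (the handler body is a placeholder):

    package main

    import (
        "fmt"
        "net/http"
        "net/http/fcgi"
    )

    func main() {
        h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "hello from shared hosting")
        })
        // With a nil listener, fcgi.Serve speaks FastCGI over stdin,
        // which is the setup classic shared hosts hand you.
        if err := fcgi.Serve(nil, h); err != nil {
            panic(err)
        }
    }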

Remember Soylent boasting about their "tech stack"? It was just a shopping cart for something that sold a few items per minute. They could have run that thing on a starter account on Dreamhost or HostGator.


Yup. That’s the cloud working as intended: every slightly different functional component split into its own AWS (or whatever) service, with its own IAM roles and billable usage.

Then we decided it wasn’t enough and we wanted Kubernetes in our cloud architecture too.


There are plenty of AWS services that have nearly totally abstracted away complexity and work mostly out of the box for simple use cases.


Weird that this is flagged.

I was called in to fix an ETL job that took over 24 hours using all the cloud services. I wrote a Go program that did it in less than 15 minutes.


Unjustified flagging in my opinion. It would be nice if flagging required selecting which guideline is being broken, so others could see the weighted reasons for flagging.

More anecdata: we have cloud backups that take hours, while on-premises backups of the same VM take minutes. The quirks of cloud have added complexity to our housekeeping.


My theory is that the HN audience is filled with:

A) People working at cloud providers who have a financial incentive to demerit any anti-cloud sentiment

B) People who do not know what life was like before cloud and think it means that developers will have to go and build datacenters

C) People who don't really care about the cost because it's their employer that foots the bill, they just know that working in cloud pays more so they want the gravy train of inscrutable services and will ride the sunk cost fallacy.

D) People who genuinely don't understand how servers work and it scares them that someone might have different technical knowledge than them in the company... and they might be able to say no to them.

I could be wrong, but if any of the above is true, then it explains why people keep bringing out the same tired arguments that hold no weight (more staff / build datacenters from atoms of sand) and why articles like this get flagged immediately.

---

Of course, it could also be that the discussion becomes a tired circlejerk and lacks any intellectual curiosity; however, I don't believe that.


Keep in mind that the number of devs that are inexperienced is very high. Growth in the industry has been explosive. So there is a large population of devs that just don't understand how one might reasonably live, and even prosper, on the literal server in the basement or closet. Doing so sounds scary and arcane. They have never experienced the old memes of tripping on server power cables and backhoes accidentally digging through your network line being among the more dangerous events in life.

Among the older devs, a sizable amount of them worked in large companies with ineffectual IT departments where the cloud also meant scrapping the red tape in favor of shadow-IT and actually being able to get stuff done. They, largely, think the cloud is the greatest because it materially improved their lives in poorly run organizations. So it simply must be grand.

With these motivations you can come up with any number of arguments that sound plausible to defend the cloud.

There are a few people with significant experience on both sides who advocate for cloud. I must conclude that their experience largely falls into the few buckets where it does tend to make sense.

My experience is that cloud is insanely expensive - if you can run and keep a competent IT dept.


I've often thought "what is the minimum stack, with minimum moving parts that just scales endlessly?"

If you reduced all the data flow of your distributed software (10,000s of machines, 100,000s of containers) down to an integer position for each value, could the position and movement of all values be traversed on a single thread on a single machine and calculated in a tiny amount of time? What's the big-O of your system's data flow?

I have this intuition that logistics and movement underlie the complexity of computer systems, and that shipping bits around and dispatching everywhere is the complex part.

If we could write a formally verified state space with software similar to TLA+ and formal methods, we could generate all the plumbing.

Compilers work out how to place things in a calling convention; REST APIs and event-queue schemas are the same thing, except we do it manually.


Or just buy an IBM mainframe.

Let the hardware and OS figure that shit out.


They are complex for a reason: there was an engineer who wanted to add that thing to their resume.


Yes, and no.

On the one hand, Lambda means there is almost nothing between you and an HTTP endpoint.

However, a lot of "cloud engineering" is based on two limitations that were inherent to AWS:

1) No machine migration (i.e. the ability to move a VM to a new host without downtime).

2) No large-scale POSIX file system.

This forced people, for good or ill, to push lots of things to S3 (slow and inflexible, but ultimately easy enough for 70% of cases). It also meant that people went batshit storing state in unusual ways, to allow for recovery when having to rebuild machines.

Had EC2 had some of the features that VMware had at the time (I suspect Xen had them too, but they were too hard to scale or secure), the world would look more like mainframes rather than "cloud".


Start looking at Google cloud.

They have VM migration built in and it's transparent as heck. Basically seamless. You might miss a single packet (in my observed experience). This was one of the main reasons I chose them.

They also have managed NFS, but I can't speak to the quality as I've never used it: https://cloud.google.com/architecture/partners/netapp-cloud-...


It's always been possible; it's just that AWS offered something good enough first.

It's a lesson for us in programming as AI starts to automate large parts of it: being technically the best is nothing when you can replace slow, hard-to-build things with shittier but significantly faster-to-deploy things.


Depends:

No, if the goal is job security.

Yes, if the goal is to build the product.


Excuse me brother, did you forget DevOps scripture may not be uttered outside sanctified Discord churches?


Microservice architecture is not the same thing as Cloud architecture.

Plenty of companies run IaaS-heavy applications in the cloud at scale, and do so cost-effectively. There is some extra overhead in writing automation around it, as well as in cost management (cloud finance roles are a new and weird job title, for example). I've certainly seen it done. It's not necessarily the easiest thing, but it's doable.

Just like I've also seen Microservice architecture outside of the cloud. And monolith architecture inside the cloud.


The cloud didn't make anyone over-engineer things. The same folks who overused all the patterns mentioned would have overused other patterns in other contexts if those were the dominant paradigms instead.

Microservices are great when you have different workloads with vastly different performance needs. They're not great when you essentially have a single function running on an entire service. Many things will always be over-engineered because simplicity is hard.


> Do we really need event-driven architecture, serverless functions, hundreds of micro-services for a web application that barely have 1k concurrent users

This is verbatim the thinking that led to needless overengineering - focusing on number of concurrent users as the only requirement and metric of success.

You may require event-driven architectures, serverless functions, etc to meet business requirements, regardless of how many concurrent users you have.


No. No more than anyone made Motorola over-engineer the first cellphone. We have been over-engineering since long before the cloud and will continue long after.

Did it make it easy to over-engineer? Yes.

Is it more common for things to be over-engineered? Probably.

Does this question gloss over the actual cost of getting something running that is useable by more than a handful of people? Yes.

Would I ever want to go back and write code like we did 20+ years ago? Nope.


I would like to take a stance on one aspect of this argument, namely costs.

If you design it smart, then your application can scale both ways: not just up, but also down, even down to zero in some cases. So as soon as your app is not used anymore, or not used as much, the costs should go down. If you buy the hardware, or rent it and calculate the costs up front for a year, then it's not that easy.


Betteridge’s law applies here.

The underlying problem is not “the cloud” but whether your organization has a handle on managing complexity and costs. 20 years ago, the same “architects” were telling everyone you had to build clusters of enterprise Java application servers where no stack trace is less than 200 levels deep. I certainly saw plenty of apps with millions of dollars in hardware load balancers, expensive licensed Java app servers, Oracle HA databases, etc. which were comically outperformed by a Linux box or two running PHP and MySQL (which usually had better reliability, too, since there was so much less to break).

Cloud applications can be simplicity wins or losses depending on your business culture. If you keep it simple, you’re running a couple of Lambda functions or something like an ECS Fargate container, S3, RDS, etc. and saving a ton of time versus both the places rolling their own Cloud-ception K8S platform _and_ the people running a fleet of full servers in their own data center. In both cases, it’s a question of whether you’re adding complexity because the business problems require it or because it’s more fun to work on / looks better on your CV.
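To make the simple end of that spectrum concrete: the whole "server" can be one function. A sketch assuming the aws-lambda-go library (the handler body is invented):

    package main

    import (
        "context"

        "github.com/aws/aws-lambda-go/events"
        "github.com/aws/aws-lambda-go/lambda"
    )

    // One function behind API Gateway: no fleet to patch, no YAML to wrangle.
    func handler(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
        return events.APIGatewayProxyResponse{
            StatusCode: 200,
            Body:       "ok: " + req.Path,
        }, nil
    }

    func main() { lambda.Start(handler) }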



cloud is good. it is a new option. most interesting new options come inescapably with new potentials for abuse.

personally, my most interesting work is bootstrapping nights and weekends. the optionality of cloud opens entire new fields here. this is good.

in day to day enterprise engineering, cloud isn’t as disruptive.

i see the inefficiency in enterprise cloud as a failure of the education system.

what could have been learned in throw away school projects is instead learned in throw away enterprise projects.

market inefficiency aside, it’s fine.

cloud isn’t the first footgun, and won’t be the last. education remains the only answer.


“hundreds of micro-services for a web application that barely have 1k concurrent users.”

That would not be appropriate in any of the mainstream cloud-native architecture styles.


Unless someone convinced someone that they should be prepared in case they become BIG and get millions of concurrent users.

And even then, nothing is sure. During the COVID-19 pandemic, Pokémon Go, thanks to an adaptation of game mechanics, literally exploded in users. Despite being world-scale, they struggled for a couple of weeks to meet the workload. World-scale naturally degrades to normal-scale if you don't stress it regularly.


> That would not be appropriate in any of the mainstream cloud-native architecture styles.

I mean, it ends up being like that. A significant number of medium-sized sites bust out into micro-services at some point. Or they start off as serverless and realise that actually it's not as easy to scale as was claimed.


Yes.

Partly because of the ephemeral nature of most cloud platforms/services, and also the vagaries of proprietary, provider-specific configuration.



