Designing a scalable API on AWS spot instances (adapty.io)
129 points by iwitaly on July 23, 2020 | 57 comments


There's an excellent implementation using AWS lambda to manage spot instances at https://github.com/AutoSpotting/AutoSpotting

What's fantastic about the autospotting implementation is:

1. It dynamically replaces existing instances with spot instances just by setting a tag

2. Rather than replacing the instance with a fixed spot instance type, it will choose the cheapest that fits the requirements.

3. If there are no spot instances available that fit the requirements, it will spin up on demand instances until spot instances are available.

If you have experience working with spot, you'll know these are really outstanding features, ones that hopefully Amazon will bake in natively in the future.
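For reference, opting a group into AutoSpotting is (per the project README) just a matter of tagging it; a minimal boto3 sketch, where the group name is a placeholder:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # AutoSpotting opts in per Auto Scaling group via a tag; the tag key
    # follows the project README, and "my-asg" is a placeholder name.
    autoscaling.create_or_update_tags(
        Tags=[{
            "ResourceId": "my-asg",
            "ResourceType": "auto-scaling-group",
            "Key": "spot-enabled",
            "Value": "true",
            "PropagateAtLaunch": False,
        }]
    )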


Turns out this is about EC2 spot instances for ECS. How would it compare to ECS Fargate spot these days?

I'm also missing a discussion about designing for interruption, either by not keeping state, or by being able to shed state quickly, to be picked up by other instances.
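For context, the standard building block for this on EC2 spot is the two-minute interruption notice in instance metadata; a minimal polling sketch (assumes IMDSv1; the drain step is a placeholder):

    import time
    import urllib.error
    import urllib.request

    # Returns 404 until AWS issues the two-minute interruption notice.
    # Assumes IMDSv1; IMDSv2 would additionally need a session token.
    METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def interruption_pending() -> bool:
        try:
            with urllib.request.urlopen(METADATA_URL, timeout=1) as resp:
                return resp.status == 200
        except urllib.error.URLError:  # includes the 404 HTTPError case
            return False

    def drain_and_shed_state():
        # Placeholder: deregister from the LB, hand state to other instances.
        pass

    while not interruption_pending():
        time.sleep(5)
    drain_and_shed_state()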

Also, if you set up EC2 spot with a launch template or ASG with very differently-sized instance types (to reduce risk of running out), is there a way to even out the load coming through an ALB? The least-connections scheduling can help in some cases, but a connection might not map 1:1 to one unit of load. The ALB can use weighted balancing, but on the target group level. Dunno how easy it would be to allocate different instance sizes to different target groups and weigh them accordingly.
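On the last question, ALB weighted forwarding is set per listener action, so one untested approach is a target group per instance-size class, weighted roughly by capacity; a sketch where all ARNs are placeholders:

    import boto3

    elbv2 = boto3.client("elbv2")

    # One target group per instance-size class; weights roughly
    # proportional to instance capacity. ARNs are placeholders.
    elbv2.modify_listener(
        ListenerArn="arn:aws:elasticloadbalancing:...:listener/app/my-alb/...",
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": "arn:...:targetgroup/tg-large/...", "Weight": 2},
                    {"TargetGroupArn": "arn:...:targetgroup/tg-xlarge/...", "Weight": 4},
                ]
            },
        }],
    )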


AFAIK with Fargate a lot of this is handled for you, as long as you have the auto scaling group.

We have this set up with two capacity providers (FARGATE_SPOT and FARGATE) and a 75/25 split, meaning that even if no spot capacity is available we will still be up.

The benefit of Fargate being that we don't need to care if certain instance sizes are not available as that is handled by AWS.
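For the record, that split maps onto ECS capacity provider weights; a hedged boto3 sketch, where the cluster, service, and subnet names are placeholders:

    import boto3

    ecs = boto3.client("ecs")

    # A 3:1 weight gives roughly the 75/25 FARGATE_SPOT/FARGATE split
    # above; base=1 keeps at least one task on regular Fargate.
    # Names and subnets are placeholders.
    ecs.create_service(
        cluster="my-cluster",
        serviceName="my-api",
        taskDefinition="my-api:1",
        desiredCount=8,
        capacityProviderStrategy=[
            {"capacityProvider": "FARGATE_SPOT", "weight": 3},
            {"capacityProvider": "FARGATE", "weight": 1, "base": 1},
        ],
        networkConfiguration={
            "awsvpcConfiguration": {"subnets": ["subnet-aaa", "subnet-bbb"]}
        },
    )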


Cool. When Fargate launched they didn't have a spot option (AFAIK), and since we run ECS on spot instances it would have been a massive cost increase to switch to Fargate. But if it can now use underlying spot capacity, it might be worth looking at again.


Yeah, spot capacity providers for Fargate were only added a few months ago; they've been running well for us in production.


(Not the OP, but running a fairly similar setup, albeit for EKS nodes rather than ECS nodes :) )

Fargate Spot is about a third of the price of Fargate (at least in eu-west-1 now according to: https://aws.amazon.com/fargate/pricing/ ); so the savings are roughly identical.

Re risk of running out, our current strategy is to use different-but-closely-similar instance groups; so for example we have an autoscaling group running a mix of:

- m5.large
- m5dn.large
- m5n.large
- m5ad.large
- m5d.large

These are all the same spot price, but I'd wager it'd be pretty rare to have all of these families reclaimed at once.

(We also use some on-demand only ASGs with lower priority in the cluster-autoscaler to ensure that if it _does_ happen, then we'll have a fallback)
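That diversification strategy maps directly onto an ASG MixedInstancesPolicy; a sketch assuming an existing launch template named "web" (names and subnets are placeholders):

    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-spot",  # placeholder name
        MinSize=2,
        MaxSize=20,
        VPCZoneIdentifier="subnet-aaa,subnet-bbb",  # placeholder subnets
        MixedInstancesPolicy={
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": "web",  # assumed to exist
                    "Version": "$Latest",
                },
                # The same-price m5 siblings from the list above.
                "Overrides": [
                    {"InstanceType": t}
                    for t in ["m5.large", "m5dn.large", "m5n.large",
                              "m5ad.large", "m5d.large"]
                ],
            },
            "InstancesDistribution": {
                "OnDemandPercentageAboveBaseCapacity": 0,  # all spot
                "SpotAllocationStrategy": "lowest-price",
                "SpotInstancePools": 5,  # spread across all five families
            },
        },
    )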


For service-type workloads (e.g. an API service with a 99.99% uptime SLA) we keep comparing on-demand vs. spot.

In reality, you'd also want to compare Reserved Instances, since you can get up to a 60% discount.

So in us-east-1:

- spot costs: 35-43% of on-demand

- RI 1 year standard: 60% of on-demand

- RI 3 year convertible: 46% of on-demand

So if you have some base load you can commit to running for 3 years, the price often lands in the spot range, without the worry of losing capacity.

In reality, combining reserved capacity for the base load (say 60%) with spot for the spiky remainder (40%) seems to be the winning combination for many companies; rough math below.
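Plugging the figures above in, the blend does land close to spot pricing:

    # Fractions of the on-demand price, from the figures above.
    ri_3yr_convertible = 0.46
    spot = 0.40  # midpoint of the 35-43% range

    # 60% base load on RIs, 40% spiky load on spot.
    blended = 0.60 * ri_3yr_convertible + 0.40 * spot
    print(f"blended cost: {blended:.0%} of on-demand")  # ~44%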


Good point, thanks! We're looking at Savings Plans and will probably use them too. But when you're doing prototypes it's pretty much impossible to commit to 1 year of usage, let alone 3 :)


You can still reserve a standard instance for one year, and when you no longer use it you can sell it on the Reserved Instance Marketplace.


It seems like Amazon has long been transitioning spot instances away from being a method of efficiently utilizing excess capacity, toward being a discounted service for less risk-averse businesses (businesses that can accept the risk of their service being terminated at any time).

Even if in practice AWS never sees large spikes in compute demand and the corresponding large-scale instance preemption, most businesses I've worked with won't accept the risk of having OLTP systems taken down at any time.

Spot no longer seems to be a service where one can get a bargain for compute-intensive offline/batch workloads, which are much more tolerant of preemption.

Given that spot prices seem to be very flat and preemption is rare, Amazon presumably has a fair bit of underutilized capacity. Does anyone know if Amazon uses this capacity itself, or offers more aggressive spot pricing to select clients?


Great article. Cutting ec2 costs is important, especially for companies with heavy data engineering / data science workflows. Spark service providers make it easy to spin up huge clusters (100+ nodes) to perform ad hoc analyses. The costs can quickly spiral out of control, even if you're getting 3x cost savings on the spot market.

Some tangential thoughts:

* Is there an AWS API that returns the cheapest availability zone in a region for a given instance type, or is the GUI screenshotted in the blog the only way to see it? (Sketch after this list.)

* I have seen the 90%+ cost savings for certain instance types

* Sometimes you lose a spot instance, look at the pricing history graph to confirm the price spike, and don't see any spike that was above your bid price... it can be frustrating
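On the first question: EC2 does expose spot price history via DescribeSpotPriceHistory, which can be folded down to the cheapest AZ; a boto3 sketch, where the instance type and region are arbitrary examples:

    from datetime import datetime, timedelta, timezone
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Per-AZ spot prices for the last hour; keep the latest entry per AZ.
    resp = ec2.describe_spot_price_history(
        InstanceTypes=["m5.large"],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    )

    latest = {}
    for entry in resp["SpotPriceHistory"]:
        az = entry["AvailabilityZone"]
        if az not in latest or entry["Timestamp"] > latest[az]["Timestamp"]:
            latest[az] = entry

    cheapest = min(latest.values(), key=lambda e: float(e["SpotPrice"]))
    print(cheapest["AvailabilityZone"], cheapest["SpotPrice"])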


I'm not aware of any API (surprising for AWS), but you can use Spot Fleet to get an array of spot instances optimized for cost.

Having a bid above the spot price does not guarantee you'll keep the spot instance. AWS can terminate a spot instance at any time if they need the capacity; that's the deal. It used to be more closely tied to your bid price, but they've been moving away from that.
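A minimal cost-optimized Spot Fleet request, for reference (the role ARN, AMI, and instance types are placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    # Role ARN and AMI are placeholders; lowestPrice picks the cheapest
    # of the offered types that can satisfy the target capacity.
    ec2.request_spot_fleet(
        SpotFleetRequestConfig={
            "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
            "AllocationStrategy": "lowestPrice",
            "TargetCapacity": 10,
            "LaunchSpecifications": [
                {"ImageId": "ami-0123456789abcdef0", "InstanceType": t}
                for t in ["m5.large", "m5d.large", "c5.large"]
            ],
        }
    )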


"Our backend system is built on AWS. Today I’m going to tell you how we had cut costs"

This is a recurring topic here on HN, and it boggles my mind and makes me wonder if people know that there are platforms other than AWS, Azure, and Google Cloud out there that are very capable and much, much cheaper.

Unless one of the big 3 has a feature or certification you need, I don't see any reason to use them at all, given the insane complexity and cost.

So if you or your company use any of the big 3, why do you use them if you had to cut costs at some point?


This poor argument always pops up. First, AWS and the big 3 are much more reliable. Second, AWS is not only a VPS provider. Changing vendors is not the answer to every cost-related problem.


It's less reliable than a guy running a website on a Raspberry Pi with a solar panel from Harbor Freight http://pi.qcontinuum.com/


This is not true at all. We run a large setup in OVH Gravelines and another in us-east-1 and we’ve not had a single outage in 3+ years with OVH. I can’t say the same for AWS.


You have been lucky with Gravelines; OVH had major outages lasting several hours in 2017 (Roubaix, total loss of routing for all 6 DCs) and in one case close to 24 hours (Strasbourg, a cascade of events resulting in total power failure http://status.ovh.net/?do=details&id=15162 ).

In that period (2015-2018) I used to run a fairly well-known French website on OVH, and their network was very unstable, from equipment failures to way too much fat-fingering of routes. If you're able to easily switch traffic between OVH and AWS, you're in a far better position than most people.


We stayed away from Roubaix given the experimental design; not sure why anyone would put production stuff there. I can't speak to Strasbourg, that sounds horrendous. But Gravelines has been rock solid, with 4ms latency to London.

But at the end of the day outages happen everywhere, including AWS. We also have some kit at Hetzner, and I think a redundant setup across OVH and Hetzner would be a fraction of the cost of a single-AZ setup in AWS and yield far greater uptime.

We’ve commoditized the servers and services (cattle vs pets)...why would we treat the providers any differently? Use cheap components and lots of redundancy.


> 1st AWS and big 3 is much more reliable.

I think that's a myth. People assume it's true because it should be true. I don't think it is true.

> 2nd AWS is not only a VPS provider.

It's a glorified VPS provider. Most of the stuff doesn't matter to most of the people using AWS, but they go for it so they can put it on their resume and because they don't want to get fired for choosing something that's not a big name.


EC2 might be a glorified VPS provider, but you seem to be ignoring the vast array of managed services in modern cloud providers (or unaware of their utility beyond padding resumes).

Load balancing, fault tolerance, high availability, arbitrary scale, messaging, orchestration, autoscaling, warehousing, big data processing, identity management, desktop management, secrets management, container registries, source code management, build tools, hardware test suites, gpu hardware, observability tools...

Those of us who use cloud providers know full well why we use them (and certainly know when not to).


If you really need any of that stuff you probably shouldn't use Amazon's managed version of it.


Why? I run both on prem and cloud workloads (on more than one provider) so I'm wondering what's wrong with Amazon's managed services?


> People assume it's true because it should be true. I don't think it is true.

It makes sense that bargain-bin providers would offer inferior reliability, but there are providers out there other than the big 3 cloud providers and the bargain-bin VPS providers.

GitHub for instance is apparently [0] hosted by Carpathia [1].

[0] https://github.com/holman/ama/issues/553

[1] http://www.carpathia.com/ (they should really fix https://www.carpathia.com )


There should be data available for #1 -- it's not something we should need much subjective discussion around.

Re #2: I love all the other stuff. I never use EC2. Lambda, Cognito, DynamoDB, S3, CloudFront, Route 53, and API Gateway are my default stack, managed by CloudFormation. Granted, I'm doing smaller projects, but the costs are minimal, the setup time is trivial, the documentation is excellent, and everything is nicely compartmentalized and 'just works' together. And I only pay for the actual traffic to the site.


This is the way. We've given multiple clients very similar stacks for "api layers", and it's by far the most cost-effective way to do AWS, from both a productivity and an opex point of view.


I can give you the reason we moved from Digital Ocean to Google Cloud: Spaces (object storage) was ridiculously slow and unreliable. There were several incidents per week where Spaces had bad connectivity or simply seemed to have crashed entirely, and that's not something we can deal with for production traffic. We now run a cheap CDN in front of it to cut down on the ridiculous bandwidth costs, but Google Cloud Storage has been very reliable and fast.


This. There is no alternative for high performance object storage outside of AWS S3 and Google Cloud Storage. Even Azure's offering is wonky.

Lots of providers claim to offer object storage, but try hitting them from a couple thousand cores and they all tend to immediately fall over.


There are options outside SaaS object storage though, e.g. running your own MinIO or Ceph, plus DAS (or a SAN) at your colo. Yes, it's some hassle, but if you have significant SaaS expenses it might be worth checking out.
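One point in favour of the S3-compatible route: client code barely changes. A sketch against a self-hosted MinIO, where the endpoint and credentials are placeholders:

    import boto3

    # The same boto3 S3 client, just pointed at a self-hosted MinIO
    # endpoint. Endpoint and credentials are placeholders.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://minio.internal:9000",
        aws_access_key_id="MINIO_ACCESS_KEY",
        aws_secret_access_key="MINIO_SECRET_KEY",
    )

    s3.put_object(Bucket="backups", Key="db.dump", Body=b"...")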


AWS S3 buckets have a limit of 5 Gb/s access to a single bucket.

Unless that limitation has changed in the last couple of years, I can easily make a system beat that, if that's the requirement... and ultimately it does come down to understanding requirements. :\

I think people generally forget that the cloud is just computers too; it's really not anything special, and Amazon/Google/Microsoft are solving the general case (and doing so well, actually), but this comes at a high premium.


It's 25Gbps now[1]. I have a rule of thumb: if I haven't refreshed something I think I know about AWS for more than half a year, it's probably out of date.

[1]: https://aws.amazon.com/premiumsupport/knowledge-center/s3-ma...


That's the throughput available to a single EC2 instance. S3 as a whole will do hundreds of Gbps, no problem.


That's not a very common use case though. Most companies don't need to DDoS their cloud provider.


25Gbps is 250 cable Internet users at 100 Mbps each downloading the release of your new software. What you call a DDoS is actually a woefully underwhelming instantaneous transfer rate.


If you're distributing releases (or anything, really) users should not have to go directly to your object store...


You should put your object storage behind a CDN if you expect that many downloads. That's much cheaper and faster.


I run my own Docker registry for my personal projects backed by DO Spaces. It was so slow that I couldn't actually believe it, and had to instrument the registry code extensively to account for all the time spent in the storage API, to see what the actual problem was. Turns out Spaces is really, really slow.


Is it that bad? We are thinking about bootstrapping a project which will use Spaces for file storage, and we will be using them frequently.


Provisioning of new Spaces was recently disabled in some regions while they add capacity. Vaguely concerning IMO and has caused me to seriously reconsider using DO Spaces for a new product I'm building.

https://www.digitalocean.com/docs/release-notes/upcoming/spa...


I do know that they have limited creation, and looking at the docs for limitations, it seems they have a 750 requests per second limit per Space, which might not scale well. My idea was to create multiple Spaces to distribute the load, e.g. sharding per customer (sketch below), but if the availability is not that good, I might have to reconsider our choice.

The main reason we wanted to work with DO is that the egress traffic cost is reasonable, unlike AWS or Google Cloud where it is questionably high.
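The per-customer sharding mentioned above could be as simple as a stable hash over a fixed set of Spaces; a sketch where the Space names are placeholders:

    import hashlib

    # Placeholder Space names; each Space gets its own slice of the
    # per-Space request limit.
    SPACES = ["files-shard-0", "files-shard-1", "files-shard-2", "files-shard-3"]

    def space_for_customer(customer_id: str) -> str:
        # Stable hash so a given customer always maps to the same Space.
        digest = hashlib.sha256(customer_id.encode()).digest()
        return SPACES[digest[0] % len(SPACES)]

    print(space_for_customer("acme-corp"))  # e.g. "files-shard-2"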


It was that bad 2 years ago, in the AMS3 region. I don't know how it is these days, but I expect after 2 years it has matured at least a little bit. They also now include a CDN, which might improve things a bit if it reduces the load on the buckets behind it.

Honestly it soured my opinion on DO a bit. I'd previously had a droplet there for several years, which never seemed to experience any issues at all.


In our case we cannot use the CDN, unfortunately, because the objects change a lot.


I tried it for a small personal project, just for somewhere to put some small images. On pages of about 30 images, usually one of them just wouldn't load. It didn't look reliable to me either.


I'm sure there are other services that are cheaper in some areas, but at least the argument about "insane complexity" is moot; if you've used AWS before, it's no problem to set something up from scratch. I've done it numerous times, and it's a one-day job to set up most of the required infrastructure: VPCs, IAM users, policies, databases, scaling groups, storage configs, etc. That part of it is not really an argument to switch, at least.


Tell me who/what I should migrate my Elastic Beanstalk/Aurora Serverless/S3 stack to in order to save significant costs?


The solution to saving costs is to go for cheaper cloud providers or run k8s on VPS's/colocated physical hardware.

I'm not sure where the idea came from that this is hard to do.

To answer your questions in order:

1) People warned aggressively about the lock-in of cloud providers with proprietary extensions, and you must have chosen not to listen; so I have little sympathy.

I'm not saying it's black and white, but I hope you got the velocity required to hit the market faster and made more than you spent, because this is the price you paid; now, to get out, you'll have to invest a little time cleaning house. That's the reality of lock-in.

2) S3 is pretty easy, as there are "S3-compatible" FOSS projects; min.io comes to mind, or Ceph with the RADOS Gateway, or Riak with the S3 plugin... there's also s3proxy with a multitude of backends.

3) Aurora Serverless is replaced by Knative.

4) Beanstalk is just classical servers with an auto-scaler component. Auto-scaling depends greatly on your provider, so I can't say how easy or hard it will be; if you're using Kubernetes, then understanding your load should at least be easy.


Now you have five more things you need to research, learn, keep track of, debug, etc. Plus, in my experience, anyone who says kubernetes is easy to run (cloud or native) has never had to run kubernetes in production.


Sure, then pay?

I don't understand the argument: "I want to outsource understanding but I want to save costs";

You can think of things as a spider-graph of three points:

Quality -- Low-Cost -- Knowledge-Required.

No solution can score high points on all three reasonably.

Anyway, things are a bit skewed because I'm an infrastructure type, and people in my profession really do think of systems administration tasks as being "very easy", soaking up nearly no time at all if done right; but developers don't like hearing that because sysadmins are "old world".

I don't really care if you're paying someone else's sysadmins or not; the fact remains that you're going to be spending something in that area, and if you balk at the cost of cloud then maybe taking ownership of what they do can help optimise costs.

Obviously they put a premium on their own time in these areas.


I disagree. Once you increase your knowledge of AWS and associated systems, you can decrease the cost of what you are doing through tips like those in this article. I don't quite get the point about optimizing cloud costs solely by switching to self-hosted? Of course you could do that, but you could also optimize costs by doing what the article says.

Full Disclosure: I work at AWS building tools to help customers do cost optimization.


Someone made a good reply to this and as I was replying to that they deleted it, so I'll copy it here:

----

> Have you actually done a cost-benefit analysis on some of these solutions?

Yes, I even gave a talk at google in stockholm about it.

For my use-case, hybrid was best, with no cloud lock-in aside from Google Storage Buckets (which can be replaced) but I went into detail about that in the talk.

> Take your Riak / S3 plugin. What do your servers cost to run that cluster?

Depends a lot, don't you think?

> How much time do you spend managing it?

Depends again; if it's anything like my Elasticsearch clusters then about 2 man-hours/month.

> How do you test your backups?

Continuously, and with alerting.. and, you should be doing this anyway.

> Are you going to target the same SLOs for durability that S3 offers?

Depends on the business; the whole point of an SLO is that you pay in proportion to what it's worth to the business.

> Do you run multi-data center for high availability?

Depends on SLO.

> In many cases the cheaper or self-hosted solutions have costs that you aren't accounting for.

Yes, physical machines often need some hand-holding and VPSes can have brownouts, but this is true of AWS's EC2 anyway.

Ultimately, this is where the cost increase will be, but defining it is important. I've deployed cloud and physical (as stated), and it's true that physical machines are not as problem-free as our GCE ones, but we pay about 50% less than the equivalent GCE instances, so it's "worth" spending time automating the unpredictable.

> Sometimes that's fine, but "just run it yourself" is as worthless as saying "just ship it to AWS" unless you actually think through the impact.

This is kind of the main point I always make.. understand your trade-offs, don't buy into proprietary tech. Cloud is a fantastic way to prototype and bootstrap but it's /usually/ better to have a migration plan to optimise costs in the future.

If you fail to take that into account then I don't have sympathy for you, because you put the project at risk. Financial inviability is a risk.


That was me ;) I deleted it because I read your second post and realized I'd completely misread you and actually agree almost in entirety.

> understand your trade-offs, don't buy into proprietary tech. Cloud is a fantastic way to prototype and bootstrap but it's /usually/ better to have a migration plan to optimise costs in the future.

Preach.


I don't think moving cloud providers is an effective way to solve cost problems, especially considering the engineering cost involved in making these migrations, which reduces the marginal utility of such a migration to zero, possibly even below zero.

If you're really looking to save on costs, hosting solutions (i.e. renting racks or some managed solution) are probably what you'd need to look at. There are other costs involved there as well (an infrastructure team, upfront capex, etc.), but it might still be worth it if you run, e.g., a ton of batch processing over a ton of data.


>So why do you or your company who uses any of the big 3 use them if you had to cut cost at some point?

Because it takes X hours of effort to cut costs and using another provider would have required Y hours of effort where Y >> X. Y may be greater due to reliability issues, missing features that you need to build yourself, training costs of new employees, etc.

edit: Also X is paid once you're succeeding, Y has to be paid before you're succeeding which makes Y even more expensive in opportunity cost.


It's also much easier to hire engineers to develop and maintain these platforms.


It's probably because when devs jump into the cloud they can be distracted by Shiny New Toys and cost is only considered once the first bill comes in.


Very useful article! Waiting for a Terraform best-practices follow-up.


Interesting article!


Very well written article!

If someone is interested in learning all the AWS concepts, here's an awesome e-book which is written by the legend Daniel Vassallo himself.

https://gumroad.com/a/238777459/MsVlG



