> Cloud solutions like elastic load balancing are generally engineered for the average use case. Think about the ways you are not like the average.
This might be the case for some things, but many of Google Cloud's "elastic/serverless" products (Load Balancers, App Engine, Pub/Sub, BigQuery, Datastore, etc.) can truly scale from nothing to a huge traffic spike without the need for pre-warming and the like.
Came here just to reiterate this. Google Cloud Load balancers don't need warming, manual creation of support tickets, etc.
Also, networking on GCP is vastly superior to AWS. The VPC is set up with sane defaults, and VM instances can communicate privately with other VM instances in any zone or region without needing complicated constructs such as VPNs or NAT instances.
Lastly, Google Cloud is generally cheaper (sustained-use discounts), higher performance, and takes all the lessons learned from AWS and applies them.
The one knock I have on GCP is billing. It is incredibly obtuse. Even the monthly billing invoices I still have a hard time digesting (especially if you have multiple projects under a single billing account).
> The one knock I have on GCP is billing. It is incredibly obtuse.
Really? We've found it to be much easier. The billing section has up-to-date cost breakouts and a simple list of transactions for each billing account. There are budget alerts, and the best part is the easy export into BigQuery, which then gives you SQL access to all of it. Add in Data Studio and you have BI-style analysis and visualizations too. Better than anything the other clouds have.
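To make that concrete, here's a rough sketch of what querying the export looks like from Python (the project/dataset/table name is made up -- it depends on how you set up the export -- and the columns follow the standard billing-export schema):

```python
# Sum this month's GCP cost per service from the billing export.
# Table name is hypothetical; substitute your own export table.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT service.description AS service, SUM(cost) AS total_cost
    FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`
    WHERE usage_start_time >= TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), MONTH)
    GROUP BY service
    ORDER BY total_cost DESC
"""

for row in client.query(query):  # iterating waits for the job to finish
    print(f"{row.service}: ${row.total_cost:.2f}")
```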
Any AWS users have a better experience with their newer "ALB" load balancer? It doesn't specifically call out any performance benefits over the legacy "ELB", but since it's much newer, I'm curious if it happens to do better.
Yes, although the administration UI is slightly more confusing. We had a performance issue with the ELB offering that couldn't be fixed because of the way ELB was designed. With ALB, that performance issue no longer exists.
When I read that your load balancer is not individually provisioned the way Amazon's is I was envious. I couldn't imagine migrating to a new cloud provider now but I'm really impressed with what Google has put together!
Google Cloud is great and cheap. But AWS is waaay ahead in terms of completeness. I've looked into this. The breadth, depth, and polish of AWS is really impressive. Their API Gateway -> Lambda container -> DataStore is amazing. I really wish I could use GCP. :(
Hopefully GCP will start pushing AWS to lower their prices. They charge so much.
Some of it is just documentation and marketing. If you go to the AWS API Gateway main page, it talks about serverless/Lambda integration right there. They also have specific pages that show how to use them together, including using Lambda as the endpoint, using it at a higher level to influence routing to a different endpoint, modifying requests, etc.
Contrast that with Google: you start with the confusion that there are two products (Cloud Endpoints, Apigee), and neither really talks about integration with Cloud Functions.
The API Gateway is somewhat limited in that the endpoints are tied to CloudFront. So, if you want to leverage it internally, versus connections from a user's web browser, the request leaves your VPC (and potentially goes quite a ways out) for no obvious reason. It would be nice if they decoupled API gateway from CloudFront.
If it's internal, you probably shouldn't be using the API Gateway. You can just call the underlying lambda directly through the SDK with invoke and save yourself the cost of the API Gateway, as well as be able to invoke it asynchronously should that be a better fit.
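A minimal sketch of that direct invoke with boto3 (the function name and payload here are made up):

```python
import json

import boto3

lam = boto3.client("lambda")

# Synchronous call: the response stream carries the function's return value.
resp = lam.invoke(
    FunctionName="my-internal-function",  # hypothetical name
    InvocationType="RequestResponse",
    Payload=json.dumps({"user_id": 42}),
)
print(json.loads(resp["Payload"].read()))

# Or fire-and-forget, if async is the better fit.
lam.invoke(
    FunctionName="my-internal-function",
    InvocationType="Event",
    Payload=json.dumps({"user_id": 42}),
)
```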
That said, it depends on your design/architecture as to whether that is appropriate. Sometimes you want that distinct separation. But if that's the case, you probably shouldn't be thinking about the low-latency case anyway, because you should treat it as a service outside of your control, one that may be (and probably should be) available cross-regionally, so the call might be crossing the country for all you know. (This presumes latency-based routing with custom DNS, of course, but for a standalone, client-facing service that's what you should be doing anyway.)
Obviously there may be a middle ground you want, but I'd say you probably either want to treat it as "your" API, in which case you can be as pragmatic as you want when accessing it, or as a reusable, completely separate service that you are a client of, in which case things like location cannot be guaranteed, and latency is a reality.
>If it's internal, you probably shouldn't be using the API Gateway
It's common enough that when I asked Amazon about it, they already had enough requests that it's on the roadmap.
People use api gateways for all sorts of things like orchestration, tokenizing sensitive data, caching requests, etc, that should have the same interface whether used internally or externally. Sure, I could recreate all that functionality for internal use, but that's one of the supposed benefits of an API gateway...turn what was code into configuration. See products like Apigee, Kong, Tyk, etc.
Per your OP, how do you plan on caching requests without it hitting Cloudfront in some way? :P
But I -did- mention there were caveats. You're correct that I should probably not have led with "You probably shouldn't" and -then- gotten more nuanced, though. While I agree that optimizing it to not leave the AWS network would be a benefit, if you've reached the point where you're calling it via the API Gateway internally, it's a separate service, and if it has -any- high-availability requirements, or ever will, you should not rely on it staying inside the network (since even without going multi-region with distributed DNS, the request may be handled by another AZ, introducing latency).
>Per your OP, how do you plan on caching requests without it hitting Cloudfront in some way?
Products that compete with Amazon's API gateway can do it, though you're right that their product cannot.
Edit: There are also AWS customers that use a different CDN...Akamai, Cloudflare, etc. Decoupling API gateway and Cloudfront would have benefit for them.
We recently used API Gateway to publish our first "public"-facing API for some of our partners to integrate with. We were already using Swagger to describe our internal APIs, and a subset of those APIs needed to be exposed to partners. We also wanted to enforce a different security model and leverage rate limiting on these requests to protect us from unexpected growth.
Integrating API Gateway into our deployment took a little bit of work, but we're happy to have it as part of our automated deployment process. We push updates of our Swagger spec with the AWS CLI to a separate API per environment (dev, staging, prod-west, and prod-east). Each environment has one API Gateway with several stages that we use for blue/green deployments. The stages are green, blue, and public. We deploy to the inactive color (green or blue), smoke test, and promote that stage to public. This simulates the same process we do in production, and the swap to public is akin to our DNS swap from one side to the other. We were very happy that it wasn't a huge time investment to fit this workflow into our deployment process.
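For anyone curious, the push itself is roughly equivalent to this boto3 sketch (we actually drive it through the AWS CLI; the API id, file name, and stage name here are made up):

```python
import boto3

apigw = boto3.client("apigateway")

# Overwrite the API definition from our Swagger spec...
with open("swagger.json", "rb") as f:
    apigw.put_rest_api(restApiId="abc123", mode="overwrite", body=f.read())

# ...then deploy it to the inactive color for smoke testing.
apigw.create_deployment(restApiId="abc123", stageName="blue")
```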
Somehow we missed the fact that, because API Gateway is tied to CloudFront, it does not support multi-regional failover. If we have a service outage in our primary region, we swap all of our traffic with DNS to our secondary region (from us-west-2 to us-east-1). In order to do a regional failover with API Gateway, we would need to deprovision the endpoint api.domain.com in the failing region and provision api.domain.com in our failover region. I'm expecting that if we have to do a regional failover, there is a very good chance the request to deprovision the endpoint will fail. See my AWS forum reply and others asking for the same feature: https://forums.aws.amazon.com/message.jspa?messageID=761925#...
My team is very much looking forward to this functionality. We have duplicated our infrastructure across multiple regions to mitigate a region outage. We were planning on relying on a dns change to route traffic to a new region.
With the limitations that you have described, our failover process is much more complicated and uncertain.
Given two regions, primary (us-west-2) and secondary (us-east-1): we will need to attempt to disassociate api.domain.com from the primary API Gateway and wait for that operation to complete. After that we must associate our secondary API Gateway with api.domain.com. It's very possible, even likely, that the issues on primary that caused us to fail over to secondary may prevent the disassociation from the primary. This is an extremely undesirable side effect, as it could leave us with unhandled requests or error responses to api.domain.com.
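For reference, the failover dance would look something like this with boto3 (domain, certificate ARN, API id, and stage are made up); the fragile step is the delete, which has to succeed in the region that is already failing:

```python
import boto3

primary = boto3.client("apigateway", region_name="us-west-2")
secondary = boto3.client("apigateway", region_name="us-east-1")

# Step 1: disassociate the custom domain from the failing primary region.
primary.delete_domain_name(domainName="api.domain.com")

# Step 2: recreate it in the secondary region and map it to that API.
secondary.create_domain_name(
    domainName="api.domain.com",
    certificateArn="arn:aws:acm:us-east-1:111111111111:certificate/example",
)
secondary.create_base_path_mapping(
    domainName="api.domain.com",
    restApiId="def456",
    stage="public",
)
```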
I wanted to post this here for others to suggest alternative workarounds or ideas. We will also follow up with a support ticket.
They are aware of the issue and it seems to be on their roadmap, but that could mean 1-2 years out. Thankfully this partner API doesn't have nearly the same SLA and HA requirements we have for other features of our SaaS product. If our SLA requirements were higher, we would need to consider a different solution. We have handwaved some ideas we could use if there was an extensive outage, which likely equates to standing up a reverse proxy that listens on api.domain.com and proxies traffic to the API Gateway URL for our secondary region. That certainly isn't ideal, and it doubles the latency we already have with API Gateway proxying requests to our internal API.
Nice post, and interesting to see your learning pretty much matches my own when building my SaaS business.
Curious as to why you didn't opt for RDS for the database side of things? I actually have about 5 side projects as well as my main SaaS, and they all share the same RDS instance for the database which saves me cost while keeping performance high.
Also, do you use Lambda much at all for periodic scripts that need to run? I've more recently been building more and more mini applets on Lambda for things like health checks or replicating data across Amazon services etc. with good results.
Also, and this might be a good complementary service to your monitoring service, but I've been using CloudWatch's timed tasks a lot recently to trigger some of those Lambda functions - very quick and easy. I've been using another service to monitor and report on missed triggers, but will look into Cronitor a bit more as a viable option.
If your use case fits S3 (eventually consistent updates are okay, no complex query model), you should almost always prefer it to RDS. S3 is usually cheaper than RDS: lots of updates may negate this, but with pure storage costs plus usage-based billing, most cases come out cheaper. (If your DB load is so low that you can share an instance between projects, and your query model is simple enough that S3 could serve it, you might try running the numbers.) It also scales transparently, and effectively infinitely. RDS requires downtime to scale up, and only goes so far.
The general suggestion, from AWS themselves, is to try to fit things into S3 first, and only go for RDS (or DynamoDB) if it makes sense. Barring a clear indicator, S3 will almost always be cheaper, and has no scaling concerns. That's also been our experience: defaulting to S3 has been the cheapest and required the least maintenance from us, and only when we came up with a good reason (generally a need for complex or high-performance queries) did it ever make sense to reach for RDS or DynamoDB.
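As a sketch of what "fitting it into S3" means in practice, here's the kind of simple key-value model it serves well (bucket name and key layout are made up):

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-app-data"  # hypothetical bucket

def save_user(user):
    # One object per record, addressed purely by key.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"users/{user['id']}.json",
        Body=json.dumps(user),
    )

def load_user(user_id):
    obj = s3.get_object(Bucket=BUCKET, Key=f"users/{user_id}.json")
    return json.loads(obj["Body"].read())

save_user({"id": 42, "plan": "pro"})
print(load_user(42))
```

The moment you need secondary-index-style queries instead of key lookups, that's the clear indicator to reach for RDS or DynamoDB.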
If you actually have an in-house solution to providing better availability than amazon's RDS options, then I recommend you market your engineering solution to the major cloud providers as they will pay you very well to help them implement your solution to that problem.
> I recommend you market your engineering solution
I saw exactly that company at the last re:Invent. Can't remember what the name was...
> to the major cloud providers as they will pay you very well to help them implement your solution to that problem.
My guess is that most reasonably sized organizations manage their own databases for uptime-critical systems (case in point). It lets them have more control over backups, when to upgrade the database, how to handle failover, etc.
Can you link to more info on this? Do you mean using libmariadb, or some other driver (like a JDBC one)? What's the advantage over MySQL's libmysqlclient?
A few months ago my co-founder August did a great Indie Hackers interview https://www.indiehackers.com/businesses/cronitor and we heard a lot of positive feedback on sharing revenue and growth numbers. On that theme, we thought we'd write a blog post to give a little more color to our largest expense, AWS.
The only thing I didn't like was "spend" in the title – "costs" or "expenses" would be less pretentious, to my ears and in my mind. This is a really trivial quibble! [And apparently this is a really old word in English anyways][1]. Now I feel like a grumpy old person.
Not to quibble with your quibble (but here I go anyway)... The term "Marketing Spend of X" or "Operating Spend of Y" is fairly common business-analysis-speak. At least in certain regions of the United States. IIRC, I also heard it used by Telstra and ANZ execs during some projects I had to take care of in Australia.
Great post. It's good to see an enterprise smaller than Segment [0] talk about how complex AWS pricing is.
You mention you use SQS to queue incoming metrics. What server/framework/language are you using to do this? IME using SQS requires tools with good parallelism to deal with its high-latency/high-throughput performance characteristics.
I'm a bit curious about the cost/revenue ratio. Is 12.5% considered a reasonable ratio? I just checked our spend, which includes not only hosting but other IT-related spend (helpdesk, email, Slack, email marketing service, etc.). I think ours is around the 3-4% mark compared to revenue.
Now I'm not trying to show off or say we're doing things better. If anything, maybe this post can help me convince my cofounders that we should spend more on infrastructure hosting (there's lots of things we could improve). I'm mostly curious what's a typical ratio in other smallish startups.
Cronitor co-founder here. This is a great question! I would love to see others post some numbers here.
As Shane mentioned in the post, 12.5% has been a consistent number for us as we have scaled up over the past couple of years. That said, at this point I suspect we will see this percentage decrease a bit going forward. I'm basing this on the intuition that we're a bit over-provisioned at the moment and won't have to scale our infrastructure linearly with user growth over the next year or two. Of course, that remains to be seen...
To add some more numbers to the conversation: my full-time job is CTO at a consumer-facing tech company (Babylist.com). I just looked at our spending on IT/infrastructure for the last couple of months and it's around 1.5-2% of revenue.
I have a side project that seems stuck in the mid six figures. My 6-month average spend for AWS is 4.62%. However, that percentage is not linear; it would jump a tier around $0.85MM.
We're using mostly DO, but to be honest, I'm not sure moving to AWS or Google Cloud would shift the ratio considerably. It might double our hosting costs, but hosting probably isn't our biggest spend when looking at overall IT.
To give an example, I think we spend a similar amount or even more on customer.io (email marketing automation) than we do on our entire hosting. You could argue that this cost is a marketing cost rather than technology though. But currently we still attribute it to IT.
I love AWS and for any startup that I expected to need to scale quickly I would choose it in a heartbeat.
However, when it comes to low-traffic side projects and experiments where costs of a few hundred dollars matter, I prefer Linode or DigitalOcean with Ansible provisioning and B2 for storage. It will cost you more time for sure, but it will give you better performance at the low end, which means you can break even sooner. If it takes off you can always migrate to AWS, GCP, or Azure later.
GCloud recently updated their free-tier (https://cloud.google.com/free/). They provide $300 credit which can be used for any product in GCP, and some services when used within quota are always free. So you might want to give GCP a try for your low-traffic side projects.
Dumb question: is there a way to show the total AWS costs you're incurring in the past second/minute/hour?
The finest granularity I can find in the AWS web console is per day; many times I've butterfingered an input and only caught it a few days after the fact due to the unexpected bill increase.
AWS supports publishing detailed billing information to an S3 bucket on an hourly frequency with excellent granularity as to which resource and operation is driving spend.
Unfortunately, to make meaningful use of this you pretty much have to roll your own infra to download it, load it and analyze it. :(
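For anyone who wants a starting point, the roll-your-own version can be as small as pulling the CSV and summing cost per product (bucket, key, and column names here are assumptions; check the header row of your own report, since the detailed billing formats differ):

```python
import csv
import io

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="my-billing-bucket",  # hypothetical
    Key="123456789012-aws-billing-detailed-line-items-2017-09.csv",
)
rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

# Aggregate cost per product line.
totals = {}
for row in rows:
    product = row.get("ProductName") or "(unattributed)"
    totals[product] = totals.get(product, 0.0) + float(row.get("UnBlendedCost") or 0)

for product, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{product}: ${cost:.2f}")
```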
Hey, would love to understand this a bit more and see what a solution that is valuable to you looks like. I wasn't able to find contact details on your profile, so if you'd be up for discussing this a bit further, you can find my email in my profile.
You may be able to set up a CloudWatch alarm and monitor the dashboard there? I haven't played with it for billing, but have done so for Lambda executions. I think the granularity can go down to around 5 minutes, IIRC.
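If anyone wants to try it, a billing alarm is a few lines of boto3 (the threshold and SNS topic are made up; the EstimatedCharges metric lives in us-east-1 and requires billing alerts to be enabled on the account first):

```python
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")
cw.put_metric_alarm(
    AlarmName="monthly-spend-over-50-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,  # the metric only updates every few hours anyway
    EvaluationPeriods=1,
    Threshold=50.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111111111111:billing-alerts"],
)
```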
Thanks for the tip. I enabled CloudWatch for billing, and I'm getting something, but it's not pretty: http://i.imgur.com/SO2dJzd.png
No matter what graph format I choose (between line and stacked area), all I'm getting is a single dot at 00:00 AM today for $0.35, which is the correct amount for cumulative spending this month. But it doesn't seem to be updating or graphing it correctly.
Yeah, I think even from my experience with their other services, that graph can be a bit misleading - even though it shows the timeline in 5 min increments, I think the actual refresh in the back end is a lot longer than that. No idea with their Billing system, but I think even Lambda doesn't get updated more than once every 15 minutes or so.
Reserved instances are never really good starting out. I think they're better used when you understand your setup and it has been working for years, or if you are migrating a service whose on-premises load you already understand.
Nice post indeed. A few cents from our experience at ActOnCloud: cloud-first and cloud-only companies will often face similar issues in the future. Cloud is a one-way trip. Unless one takes care of vendor lock-in and prepares the team for better governance and financial control, it's going to be a nightmare.
With cloud providers pushing for serverless, it will be even darker: there is no way to get control; you can just hope that everything will be greener.
We run our own MySQL and Redis on an EC2 instance. Originally it was due to cost -- you are essentially charged per instance when using these services, and we could run them both from a single instance. Today, it's really due to not wanting or needing to do the migration.
I will add, we've paid for it when we've had to do several manual DB upgrades in that time:
- Original m3.medium on non-provisioned IOPS
- Upgrade to m4.large
- Upgrade to an EBS volume with provisioned IOPS
- Upgrade to m4.xlarge
These are easy enough, but not nearly as easy as an RDS upgrade is.
You should only use T2 instances within an auto-scaling group.
Could someone please elaborate?
I always thought this was a horrible idea, because upon CPU credit exhaustion the AWS metrics show CPU utilisation including the burst. For example, if you run out of CPU credits on a t2.micro, AWS will show around 15% while the instance itself will show 100%.
We've been using t2 RDS instances in production for a year for some stuff at work. We used CloudWatch to monitor the CPU credit balance, moved from t2.small to t2.medium, and fixed some queries that used too much CPU. Still running strong. Never had downtime.
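The monitoring side is easy to replicate; something like this sketch (the instance identifier is made up; for plain EC2 t2 instances the same metric lives under AWS/EC2 with an InstanceId dimension):

```python
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")
stats = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```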
> You pay by number of requests, not number of messages, so it reduces costs and makes it significantly faster to gulp down messages.
Are there any other technical changes you've made to your application code specifically in response to AWS costs? Any other recommendations when designing a new application?
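For context, the request-level batching being quoted presumably looks something like this with boto3 (queue URL is made up): one receive call, and thus one billed request, pulls up to ten messages, and long polling keeps empty receives cheap.

```python
import boto3

QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/111111111111/metrics"  # hypothetical

sqs = boto3.client("sqs")

def process(body):
    print("got message:", body)  # stand-in for real handling

resp = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,  # batch: up to 10 messages per billed request
    WaitTimeSeconds=20,      # long poll instead of hammering the API
)
for msg in resp.get("Messages", []):
    process(msg["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```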