> Cloud solutions like elastic load balancing are generally engineered for the average use case. Think about the ways you are not like the average.
This might be the case for some things, but many of Google Cloud's "elastic/serverless" products (Load Balancers, App Engine, Pub/Sub, BigQuery, Datastore, etc.) can truly scale from nothing to a huge traffic spike without the need for pre-warming and the like.
Came here just to reiterate this. Google Cloud Load balancers don't need warming, manual creation of support tickets, etc.
Also, networking on GCP is vastly superior to AWS. The VPC is set up with sane defaults, and VM instances can communicate privately with other VM instances in any zone or region without needing complicated constructs such as VPNs or NAT instances.
Lastly, Google Cloud is generally cheaper (sustained-use discounts), higher performance, and takes all the lessons learned from AWS and applies them.
The one knock I have on GCP is billing. It is incredibly obtuse. Even the monthly billing invoices I still have a hard time digesting (especially if you have multiple projects under a single billing account).
> The one knock I have on GCP is billing. It is incredibly obtuse.
Really? We've found it to be much easier. The billing section has up-to-date cost breakouts and a simple list of transactions for each billing account. There are budget alerts, and the best part is the easy export into BigQuery, which then gives you SQL access to all of it. Add in Data Studio and you have BI-style analysis and visualizations too. Better than anything the other clouds have.
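To make that concrete, here's a rough sketch of what querying the export looks like from Python (the project/dataset/table name is made up -- it depends on how you set up the export -- and the columns follow the standard billing-export schema):

```python
# Sum this month's GCP cost per service from the billing export.
# Table name is hypothetical; substitute your own export table.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT service.description AS service, SUM(cost) AS total_cost
    FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`
    WHERE usage_start_time >= TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), MONTH)
    GROUP BY service
    ORDER BY total_cost DESC
"""

for row in client.query(query):  # iterating waits for the job to finish
    print(f"{row.service}: ${row.total_cost:.2f}")
```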
Any AWS users have a better experience with their newer "ALB" load balancer? It doesn't specifically call out any performance benefits over the legacy "ELB", but since it's much newer, I'm curious if it happens to do better.
Yes, although the administration UI is slightly more confusing. We had a performance issue with the ELB offering that couldn't be fixed because of the way ELB was designed. With ALB, that performance issue no longer exists.
When I read that your load balancer is not individually provisioned the way Amazon's is I was envious. I couldn't imagine migrating to a new cloud provider now but I'm really impressed with what Google has put together!
Google Cloud is great and cheap. But AWS is waaay ahead in terms of completeness. I've looked into this. The breadth, depth, and polish of AWS is really impressive. Their API Gateway -> Lambda container -> DataStore is amazing. I really wish I could use GCP. :(
Hopefully GCP will start pushing AWS to lower their prices. They charge so much.
Some of it is just documentation and marketing. If you go to the AWS API Gateway main page, it talks about serverless/Lambda integration right there. They also have specific pages that show how to use them together, including using Lambda as the endpoint, using it at a higher level to influence routing to a different endpoint, modifying requests, etc.
Contrast that with Google: you start with the confusion that there are two products (Cloud Endpoints, Apigee), and neither really talks about integration with Cloud Functions.
The API Gateway is somewhat limited in that the endpoints are tied to CloudFront. So, if you want to leverage it internally, versus connections from a user's web browser, the request leaves your VPC (and potentially goes quite a ways out) for no obvious reason. It would be nice if they decoupled API gateway from CloudFront.
If it's internal, you probably shouldn't be using the API Gateway. You can just call the underlying lambda directly through the SDK with invoke and save yourself the cost of the API Gateway, as well as be able to invoke it asynchronously should that be a better fit.
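A minimal sketch of that direct invoke with boto3 (the function name and payload here are made up):

```python
import json

import boto3

lam = boto3.client("lambda")

# Synchronous call: the response stream carries the function's return value.
resp = lam.invoke(
    FunctionName="my-internal-function",  # hypothetical name
    InvocationType="RequestResponse",
    Payload=json.dumps({"user_id": 42}),
)
print(json.loads(resp["Payload"].read()))

# Or fire-and-forget, if async is the better fit.
lam.invoke(
    FunctionName="my-internal-function",
    InvocationType="Event",
    Payload=json.dumps({"user_id": 42}),
)
```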
That said, it depends on your design/architecture as to whether that is appropriate. Sometimes you want that distinct separation. But if that's the case, you probably shouldn't be thinking about the low-latency case anyway, because you should treat it as a service outside of your control, one that may be (and probably should be) available cross-regionally, so the call might be crossing the country for all you know. (This presumes latency-based routing with custom DNS, of course, but for a standalone, client-facing service that's what you should be doing anyway.)
Obviously there may be a middle ground you want, but I'd say you probably either want to treat it as "your" API, in which case you can be as pragmatic as you want when accessing it, or as a reusable, completely separate service that you are a client of, in which case things like location cannot be guaranteed, and latency is a reality.
>If it's internal, you probably shouldn't be using the API Gateway
It's common enough that when I asked Amazon about it, they already had enough requests that it's on the roadmap.
People use api gateways for all sorts of things like orchestration, tokenizing sensitive data, caching requests, etc, that should have the same interface whether used internally or externally. Sure, I could recreate all that functionality for internal use, but that's one of the supposed benefits of an API gateway...turn what was code into configuration. See products like Apigee, Kong, Tyk, etc.
Per your OP, how do you plan on caching requests without it hitting Cloudfront in some way? :P
But I -did- mention there were caveats. You're correct that I should probably not have led with "You probably shouldn't" and -then- gotten more nuanced, though. While I agree that optimizing it to not leave the AWS network would be a benefit, if you've reached the point where you're calling it via the API Gateway internally, it's a separate service, and if it has -any- high-availability requirements, or ever will, you should not rely on it staying inside the network (since even without going multi-region with distributed DNS, the request may be handled by another AZ, introducing latency).
>Per your OP, how do you plan on caching requests without it hitting Cloudfront in some way?
Products that compete with Amazon's API gateway can do it, though you're right that their product cannot.
Edit: There are also AWS customers that use a different CDN...Akamai, Cloudflare, etc. Decoupling API gateway and Cloudfront would have benefit for them.
We recently used API Gateway to publish our first "public"-facing API for some of our partners to integrate with. We were already using Swagger to describe our internal APIs, and a subset of those APIs needed to be exposed to partners. We also wanted to enforce a different security model and leverage rate limiting on these requests to protect us from unexpected growth.
Integrating API Gateway into our deployment took a little bit of work, but we're happy to have it as part of our automated deployment process. We push updates of our Swagger spec with the AWS CLI to a separate API per environment (dev, staging, prod-west, and prod-east). Each environment has one API Gateway with several stages that we use for blue/green deployments. The stages are green, blue, and public. We deploy to the inactive color (green or blue), smoke test, and promote that stage to public. This simulates the same process we do in production, and the swap to public is akin to our DNS swap from one side to the other. We were very happy that it wasn't a huge time investment to fit this workflow into our deployment process.
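For anyone curious, the push itself is roughly equivalent to this boto3 sketch (we actually drive it through the AWS CLI; the API id, file name, and stage name here are made up):

```python
import boto3

apigw = boto3.client("apigateway")

# Overwrite the API definition from our Swagger spec...
with open("swagger.json", "rb") as f:
    apigw.put_rest_api(restApiId="abc123", mode="overwrite", body=f.read())

# ...then deploy it to the inactive color for smoke testing.
apigw.create_deployment(restApiId="abc123", stageName="blue")
```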
Somehow we missed the fact that, because API Gateway is tied to CloudFront, it does not support multi-regional failover. If we have a service outage in our primary region, we swap all of our traffic with DNS to our secondary region (from us-west-2 to us-east-1). In order to do a regional failover with API Gateway, we would need to deprovision the endpoint api.domain.com in the failing region and provision api.domain.com in our failover region. I'm expecting that if we have to do a regional failover, there is a very good chance the request to deprovision the endpoint will fail. See my AWS forum reply and others asking for the same feature: https://forums.aws.amazon.com/message.jspa?messageID=761925#...
My team is very much looking forward to this functionality. We have duplicated our infrastructure across multiple regions to mitigate a region outage. We were planning on relying on a dns change to route traffic to a new region.
With the limitations that you have described, our failover process is much more complicated and uncertain.
Given two regions, primary (us-west-2) and secondary (us-east-1): we will need to attempt to disassociate api.domain.com from the primary API Gateway and wait for that operation to complete. After that we must associate our secondary API Gateway with api.domain.com. It's very possible, even likely, that the issues on primary that caused us to fail over to secondary may prevent the disassociation from the primary. This is an extremely undesirable side effect, as it could leave us with unhandled requests or error responses to api.domain.com.
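For reference, the failover dance would look something like this with boto3 (domain, certificate ARN, API id, and stage are made up); the fragile step is the delete, which has to succeed in the region that is already failing:

```python
import boto3

primary = boto3.client("apigateway", region_name="us-west-2")
secondary = boto3.client("apigateway", region_name="us-east-1")

# Step 1: disassociate the custom domain from the failing primary region.
primary.delete_domain_name(domainName="api.domain.com")

# Step 2: recreate it in the secondary region and map it to that API.
secondary.create_domain_name(
    domainName="api.domain.com",
    certificateArn="arn:aws:acm:us-east-1:111111111111:certificate/example",
)
secondary.create_base_path_mapping(
    domainName="api.domain.com",
    restApiId="def456",
    stage="public",
)
```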
I wanted to post this here for others to suggest alternative workarounds or ideas. We will also follow up with a support ticket.
They are aware of the issue and it seems to be on their roadmap, but that could mean 1-2 years out. Thankfully this partner API doesn't have nearly the same SLA and HA requirements we have for other features of our SaaS product. If our SLA requirements were higher, we would need to consider a different solution. We have handwaved some ideas we could use if there was an extensive outage, which likely equates to standing up a reverse proxy that listens on api.domain.com and proxies traffic to the API Gateway URL for our secondary region. That certainly isn't ideal, and it doubles the latency we already have with API Gateway proxying requests to our internal API.
Nice post, and interesting to see your learning pretty much matches my own when building my SaaS business.
Curious as to why you didn't opt for RDS for the database side of things? I actually have about 5 side projects as well as my main SaaS, and they all share the same RDS instance for the database which saves me cost while keeping performance high.
Also, do you use Lambda much at all for periodic scripts that need to run? I've more recently been building more and more mini applets on Lambda for things like health checks or replicating data across Amazon services etc. with good results.
Also, and this might be a good complementary service to your monitoring service, but I've been using CloudWatch's timed tasks a lot recently to trigger some of those Lambda functions - very quick and easy. I've been using another service to monitor and report on missed triggers, but will look into Cronitor a bit more as a viable option.
If your use case fits S3 (eventually consistent updates are okay, no complex query model), you should almost always prefer it to RDS. S3 is usually cheaper than RDS: lots of updates may negate this, but with pure storage costs plus usage-based billing, most cases come out cheaper. (If your DB load is so low that you can share an instance between projects, and your query model is simple enough that S3 could serve it, you might try running the numbers.) It also scales transparently, and effectively infinitely. RDS requires downtime to scale up, and only goes so far.
The general suggestion, from AWS themselves, is to try to fit things into S3 first, and only go for RDS (or DynamoDB) if it makes sense. Barring a clear indicator, S3 will almost always be cheaper, and has no scaling concerns. That's also been our experience: defaulting to S3 has been the cheapest and required the least maintenance from us, and only when we came up with a good reason (generally a need for complex or high-performance queries) did it ever make sense to reach for RDS or DynamoDB.
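As a sketch of what "fitting it into S3" means in practice, here's the kind of simple key-value model it serves well (bucket name and key layout are made up):

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-app-data"  # hypothetical bucket

def save_user(user):
    # One object per record, addressed purely by key.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"users/{user['id']}.json",
        Body=json.dumps(user),
    )

def load_user(user_id):
    obj = s3.get_object(Bucket=BUCKET, Key=f"users/{user_id}.json")
    return json.loads(obj["Body"].read())

save_user({"id": 42, "plan": "pro"})
print(load_user(42))
```

The moment you need secondary-index-style queries instead of key lookups, that's the clear indicator to reach for RDS or DynamoDB.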
If you actually have an in-house solution to providing better availability than amazon's RDS options, then I recommend you market your engineering solution to the major cloud providers as they will pay you very well to help them implement your solution to that problem.
> I recommend you market your engineering solution
I saw exactly that company at the last re:Invent. Can't remember what the name was...
> to the major cloud providers as they will pay you very well to help them implement your solution to that problem.
My guess is that most reasonably sized organizations manage their own databases for uptime-critical systems (case in point). It lets them have more control over backups, when to upgrade the database, how to handle failover, etc.
Can you link to more info on this? Do you mean using libmariadb, or some other driver (like a JDBC one)? What's the advantage over MySQL's libmysqlclient?
A few months ago my co-founder August did a great Indie Hackers interview https://www.indiehackers.com/businesses/cronitor and we heard a lot of positive feedback on sharing revenue and growth numbers. On that theme, we thought we'd write a blog post to give a little more color to our largest expense, AWS.
The only thing I didn't like was "spend" in the title – "costs" or "expenses" would be less pretentious, to my ears and in my mind. This is a really trivial quibble! [And apparently this is a really old word in English anyways][1]. Now I feel like a grumpy old person.
Not to quibble with your quibble (but here I go anyway)... The term "Marketing Spend of X" or "Operating Spend of Y" is fairly common business-analysis-speak. At least in certain regions of the United States. IIRC, I also heard it used by Telstra and ANZ execs during some projects I had to take care of in Australia.
Great post. It's good to see an enterprise smaller than Segment [0] talk about how complex AWS pricing is.
You mention you use SQS to queue incoming metrics. What server/framework/language are you using to do this? IME using SQS requires tools with good parallelism to deal with its high-latency/high-throughput performance characteristics.
I'm a bit curious about the cost/revenue ratio. Is 12.5% considered a reasonable ratio? I just checked our spend, which includes not only hosting but other IT-related spend (helpdesk, email, Slack, email marketing service, etc.). I think ours is around the 3-4% mark compared to revenue.
Now I'm not trying to show off or say we're doing things better. If anything, maybe this post can help me convince my cofounders that we should spend more on infrastructure hosting (there's lots of things we could improve). I'm mostly curious what's a typical ratio in other smallish startups.
Cronitor co-founder here. This is a great question! I would love to see others post some numbers here.
As Shane mentioned in the post, 12.5% has been a consistent number for us as we have scaled up over the past couple of years. That said, at this point I suspect we will see this percentage decrease a bit going forward. I'm basing this on the intuition that we're a bit over-provisioned at the moment and won't have to scale our infrastructure linearly with user growth over the next year or two. Of course, that remains to be seen...
To add some more numbers to the conversation: my full-time job is CTO at a consumer-facing tech company (Babylist.com). I just looked at our spending on IT/infrastructure for the last couple of months and it's around 1.5-2% of revenue.
I have a side project that seems stuck in the mid six figures. My 6-month average spend for AWS is 4.62%. However, that percentage is not linear; it would jump a tier around $0.85MM.
We're using mostly DO, but to be honest, I'm not sure moving to AWS or Google Cloud would shift the ratio considerably. It might double our hosting costs, but hosting probably isn't our biggest spend when looking at overall IT.
To give an example, I think we spend a similar amount or even more on customer.io (email marketing automation) than we do on our entire hosting. You could argue that this cost is a marketing cost rather than technology though. But currently we still attribute it to IT.
I love AWS and for any startup that I expected to need to scale quickly I would choose it in a heartbeat.
However, when it comes to low-traffic side projects and experiments where costs of a few hundred dollars matter, I prefer Linode or DigitalOcean with Ansible provisioning and B2 for storage. It will cost you more time for sure, but it will give you better performance at the low end, which means you can break even sooner. If it takes off you can always migrate to AWS, GCP, or Azure later.
GCloud recently updated their free-tier (https://cloud.google.com/free/). They provide $300 credit which can be used for any product in GCP, and some services when used within quota are always free. So you might want to give GCP a try for your low-traffic side projects.
Dumb question: is there a way to show the total AWS costs you're incurring in the past second/minute/hour?
The finest granularity I can find in the AWS web console is per day; many times I've butterfingered an input and only caught it a few days after the fact due to the unexpected bill increase.
AWS supports publishing detailed billing information to an S3 bucket on an hourly frequency with excellent granularity as to which resource and operation is driving spend.
Unfortunately, to make meaningful use of this you pretty much have to roll your own infra to download it, load it and analyze it. :(
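For anyone who wants a starting point, the roll-your-own version can be as small as pulling the CSV and summing cost per product (bucket, key, and column names here are assumptions; check the header row of your own report, since the detailed billing formats differ):

```python
import csv
import io

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="my-billing-bucket",  # hypothetical
    Key="123456789012-aws-billing-detailed-line-items-2017-09.csv",
)
rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

# Aggregate cost per product line.
totals = {}
for row in rows:
    product = row.get("ProductName") or "(unattributed)"
    totals[product] = totals.get(product, 0.0) + float(row.get("UnBlendedCost") or 0)

for product, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{product}: ${cost:.2f}")
```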
Hey, would love to understand this a bit more and see what a solution that is valuable to you looks like. I wasn't able to find contact details on your profile, so if you'd be up for discussing this a bit further, you can find my email in my profile.
You may be able to set up a CloudWatch alarm and monitor the dashboard there? I haven't played with it for billing, but have done so for Lambda executions. I think the granularity can go down to around 5 minutes, IIRC.
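If anyone wants to try it, a billing alarm is a few lines of boto3 (the threshold and SNS topic are made up; the EstimatedCharges metric lives in us-east-1 and requires billing alerts to be enabled on the account first):

```python
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")
cw.put_metric_alarm(
    AlarmName="monthly-spend-over-50-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,  # the metric only updates every few hours anyway
    EvaluationPeriods=1,
    Threshold=50.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111111111111:billing-alerts"],
)
```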
Thanks for the tip. I enabled CloudWatch for billing, and I'm getting something, but it's not pretty: http://i.imgur.com/SO2dJzd.png
No matter what graph format I choose (between line and stacked area), all I'm getting is a single dot at 00:00 AM today for $0.35, which is the correct amount for cumulative spending this month. But it doesn't seem to be updating or graphing it correctly.
Yeah, I think even from my experience with their other services, that graph can be a bit misleading - even though it shows the timeline in 5 min increments, I think the actual refresh in the back end is a lot longer than that. No idea with their Billing system, but I think even Lambda doesn't get updated more than once every 15 minutes or so.
Reserved instances are never really good starting out. I think they're better used when you understand your setup and it has been working for years, or if you are migrating a service whose on-premises load you already understand.
Nice post indeed. A few cents from our experience at ActOnCloud: cloud-first and cloud-only companies will often face similar issues in the future. Cloud is a one-way trip. Unless one takes care of vendor lock-in and prepares the team for better governance and financial control, it's going to be a nightmare.
With cloud providers pushing for serverless, it will be even darker: there is no way to get control; you can just hope that everything will be greener.
We run our own MySQL and Redis on an EC2 instance. Originally it was due to cost -- you are essentially charged per instance when using these services, and we could run them both from a single instance. Today, it's really due to not wanting or needing to do the migration.
I will add, we've paid for it when we've had to do several manual DB upgrades in that time:
- Original m3.medium on non-provisioned IOPS
- Upgrade to m4.large
- Upgrade to an EBS volume with provisioned IOPS
- Upgrade to m4.xlarge
These are easy enough, but not nearly as easy as an RDS upgrade is.
You should only use T2 instances within an auto-scaling group.
Could someone please elaborate?
I always thought this was a horrible idea, because upon CPU credit exhaustion the AWS metrics show CPU utilisation including the burst. For example, if you run out of CPU credits on a t2.micro, AWS will show around 15% while the instance itself will show 100%.
We've been using t2 RDS instances in production for a year for some stuff at work. We used CloudWatch to monitor the CPU credit balance, moved from t2.small to t2.medium, and fixed some queries that used too much CPU. Still running strong. Never had downtime.
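The monitoring side is easy to replicate; something like this sketch (the instance identifier is made up; for plain EC2 t2 instances the same metric lives under AWS/EC2 with an InstanceId dimension):

```python
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")
stats = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```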
> You pay by number of requests, not number of messages, so it reduces costs and makes it significantly faster to gulp down messages.
Are there any other technical changes you've made to your application code specifically in response to AWS costs? Any other recommendations when designing a new application?
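For context, the request-level batching being quoted presumably looks something like this with boto3 (queue URL is made up): one receive call, and thus one billed request, pulls up to ten messages, and long polling keeps empty receives cheap.

```python
import boto3

QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/111111111111/metrics"  # hypothetical

sqs = boto3.client("sqs")

def process(body):
    print("got message:", body)  # stand-in for real handling

resp = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,  # batch: up to 10 messages per billed request
    WaitTimeSeconds=20,      # long poll instead of hammering the API
)
for msg in resp.get("Messages", []):
    process(msg["Body"])
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```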