Hmm I would like to hear more about these problems that prevent people from spin...

krallin · on June 4, 2013

In the past, when an AZ went down, everyone tried to move to another AZ at once and made lots of API calls.

Consequently, AWS had to rate-limit API requests to avoid taking another AZ down.

Maybe this has changed, but we'll probably have to wait for another disaster to know!
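
If you do get rate-limited in that situation, the usual client-side mitigation is to retry with exponential backoff plus jitter rather than hammering the API. Here's a rough sketch of the idea, not anything AWS-specific: `ThrottledError` and `call_with_backoff` are names I made up for illustration, and `api_call` stands in for whatever SDK call you'd actually be making.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 'request limit exceeded' API error."""


def call_with_backoff(api_call, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Retry api_call on ThrottledError, backing off exponentially with jitter."""
    for attempt in range(max_attempts):
        try:
            return api_call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the throttling error
            # Upper bound doubles each attempt: base, 2*base, 4*base, ...
            # capped at max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random time in [0, cap] so that many
            # clients failing over at once don't all retry in lockstep.
            sleep(random.uniform(0, cap))
```

The jitter is the part that matters here: if everyone retries on the same schedule, the retries arrive in synchronized waves and you just get throttled again.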

---

But of course, an AZ shouldn't be your infrastructure's single point of failure!

jd007 · on June 4, 2013

I'm aware of the API rate limiting when an AZ goes down, but the original comment was about problems that people run into on sites with highly variable traffic that scale up and down on a regular basis (e.g. spin up 20 instances during the day for high traffic, then shut down 15 of them at night to save cost, every day). This of course is unrelated to any AZ/API downtime, and I'm not aware of any problems that could interfere with API usage during normal service operations, which is why I wanted to hear more details about those problems.

rscale · on June 4, 2013

It's due to rarely occurring problems such as APIs being down or unresponsive, insufficient or incorrect instances being available, and a third major class of problem that escapes me at the moment. In addition to not scaling up and down (automatically or manually), many of these apps are architected to require as little from AWS as possible, to reduce their exposure. For example, they'll refrain from using ELB because ELB depends on EBS.

These decisions were made by companies that started off believing fully in the promise of elasticity and gradually shifted to less elastic architectures as they experienced issues. That said, it's worth noting that these are firms with very high costs of downtime, so the impact of each failure was very high.