Obviously it's good PR, and we all appreciate the mea culpa from Heroku, but the fact is that they are proposing to migrate to a situation where they are still completely reliant on AWS for their hosting.
I'm just not sure you can really say "We don’t want to ever put our customers through something like this again and we’re working as hard as we can on making sure that we won’t ever have to" when, at the end of the day, you are again relying on a company that has failed you in the past.
I'm not trying to attack Amazon or Heroku; I'm honestly intrigued by this issue, not least because we are facing the exact same decision at work.
Regarding Heroku's plan to continue relying on a company, Amazon, that failed them before:
If Heroku evolves to an architecture in which they utilize multiple AWS regions (as they mention in lesson #1 of their post-mortem) and if each region has a distinctly partitioned API "control plane," this should result in a materially improved availability situation for Heroku. EC2 Availability Zones guard against machine, power, and building failures. EC2 Regions should theoretically guard against API infrastructure and AWS software code failures.
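To put rough numbers on the multi-region argument, here is a back-of-the-envelope sketch in Python. The availability figures are made up for illustration (they are not AWS or Heroku numbers), and the function name and parameters are my own. It assumes regions fail independently of each other, which is exactly the open question; the `shared_dependency` parameter models anything all regions have in common, such as a control plane that is not actually partitioned.

```python
def combined_availability(per_region, shared_dependency=1.0):
    """Estimate availability of a service spread across several regions.

    per_region: availability of each region (0..1), ASSUMED independent.
    shared_dependency: availability of anything every region relies on
    (hypothetical, e.g. a common API control plane). 1.0 means no shared
    single point of failure.
    """
    all_regions_down = 1.0
    for availability in per_region:
        all_regions_down *= (1.0 - availability)
    # The service is up if the shared piece is up AND at least one region is up.
    return shared_dependency * (1.0 - all_regions_down)


# Hypothetical figures, for illustration only.
single_region = combined_availability([0.99])
two_regions = combined_availability([0.99, 0.99])
two_regions_shared_api = combined_availability([0.99, 0.99], shared_dependency=0.99)

print(f"one region:                   {single_region:.4f}")
print(f"two independent regions:      {two_regions:.4f}")
print(f"two regions, shared API tier: {two_regions_shared_api:.4f}")
```

With these made-up inputs, two truly independent regions come out around 0.9999, while two regions behind a shared 0.99 dependency come out no better than a single region. That is why the "distinctly partitioned control plane" condition matters so much.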
Heroku need not necessarily ditch their current single-IaaS-provider architecture in order to achieve significantly better control over their service's uptime.
On the other hand, when downtime does occur, the ability for Heroku to prioritize their incident response manpower to first handle paying customers has its limits based on their downstream dependencies. If all the broken bits are within Amazon's black box, Heroku doesn't have much control over prioritization (Amazon fixes your stuff whenever it gets around to fixing your stuff). If Heroku operated over multiple cloud providers, even with the added complexity of such an approach, at least Heroku would have control over choosing which of their most important customers to migrate first to a working cloud, away from a broken and black box cloud.
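As a minimal sketch of what "control over prioritization" could look like with a second provider available, the snippet below orders apps for evacuation. Everything here is hypothetical: the `App` fields, the `migration_order` function, and the idea of ranking by monthly spend are my assumptions, not anything Heroku has described.

```python
from dataclasses import dataclass


@dataclass
class App:
    # Hypothetical fields, chosen only for illustration.
    name: str
    paying: bool
    monthly_spend: float


def migration_order(apps):
    """Return apps in the order they would be moved to the working cloud.

    Paying customers go first; within each group, higher spend goes first.
    """
    return sorted(apps, key=lambda app: (not app.paying, -app.monthly_spend))


apps = [
    App("hobby-blog", paying=False, monthly_spend=0.0),
    App("big-shop", paying=True, monthly_spend=5000.0),
    App("small-saas", paying=True, monthly_spend=200.0),
]

for app in migration_order(apps):
    print(app.name)
```

The point is only that this ordering is a decision Heroku gets to make when a second cloud exists. When the failure is inside a single provider's black box, the provider's own repair queue decides instead.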
In the end, I certainly don't see these considerations as simple. It's easy to cry when things go wrong, but I think the level of scalability and availability that has been achieved up to the present is quite noteworthy.
Interesting that this essentially all stems from yet another communication failure from AWS. Once they have a post-mortem and can explain the multi-AZ issue, we may have a better idea of whether multi-region spread is sufficient redundancy. Or they could completely fail to communicate enough information, and adequately wary customers will be left with no choice but to assume that regions are not sufficiently independent.