Just my 2 cents... the problems described here seem to fall into the category of "things didn't work 100% of the time".

I have some bad news for you: nothing I have ever used worked 100% of the time. Doubly so on AWS. Just off the top of my head: in the last week alone we have seen 5% of AWS instances act so badly that we had to recycle them, and those are just the easy-to-diagnose problems. Don't get me started on the IOPS marketing BS that Amazon sells.

It's just the reality of a lot of moving parts and complex systems. From the sound of it, the OP had very little _actual_ downtime, and had to make some end-runs a couple of times. Shrug, just life in a high-scale world IMHO.

The goal should not be to bounce around providers until you find the Garden of Eden. The real goal should be to accept failure and build to tolerate it. Now maybe it is easier to build fault-tolerance if you are closer to the metal... BUT I would bet that Heroku has much better experience and tooling to detect and resist faults than you would have rolling your own.