
I get what they are doing, and I would probably handle it the same way, but I think it is perfectly ok for them to place some blame here, as their downtime WAS, technically, because AWS was down. I think they took it too far by taking 100% of the blame. I'm on Heroku's side; I don't think it was fair to themselves to take 100% of the blame.


As a developer who uses AWS, it's complicated, but I agree that Heroku is 100% at fault.

If I were providing a paid service and my dedicated servers went offline for some reason (let's say the fiber line gets cut by road maintenance crews), my customers wouldn't care what happened - perhaps some would offer sympathy - but in the end it's my fault, and my responsibility to offer some kind of contingency plan, which generally includes load balancing over multiple datacenters, etc. The same view should have been taken here; too much reliance was placed on one region (even though Amazon promised it would be safe - this is their fault ;)). In the end, they should take full blame and now learn from their mistakes.


If you replace fiber line with power line, I was in this exact position in real life once. The power went out in our data center, and our backups all failed - UPSes, diesel generators, etc. The customers who had their servers in our data center didn't blame the guy who cut the power cable, or the power company. They blamed us, the company that ran the data center. The power outage wasn't our fault; the catastrophic failure and lack of disaster recovery planning was. Banks and hospitals are not forgiving when you break their SLAs.


Exactly - and I was actually in a position where the Virginia Dept of Transportation was doing roadwork and cut the fiber lines at the data center I was hosting at. ServInt did everything they could to remedy the situation, and I remained a loyal customer for a long time after that, but the clients I was hosting really didn't care about that and moved away from me - and I respect and understand that completely.


The thing is, multiple availability zones are in multiple data centers. We now know that they have a common failure point, but how could anyone have known that before it happened?

For all we know multiple regions could have an undiscovered common failure point.

Don't get me wrong. Heroku isn't entirely blameless--I had a production app that was down for about 12 hours.


I'm just saying it isn't fair to Heroku to take 100% responsibility. Do you now have to go to the customers of YOUR app hosted at Heroku and take 100% of the blame? Or do you tell them it was Heroku? Were YOU using more hosts than just Heroku?

Where does it end? Is it 100% your customers' fault for using your service and not using multiple services that match your service to be redundant?

I'm just making conversation here now, but I feel like Heroku did not have to go this far.


Yes. If I have an app that I host on Heroku and it goes down, it's MY fault for not putting in place the appropriate backups. I would apologize to my customers and offer them a refund for the downtime. I would not blame Heroku, or Amazon. Do you think my (hypothetical) paying customers care if my hosting provider goes down? No Fking way.


As someone who has been in the position of having to explain to my customers that our host went down: yes, they did understand, and because I had shown them over the years how hard I worked for them, not one expected a refund or anything. They knew that part of what they were paying me for was to fix these problems when they inevitably came up. I'm not saying this will work for everyone, but I believe a lot of people are starting to understand the risks of doing business in the cloud, and the truth of the matter is, some patience is required.


If I have a contractor come and redo my kitchen, and I am quoted a certain time frame, then I will blame him when things do not happen on time - even if his supplier couldn't deliver the cabinets in time, or the hardware store was out of joint compound.

It's a sucky situation, and I feel for Heroku - and I blame AWS personally - but when it comes to the final customer, then as a service provider it's only right to take blame.


Heroku's value proposition though is that they are taking away the back end worries for their customers so they can just deploy apps and not have to worry about it all. If I have to still worry about redundancy outside of their system then their service suddenly becomes a lot less useful.

I think the blame stops at Heroku, who either need to provide more redundancy for incidents like this or let their customers know that, if this type of incident arises, there is little they can do.


Exactly - except that a problem came up, which it still could in the future even with the changes they are making, and when it does, you can probably rely on them to fix it, because that is what you are paying for: someone else to run the back end. And that appears to be exactly what they are doing.



