Game day exercises at Stripe (stripe.com)
92 points by gdb on Oct 28, 2014 | 32 comments



It is good that they are testing, but the mindset that this is a special occasion seems very weird to me.

Rule #1 of programming is that if you didn't test it, it doesn't work. (It may still not work for real after you test it, but at least it's got something.)

You can't claim to anyone, or even yourself, that you have some kind of fault-tolerant system if you don't do this kind of test after every change.


We only started doing this about 3 years ago, after I heard John Allspaw speak at Velocity. It has dramatically cut down on the "Oh, that system we thought was redundant was only redundant in the sense of an appendix, not in the sense of kidneys..." moments. It was initially a hard sell to my ops team to implement it. (I promised them air cover for any losses we incurred as a result of game-day testing, of course.)

I agree with you that it should be common, but I'd suspect an honest survey of the field would show that it's far from standard practice. I'd guess fewer than 25% of companies do this in any meaningful way.


Reminds me of Netflix's Chaos Monkey [1].

They shut down servers and components of the system to make sure it is resilient to failures.

> The Simian Army is a suite of tools for keeping your cloud operating in top form. Chaos Monkey, the first member, is a resiliency tool that helps ensure that your applications can tolerate random instance failures

More interestingly, it is open source. [2]

[1] http://techblog.netflix.com/2012/07/chaos-monkey-released-in...

[2] https://github.com/Netflix/SimianArmy
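The core idea behind Chaos Monkey is small enough to sketch: periodically pick a random subset of your production instances and terminate them, so that resilience gets exercised constantly rather than on special occasions. A minimal illustrative sketch in Python (the `pick_victims` helper and instance ids are made up for illustration, not Simian Army code):

```python
import random

def pick_victims(instances, fraction=0.1, rng=None):
    """Randomly select a fraction of instances to terminate,
    Chaos-Monkey style. Always picks at least one victim."""
    rng = rng or random.Random()
    count = max(1, int(len(instances) * fraction))
    return rng.sample(instances, count)

# Hypothetical fleet of 20 instances; a seeded RNG makes the run repeatable.
instances = [f"i-{n:04d}" for n in range(20)]
victims = pick_victims(instances, fraction=0.1, rng=random.Random(42))
print(victims)
```

The real tool then calls the cloud provider's terminate API on each victim; the interesting part is not the selection logic but that it runs on a schedule against production.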


I've heard this (or similar tests) also referred to as "fire drills".

The first one I like to do is to restore the entire infrastructure from the off-site backup. When I know this is good I can sleep soundly.


So Stripe is the company that configured their master with no persistence, killed it, and blamed Redis for losing their data? It's annoying that they reported the "issue" to Aphyr, decided to withhold their name, and proceeded to spread a ton of FUD against Redis when the core issue was a misconfiguration on their part. Hand-holding can only go so far.


The main point of the post is that if there's a configuration you expect to work, you need to test it, independent of whether or not it's supported by the author. I can think of a dozen other distributed system failures I've seen that happened despite using stock configuration — this particular failure was simply a recent example, and helps illustrate why game days are so valuable.

On the plus side, Antirez has said that this configuration will soon be supported: https://news.ycombinator.com/item?id=8522630 (thanks Antirez!), so future users will be able to run in this configuration safely.


I like the part about three replicas of redis destroying all the data when the primary went down. If anyone out there is using Redis as more than a cache then you're doing it wrong.


This is not what happened: the master, which had persistence turned off, restarted with an empty data set, and the slaves replicated that empty data set. Basically, Redis replication currently must be used with some form of on-disk persistence turned on. However, after the introduction of diskless replication (http://antirez.com/news/81) we now have a good reason to properly support replication with persistence turned off, so there is already feature work in progress to support the Stripe-like use case: https://github.com/antirez/redis/issues/2087


If the promotion logic is wrong how would persistence have helped? Say the primary disk fails. If it's still considered primary when it's brought back up with a fresh disk, wouldn't you get the same empty-replication problem? (I know nothing about redis, just wondering.)


In the case described, there is no promotion logic.

The replicas will try to reconnect to their original master forever unless something else (like Sentinel) redirects them in an actual failover/promotion setup.

So: the master had data, it died, it restarted with no data, and then the replicas immediately reconnected. If the master had had persistence enabled, it would have reloaded the old dataset on startup and the replicas would have re-downloaded everything. Since they are replicas of the master, they will always prefer the master's data over their own, even if the master is empty.

If you were in a strange case where the disk failed and you replaced it with an empty disk (is that what you mean by "fresh disk"?), then it's the same as starting with an empty dataset. That's not quite the same scenario, though: there the server would be intentionally started empty after a maintenance action, rather than an already-populated process restarting empty because there's no saved dataset to load on startup.

The "all replicas resync an empty dataset" behavior is a logical consequence of the configuration they enabled, but one whose repercussions aren't obvious without either experiencing it directly or working through a longer multi-step thought experiment. (But fixes for such things are already on the way. Soon!)
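The failure mode described above is easy to reproduce in a toy model: a master without persistence restarts empty, and the replicas faithfully resync that empty set. A hypothetical sketch (these `Node` and `full_resync` names are illustrative, not Redis internals):

```python
class Node:
    """Toy model of a Redis server: in-memory data plus optional disk copy."""
    def __init__(self, persistence=False):
        self.persistence = persistence
        self.data = {}
        self.disk = {}

    def write(self, key, value):
        self.data[key] = value
        if self.persistence:
            self.disk[key] = value

    def restart(self):
        # On restart, a node reloads only what it persisted to disk.
        self.data = dict(self.disk) if self.persistence else {}

def full_resync(master, replica):
    # Replicas always prefer the master's data set, even if it is empty.
    replica.data = dict(master.data)

master = Node(persistence=False)          # the Stripe-style configuration
replicas = [Node(persistence=True) for _ in range(2)]
master.write("charge:1", "100 USD")
for r in replicas:
    full_resync(master, r)                # replicas now hold the data

master.restart()                          # comes back empty: nothing on disk
for r in replicas:
    full_resync(master, r)                # replicas dutifully resync nothing

print(master.data, [r.data for r in replicas])  # all empty
```

Flipping the master to `persistence=True` in this model makes `restart()` reload the old dataset, which is exactly the fix the parent comments describe.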


Just to add some more info:

Funny enough, what triggers this problem when master persistence is turned off is the lack of failover: if the reboot happens fast enough, Sentinel (if you are using it) never gets the chance to fail over to a replica. So no failure was sensed at all; the master just magically wiped its data set.

So from the point of view of distributed systems, if you want to analyze the sum of Redis replicated nodes + Sentinel, the problem is that the system is not designed to cope with nodes losing state on restarts.

However it is possible to improve this, and I'm doing it. Before diskless replication, though, it was IMHO pretty useless to support persistence-less operation in conjunction with replication, since for the slaves to synchronize, the master had to save an RDB file on disk anyway.


Yes, it's obvious something went wrong with their promotion logic, but in a complicated enough system this is going to happen, and if anyone is using a volatile store as anything other than a cache, this is going to bite them in the ass sooner or later.

I'm happy to see there will be features added to support the use cases that Stripe and others have in mind.


Redis is not volatile when on-disk persistence is turned on. It's non-volatile and non-atomic.


Well in this instance the master did not have on-disk persistence turned on because of latency issues but the slaves did have it turned on. What would you call that kind of setup? Volatile or not?


Quick semantic cleanup: Redis has two types of persistence: snapshot persistence (RDB) and journal persistence (AOF). It sounds like they tried only snapshot ("copy the world") persistence, found it to increase latency, then turned it off. It probably would have been okay with journal persistence.
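For reference, the two persistence modes are selected with separate directives in redis.conf; roughly like this (the thresholds here are illustrative defaults, not Stripe's configuration):

```
# Snapshot (RDB) persistence: dump the whole dataset to disk if at
# least 1 key changed in 900s, or 10 keys in 300s.
save 900 1
save 300 10

# Journal (AOF) persistence: append every write to a log,
# fsync'ing it to disk once per second.
appendonly yes
appendfsync everysec
```

RDB forks and writes the whole dataset at once (the latency spikes they likely saw), while AOF spreads the I/O cost across every write.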

Their setup would have been fine _if_ the master never came back up. If the master didn't restart, then something in their infrastructure would have probably started using the replicas for data (and/or they would have promoted one of the replicas to be the new master, then the other replica would replicate from the new master).

But—having persistence on the replicas did allow them to copy the dataset off to backup storage. Then they were able to restore the old data when it was needed. So, we can prove their setup was persistent since they lost all their data and recovered it. :)


Thanks for the clarifications.


The issue was not latency of the replica but the logic used when restarting. It's just a configuration bug.

In a distributed system, some nodes can act like a cache and others like a persistent store.


Ever seen anyone drop tables from a MySQL master to have that statement replicated across the cluster? (I don't think this problem set is correlated with cache vs. primary store, but with the fact you have some kind of replication going on.)


PagerDuty implemented a similar type of test called "Failure Fridays":

https://blog.pagerduty.com/2013/11/failure-friday-at-pagerdu...

The lessons from forcing actual failures in the production parts of your stack are incredible.


How are they getting such low write times with Postgres relative to Redis with comparable datasets and throughput? It's been a while since I used Postgres, are they implying that they're using a particular add-on or that they've optimised their queries for it?


No, their tests don't imply that they are getting better write times with Postgres across the board; they are measuring the 99th-percentile latency.

What they found is that in the worst case, Redis is worse than Postgres.

So, for example, their Redis response times might look like {1ms, 1ms, 1ms, 9ms}, while their Postgres response times might look like {3ms, 4ms, 3ms, 4ms}. Looking at it this way you might expect Redis to perform better on average, but that outlier is worrying. Stripe values consistent, predictable performance over varying performance.
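Using those made-up numbers, a quick nearest-rank percentile calculation shows how the tail flips the comparison even though Redis wins on average (a sketch, not Stripe's measurement code):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil-ish p% of the way
    through the sorted samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

redis_ms    = [1, 1, 1, 9]   # fast, with one ugly outlier
postgres_ms = [3, 4, 3, 4]   # slower, but consistent

print(percentile(redis_ms, 99), percentile(postgres_ms, 99))  # 9 4
```

Redis has the lower mean (3ms vs 3.5ms) but the higher p99 (9ms vs 4ms), which is exactly the trade-off the parent describes.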


That's really interesting, thanks!


And still, every other day you hear about people using a cache as a primary data store.


That's kinda an obtuse version of a middlebrow dismissal. You're positioning yourself as "of course—I knew better all along! Bow before my foresight!"

In other anecdotal evidence, I've been using Redis as a persistent store since 2011 and haven't lost any data.

There's always the giant caveat: just because software says it does X doesn't mean it does X until you have seen it actually happen. In this case, restarting a zero-persistence master was bad because it goes run -> die -> restart[empty]. Then the replicas immediately reconnect, resync[empty], and now they are completely up to date with the master. That's the only contract Redis was asked for here: always be an exact copy of your master. Since persistence was not requested, the replicas re-sync an empty dataset.

In better news: there are already fixes for each of these issues showing up in Redis Real Soon Now.


That wasn't my intention at all.

Anyway. Redis is considerably mature but still has edge cases that will literally void all your data. And that's definitely not okay when it's being used as your source of truth.


Why not FoundationDB?


It also literally states it's a key-value store in the first sentence on its site. Redis is not just a cache anymore.


This is such a good idea. Thanks for the write-up / sharing.


what software are you using to plot those graphs?



To be explicitly pedantic -- graphite-web plots the graph, whereas graphite manages time series data processing and storage.


To increase the pedantry level: these are not plotted by graphite-web. In this case graphite-web only spits out source data as CSV or JSON; the actual plotting is done by Grafana (see the stereotypical Grafana legends in the screenshots).



