Elasticsearch node crashes can cause data loss (github.com/elastic)
112 points by felipehummel on May 2, 2015 | 50 comments



Mandatory reading -- last year's Call Me Maybe: Elasticsearch

https://aphyr.com/posts/317-call-me-maybe-elasticsearch

I've been hearing a lot of people talk about Elasticsearch lately. I get the same gut feeling I was getting about MongoDB back during the "Webscale" days.


In my experience, Elasticsearch is the single most common source of infrastructure downtime and service failure. It's basically my arch nemesis.


I am interested to hear a bit more about this, as I find it hard to believe. I have only run it at pretty small scale - 8 servers, around 300 million documents indexed a day, peak index rate 30k docs/sec. I found that you have to monitor it correctly, tune the JVM slightly (mostly GC), give it fast disks, lots of RAM, and the correct architecture (search, index & data nodes) to get the most out of it. Once I did that it was one of the most reliable components of my infrastructure, and still is. I would recommend chatting to people on the Elasticsearch IRC or mailing list; everyone was a great help to me there.
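
To make the "correct architecture" part concrete, here's a rough sketch of the node-role split in elasticsearch.yml (setting names are from the 1.x line, and the minimum_master_nodes value assumes three master-eligible nodes -- treat it as an example, not a recipe):

    # Dedicated master-eligible node: coordinates the cluster, holds no data.
    node.master: true
    node.data: false

    # Dedicated data node: holds shards, does the heavy indexing/search work.
    #   node.master: false
    #   node.data: true

    # Dedicated client/search node: routes requests, aggregates results, holds no data.
    #   node.master: false
    #   node.data: false

    # Quorum of master-eligible nodes ((N / 2) + 1) to reduce split-brain risk.
    discovery.zen.minimum_master_nodes: 2

    # Heap is set outside this file via ES_HEAP_SIZE -- roughly half of RAM,
    # and comfortably under 32 GB so compressed oops still apply.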


The full explanation deserves a blog post, but in a nutshell it revolves around the issue that ES contains a huge amount of complexity around a feature that is actually fairly useless (the "elastic" part) or at least difficult to use correctly. I've found that you need to be a deep expert in ES to architect and run it properly (or have access to such expertise) and even then it requires regular care and feeding to maintain uptime. In a short-deadline startup world you probably won't have time for any of that--once it's working it will lull you into a false sense of security and then completely blow up a few weeks/months later.


Same here. A single node failure has led to the whole cluster crashing down around me on more than one occasion.


Really? Perhaps I was never running it at a large enough scale, but even pre-v1.0 I've basically never had any trouble with it (outside of operational concerns like the occasionally confusing query syntax). Then again, I never had more than 11 servers in the cluster, so I may just have never run into problems at scale.


While I don't necessarily disagree, I do find that this depends entirely on how ES is used. All too often people dive headfirst into using Elasticsearch in ways it really should not be used.


It can't be worse than RabbitMQ... can it?


I use ES only for search (indexes built from a DB), so losing data isn't a massive drama; it's great for my use case.


That sounds like the intended use. I should qualify my comment: I've heard it advocated as a primary data store.


I've only heard of very few cases where people were using ES as primary storage, and even there they acknowledged that they were probably crazy for doing so.


Yeah, I had an argument about that over at Reddit, where someone advocated ES as an alternative to Cassandra. >___<. I hopefully convinced the user otherwise.


Elasticsearch is just a text search engine based on Lucene. You use ES, Solr, or the Lucene library directly if you want fuzzy search and such.

You really want to use it in tandem with a storage DB like PostgreSQL, Cassandra, or MongoDB, where ES (or any Lucene-based indexer) is used for the text searching.

I personally like PostgreSQL and Cassandra, and would use them in tandem with ES. Solr, last I checked, was a bit complicated to cluster.
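
A minimal sketch of that tandem setup with the Python clients (the articles table, index name, and fields are made up for illustration -- the point is that ES only holds the searchable text plus the primary key, and the row of record stays in PostgreSQL):

    # pip install elasticsearch psycopg2
    import psycopg2
    from elasticsearch import Elasticsearch

    pg = psycopg2.connect("dbname=app")  # primary store: the source of truth
    es = Elasticsearch()                 # search index: disposable and rebuildable

    # Index only what you need to search on, keyed by the primary-store id.
    with pg.cursor() as cur:
        cur.execute("SELECT id, title, body FROM articles")
        for row_id, title, body in cur:
            es.index(index="articles", doc_type="article", id=row_id,
                     body={"title": title, "body": body})

    # Searches return ids; the authoritative rows are then fetched from PostgreSQL.
    hits = es.search(index="articles",
                     body={"query": {"match": {"body": "fuzzy search"}}})
    ids = [hit["_id"] for hit in hits["hits"]["hits"]]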


Agreed. Cassandra is especially nice if you have the DataStax Enterprise version which allows for seamless integration between the two.


> Solr, last I checked, was a bit complicated to cluster

SolrCloud, with Zookeeper, is relatively new and not too difficult to set up.


Does it still have the issue where you have to take the cluster down to create a new index or modify existing ones?


No, search for MergeIndexes or --go-live.


What about storing data for analytics? Wouldn't it be better to use ES than Postgres for that?


The advice I've heard from serious people using Elasticsearch for serious things indicates that you should definitely not use Elasticsearch as a primary data store (i.e. it should be treated as a cache).


This is true. On the other hand, even a secondary data store that's considered "lossy" poses a challenge — how do you know if its integrity has been compromised?

In other words, if you're firehosing your primary data store into ElasticSearch, you'll want to know whether it's got all the data you pushed to it at any given time.

I suppose you could use some kind of heuristic to detect this, like posting a "checksum" document occasionally that contains the indexing state and thus acts as a canary that lets you detect loss. On the other hand, this document would be sharded, so you'd want one such document per shard. Is this a solved problem?


A logical "SELECT COUNT(*) WHERE updated_at < now()" is probably reasonably fast on your primary store and ElasticSearch.
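
Something like this, concretely (table, index, and field names are placeholders; the cutoff sits a few minutes in the past so in-flight writes and the ES refresh interval don't produce false alarms):

    from datetime import datetime, timedelta

    import psycopg2
    from elasticsearch import Elasticsearch

    cutoff = datetime.utcnow() - timedelta(minutes=5)

    # Count on the primary store...
    pg = psycopg2.connect("dbname=app")
    with pg.cursor() as cur:
        cur.execute("SELECT count(*) FROM documents WHERE updated_at < %s", (cutoff,))
        (pg_count,) = cur.fetchone()

    # ...and the same logical count on the search index.
    es = Elasticsearch()
    es_count = es.count(index="documents", body={
        "query": {"range": {"updated_at": {"lt": cutoff.isoformat()}}}
    })["count"]

    if pg_count != es_count:
        print("index is behind or lossy: pg=%d es=%d" % (pg_count, es_count))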


Given that ElasticSearch is "eventually consistent", how do you know when it has caught up? How do you know what data is missing once the count is wrong?

It's solvable, of course, but it's a problem that pops up with any synchronization system, and I'm surprised nobody (apparently) has written one, because it requires a fairly good state machine that can also compute diffs. Once a store grows to a certain state, you do not want to trigger full syncs, ever.

The best, most trivial (in terms of complexity and resilience) solution I have found is to sync data in batches, give each batch an ID, and record each batch both in the target (e.g., ElasticSearch) and in a log that belongs to the synchronization process. The heuristic is then to compute a difference between the two logs to see how far you need to catch up.

This will only work in a sequential fashion; if ElasticSearch loses random documents, it won't be picked up by such a system. You could fix this by letting each log entry store the list of logical updates, checksummed, and then doing regular (e.g., nightly) "inventory checks".
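
A bare-bones sketch of that batch/log scheme (index names, the sync.log file, and integer batch ids are all invented for the example):

    import json
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch()

    def sync_batch(batch_id, docs):
        # 1. Push the documents themselves.
        bulk(es, ({"_index": "documents", "_type": "document",
                   "_id": d["id"], "_source": d} for d in docs))
        # 2. Record the batch in the target...
        es.index(index="sync_log", doc_type="batch", id=batch_id,
                 body={"batch_id": batch_id, "count": len(docs)})
        # 3. ...and in the syncer's own append-only log.
        with open("sync.log", "a") as f:
            f.write(json.dumps({"batch_id": batch_id, "count": len(docs)}) + "\n")

    def last_acknowledged_batch():
        # The catch-up heuristic: find the newest batch record the target still
        # has, then replay everything after it from sync.log. Sequential by
        # design -- randomly lost documents will not show up here.
        res = es.search(index="sync_log", body={
            "size": 1, "sort": [{"batch_id": {"order": "desc"}}]})
        hits = res["hits"]["hits"]
        return hits[0]["_source"]["batch_id"] if hits else None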


> Given that ElasticSearch is "eventually consistent", how do you know when it has caught up?

That's nice; in practice, though, ElasticSearch doesn't behave like an eventually consistent system--it behaves like a flawed fully consistent system. It doesn't self-repair enough to be eventually consistent. If you get out of sync by more than a few seconds, you're going to have to repair the system manually in some fashion. It never "catches up."

Additionally, also in practice, most of the data loss (and real eventual-consistency behavior) you'll see in an ElasticSearch+primary-data-store system isn't coming from within ES--it's coming from the queues people typically use in the sync process. So there's a degree to which you're going to need to handle this on an application-specific basis.

> How do you know what data is missing once the count is wrong?

In practice, people just ignore it or do a full re-index. Theoretically, you should be building merkle trees.

> Once a store grows to a certain state, you do not want to trigger full syncs, ever.

This is not really true. You need to maintain enough capacity for full syncs, because someone will need to do schema changes and/or change linguistic features in the search index.


Of course you need to be able to do full syncs, and the sync is not a problem. But one needs to solve the two challenges I have described:

1. Determine how to do an incremental update, given that only the tail of the stream of updated documents is missing. Not as simple as just counting.

2. Determine when you must give up and fall back to a full sync; this is when not just the tail is missing, and finding the difference is computationally non-trivial. You'll only want to do this once you're sure that you need to.

My point remains that ElasticSearch's consistency model means it's hard to even do #1, which is the day-to-day streaming updates.

My second point was that this — streaming a "non-lossy" database as a change log into one or more "lossy" ones — is such a common operation that it should be a solved problem. It certainly requires something more than a queue.

(In my experience, queues are terrible at this. One problem is that it's hard to express different priorities this way. If you have a batch job that touches 1 million database rows, you don't want these to fill your "real-time" queue with pending indexing operations. Using multiple queues leaves you open to odd inconsistencies when updates are applied out of order. And so on. Polling triggered by notifications tends to be better.)


> "My second point was that this — streaming a "non-lossy" database as a change log into one or more "lossy" ones — is such a common operation that it should be a solved problem. It certainly requires something more than a queue."

This is almost exactly the cross-DC replication problem, which is a subject of active research.

A changelog on the source side is only sort of helpful. It's useful to advise which rows may have changed, but given that you don't trust the target database, you also need to do repair.

Correct repair is impossible without full syncs (or at least a partial sync since a known-perfectly-synced snapshot), unless your data model is idempotent and commutative. On-the-fly repair requires out-of-order re-application of previous operations.

The easiest way to reason about commutativity is to just make everything in your database immutable. So this is a solved problem, but it requires compromises in the data model that people are mostly unwilling to live with.

You can do pretty well if your target database supports idempotent operations.

If you're trying to do pretty well, then you can do a Merkle tree comparison of both sides at some time (T0 = now() - epsilon) to efficiently narrow down which records have been lost or misplaced. Then you re-sync them. Here, for efficiency, your Merkle tree implementation will ideally span field(s) that are related to the updated_at time, so that only a small subset of the tree is changing all the time. This is a tricky thing to tune.

You'll still be "open to odd inconsistencies when updates are applied out of order" if you haven't made your data model immutable, but I think this is mostly in line with your hopes.


Merkle trees (as used for anti-entropy in Cassandra, Riak etc.) are only practical when both sides of the replication can speak them, though.

I wonder, are Merkle trees viable for continuous streaming replication, not just repair?


You'd implement the Merkle trees yourself in the application layer. Alternatively, you could use hash lists. It'd be somewhat similar to how you'd implement geohashing. Let's say you just take a SHA-X of each row represented as a hex varchar, then do something like "SELECT COUNT(*) GROUP BY SUBSTR(`sha_column`, 0, n)". If there's a count mismatch, then drill down into it by checking the first two chars, the first three chars, etc. Materialize some of these views as needed. It's ugly and tricky to tune.
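
One level of that drill-down, sketched in Python rather than SQL (rows are assumed to be plain dicts; source_rows/target_rows stand in for however you read each side):

    import hashlib
    from collections import Counter

    def row_sha(row):
        # Stable, canonical serialization of the row -> hex digest.
        return hashlib.sha1(repr(sorted(row.items())).encode()).hexdigest()

    def buckets(rows, prefix_len):
        # Application-level analogue of grouping counts by the first
        # prefix_len characters of the stored sha column.
        return Counter(row_sha(r)[:prefix_len] for r in rows)

    def mismatched_prefixes(source_rows, target_rows, prefix_len=1):
        src = buckets(source_rows, prefix_len)
        dst = buckets(target_rows, prefix_len)
        # Any prefix whose counts differ gets drilled into at prefix_len + 1,
        # until the buckets are small enough to diff the rows directly.
        return [p for p in set(src) | set(dst) if src[p] != dst[p]]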

Merkle trees aren't interesting in the no-repair case, as the changelog is more direct and has no downside.


It is often advocated as a datastore for logging data... which means (in that case) it's usually the primary datastore but perhaps not mission-critical.


It's a great index for log data.

Spew your log data into a standard syslog server, while also pumping it into Logstash.

Using Elasticsearch as your canonical log storage would be ridiculous.


Once you start relying on it to understand the state of whatever it's logging, it's mission-critical.


It would probably be good enough as a store for A/B testing information - losing data here isn't critical but writing speed is.


Crashes of a program will not affect data being written to disk if said data has already made it into the FS cache (i.e. it isn't sitting in user-space buffering such as std::ostream::write). Dirty pages will eventually be written to disk even if the process dies uncleanly. Only something that keeps the kernel from flushing to disk can keep the page from eventually being written out (driver bug, kernel bug, hardware failure, power failure).

From reading the code in Jepsen it looks like kill -9 is all that's being used to start failures. So there's a real bug here: https://github.com/aphyr/jepsen/blob/master/elasticsearch/sr...
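
For anyone fuzzy on the distinction being drawn here, a tiny sketch of the three layers a write passes through (kill -9 only loses whatever is still stuck at the first one):

    import os

    f = open("translog.bin", "ab")

    f.write(b"operation 1\n")   # 1. User-space buffer: gone if the process is kill -9'd now.
    f.flush()                   # 2. Dirty page in the FS cache: survives kill -9 (the kernel
                                #    still writes it out), lost on power or kernel failure.
    os.fsync(f.fileno())        # 3. On disk, as far as the OS can promise: survives both.
    f.close()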


I think Kyle was just going by the documentation. And that is often what he tests -- how does the reality compare to the claims in the documentation and marketing.

So given these claims:

> Per-Operation Persistence. Elasticsearch puts your data safety first. Document changes are recorded in transaction logs on multiple nodes in the cluster to minimize the chance of any data loss.

One would hope they at least flushed the user space buffers.


Suggested reading -- the link.

> by not fsync'ing each operation (though one can configure ES to do so).

It may not be default, but we've seen, again and again, how people are influenced by what they read about a database (e.g. MongoDB).

The lesson by now should be: always know your DB.


Fsync should always be on by default. Require the user to turn it off. I'd even argue that `fsync` itself is broken and that semantics should be inverted.


What about pervasive virtualization? The issue here is not really fsync. A fault-tolerant in-memory cluster should not lose data.


Writes should be durable by whatever means the platform deems durable. Durable should be the default. It should take work to have non-durable I/O.


> always know your DB

True, but Elasticsearch is not intended to be a permanent datastore.


Do I understand it correctly from skimming aphyr's article https://aphyr.com/posts/317-call-me-maybe-elasticsearch that TL;DR: Elasticsearch does not use a WAL with a real consensus algorithm such as Paxos or Raft, and therefore isn't reliable?


I'll just leave this here: one of my first answers on Quora about why ElasticSearch should not be a primary data store:

https://www.quora.com/Why-should-I-NOT-use-ElasticSearch-as-...


I hope they can use that $70m they raised last year to throw some engineers at their architectural issues and fix the data loss.

Or maybe they're spending that money on marketing and re-branding.


Funnily enough, I have seen a slew of technical bulletins from Cloudera warning of similar issues with HDFS.

Maybe not so funny if your multiply redundant cluster loses data because a single node dies...


Wow, that sounds bad and I don't remember hearing about it. Do you have any pointers to bug reports or descriptions of the problem?

HDFS uses chain replication, so I would have expected that by the time the client got acknowledgement of a write, it would already be acknowledged by all replicas (3 by default). So even if there's a bug causing one of the nodes to go down without fsyncing, there shouldn't be any actual data loss.


I think the client considers the data written after $dfs.namenode.replication.min replicas have been written, which I think is 1 by default.

What it actually means inside HDFS when it claims 'written', I'm not sure - I'd assume flushed to the dirty page buffer at a minimum and would hope fsync.


Yeah, I'm very interested in this also.


>>> OK, it's not simply that a node dies, but that disks on a node are replaced (which might sort of be related to a node dying).

TSB 2015-51: Replacing DataNode Disks or Manually Changing the Storage IDs of Volumes in a Cluster May Result in Data Loss

Updated: 4/22/2015

In CDH 4, DataNodes are identified in HDFS with a single unique identifier. Beginning with CDH 5, every individual disk in a DataNode is assigned a unique identifier as well.

A bug discovered in HDFS, HDFS-7960, can result in the NameNode improperly accounting for DataNode storages for which Storage IDs have changed. A Storage ID changes whenever a disk on a DataNode is replaced, or if the Storage ID is manually manipulated. Either of these scenarios causes the NameNode to double-count block replicas, incorrectly determine that a block is over-replicated, and remove those replicas permanently from those DataNodes.

A related bug, HDFS-7575, results in a failure to create unique IDs for each disk within the DataNodes during upgrade from CDH 4 to CDH 5. Instead, all disks within a single DataNode are assigned the same ID. This bug by itself negatively impacts proper function of the HDFS balancer. Cloudera Release Notes originally stated that manually changing the Storage IDs of the DataNodes was a valid workaround for HDFS-7575. However, doing so can result in irrecoverable data loss due to HDFS-7960, and the release notes have been corrected.

Users affected: Any cluster where Storage IDs change can be affected by HDFS-7960. Storage IDs change whenever a disk is replaced, or when Storage IDs are manually manipulated. Only clusters upgraded from CDH 4 or earlier releases are affected by HDFS-7575.

Symptoms: If data loss has occurred, the NameNode reports "missing blocks" on the NameNode Web UI. You can determine which files the missing blocks belong to by using FSCK. You can also search for NameNode log lines like the following, which indicate that a Storage ID has changed and data loss may have occurred:

2015-03-21 06:48:02,556 WARN BlockStateChange: BLOCK* addStoredBlock: Redundant addStoredBlock request received for blk_8271694345820118657_530878393 on 10.11.12.13:1004 size 6098

Impact: The replacement of DataNode disks, or manual manipulation of DataNode Storage IDs, can result in irrecoverable data loss. Additionally, due to HDFS-7575, the HDFS Balancer will not function properly.

Applies to: HDFS, all CDH 5 releases prior to 3/31/15, including 5.0, 5.0.1, 5.0.2, 5.0.3, 5.0.4, 5.0.5, 5.1, 5.1.2, 5.1.3, 5.1.4, 5.2, 5.2.1, 5.2.3, 5.2.4, 5.3, 5.3.1, 5.3.2.

Immediate action required: Do not manually manipulate Storage IDs on DataNode disks. Additionally, do not replace failed DataNode disks when running any of the affected CDH versions. Upgrade to CDH 5.4.0, 5.3.3, 5.2.5, 5.1.5, or 5.0.6.

See also: HDFS-7575, HDFS-7960.


That is a bug in CDH/HDFS, but it was an error and is now fixed. That's not diminishing the severity of the bug, but you can patch and get the correct behaviour, without a performance hit.

That is not comparable to what seems to be the case here with ES & MongoDB, where they deliberately (by design) accept the risk of data loss to boost performance. Now most systems allow you to make that trade-off, but an honest system chooses the safe-but-slow configuration by default, and has you knowingly opt in to the risks of the faster configuration.

I hope you consider editing your initial post - if you conflate bugs with deliberately unsafe design, we just end up with a race to the bottom of increasingly unsafe but fast behaviour.


Title should have ElasticSearch in there...

I was thinking of NodeJS.

But the comment is correct, ES is not a DB but an indexer and search engine.

edit:

Oh god, don't use it for storage. It indexes stuff.

You give it a document, and it'll store it in root-word form so you can fuzzy search. It'll also do other NLP stuff to your document, such as removing stop words. Once it's in the index, you can store a value alongside it that points back to your primary storage (Cassandra, PostgreSQL).

At least that's how I used it. If there is any better alternative I'd like to know about it.

edit:

I highly recommend: http://www.manning.com/ingersoll/

Taming Text by Ingersoll; it won a Dr. Dobb's award too.


That was our mistake. We put it back. Thanks.


Who cares?

As long as we can cash in our options before the lawsuits come in, we win! Just like my last job in finance, actually.

Plus SQL's gross. You can't even webscale with it, and old people like it, so it must suck.



