It's not that hard to keep your secondary indexes up to date with Postgres. You can set up external triggers with LISTEN/NOTIFY; i.e. every time a row is updated, Postgres will inform the index refreshing service that an update is needed.
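For what it's worth, here's a minimal sketch of that setup with psycopg2, assuming a hypothetical "documents" table with an integer id (all names here are made up): a trigger calls pg_notify() on every write, and a small daemon LISTENs and reindexes whatever row it is told about.

    import select
    import psycopg2
    import psycopg2.extensions

    # Trigger side: fire a notification carrying the row id on every write.
    SETUP_SQL = """
    CREATE OR REPLACE FUNCTION notify_document_change() RETURNS trigger AS $$
    BEGIN
      PERFORM pg_notify('documents_changed', NEW.id::text);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS documents_notify ON documents;
    CREATE TRIGGER documents_notify
      AFTER INSERT OR UPDATE ON documents
      FOR EACH ROW EXECUTE PROCEDURE notify_document_change();
    """

    def reindex(doc_id):
        print("would push row %s to Elasticsearch" % doc_id)  # placeholder

    conn = psycopg2.connect("dbname=app")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    cur.execute(SETUP_SQL)
    cur.execute("LISTEN documents_changed;")

    while True:
        # Wait up to 5s for a notification, then drain whatever has arrived.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            reindex(notify.payload)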
That approach has some gotchas, though. For one, LISTEN/NOTIFY in Postgres wakes up all listeners on each NOTIFY. You will want a single stream of indexing updates, yet you will also want it to be fault-tolerant: if one indexer isn't running, another must take over its work.
LISTEN/NOTIFY also doesn't work well if you want to distinguish batch updates (large imports, for example) from real-time ones.
Thirdly, if you miss events (say the indexer was down), you want the syncing to catch up from where it left off.
The better design is to maintain state about indexing. If each table has a sequential ID and an "updated_at" column, you can exploit those to build deltas. Another option is to maintain a "changes" table that simply tracks the ID of each modified row, with a sequence number per change. This can be super useful for other purposes (auditing, debugging, support), and makes it easy to index from an earlier point in time.
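For illustration, here's a rough sketch of the "changes" table variant (again with a hypothetical "documents" table; every other name is made up): a trigger appends one row per modification, and the indexer polls for everything past the highest sequence number it has processed.

    import psycopg2

    SETUP_SQL = """
    CREATE TABLE IF NOT EXISTS changes (
      seq        bigserial PRIMARY KEY,
      doc_id     bigint NOT NULL,
      changed_at timestamptz NOT NULL DEFAULT now()
    );

    CREATE OR REPLACE FUNCTION log_document_change() RETURNS trigger AS $$
    BEGIN
      INSERT INTO changes (doc_id) VALUES (NEW.id);
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS documents_log_change ON documents;
    CREATE TRIGGER documents_log_change
      AFTER INSERT OR UPDATE ON documents
      FOR EACH ROW EXECUTE PROCEDURE log_document_change();
    """

    def fetch_delta(conn, last_seq, batch_size=1000):
        """Return (new_last_seq, doc_ids) for changes recorded after last_seq."""
        with conn.cursor() as cur:
            cur.execute(
                "SELECT seq, doc_id FROM changes WHERE seq > %s ORDER BY seq LIMIT %s",
                (last_seq, batch_size))
            rows = cur.fetchall()
        if not rows:
            return last_seq, []
        return rows[-1][0], [doc_id for _, doc_id in rows]

Persist last_seq somewhere durable (a one-row table works) and you can restart indexing from any earlier point.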
Such a system will also let you more easily recover from hard failures in ES where entire shards (and their replicas) are lost (I have had it happen with 1.x; it should be much rarer in 2.x).
Keeping one data store (ES) in sync with another (Postgres, or anything else) is an interesting problem in general.
> If each table has a sequential ID and an "updated_at" column, you can exploit those to build deltas.
It's hard because concurrent transactions can commit sequential IDs or timestamps out of order: there is a lag between the moment a sequential ID or timestamp is generated and the moment the transaction commits, so a poller that tracks the highest value it has seen can permanently skip rows that commit late.
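To make the failure mode concrete, a small sketch (same hypothetical "documents" table with a serial id; everything here is illustrative):

    import psycopg2

    slow = psycopg2.connect("dbname=app")    # transaction A: allocates an id early, commits late
    fast = psycopg2.connect("dbname=app")    # transaction B: allocates a later id, commits first
    poller = psycopg2.connect("dbname=app")

    with slow.cursor() as cur:
        cur.execute("INSERT INTO documents (body) VALUES ('slow') RETURNING id")
        slow_id = cur.fetchone()[0]          # id allocated here, but not committed yet

    with fast.cursor() as cur:
        cur.execute("INSERT INTO documents (body) VALUES ('fast') RETURNING id")
        fast_id = cur.fetchone()[0]          # fast_id > slow_id...
    fast.commit()                            # ...but it becomes visible first

    with poller.cursor() as cur:
        cur.execute("SELECT coalesce(max(id), 0) FROM documents")
        last_seen = cur.fetchone()[0]        # >= fast_id; the poller's cursor moves past slow_id

    slow.commit()                            # slow_id is finally visible, but a future
                                             # "WHERE id > last_seen" query will never return it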
> Another option is to maintain a "changes" table that simply tracks the ID of each modified row, with a sequence number per change.
Another, more recent method that should be more foolproof is to track the actual transaction log. Postgres exposes an API for "logical decoding" of said log. The downside is that it's fairly new, Postgres-specific, and I also suspect it puts more of a burden on the client, which has to interpret the entries and determine if/how they should be indexed. But in theory it gives you a completely data-loss-proof, application-independent way of streaming changes from Postgres.
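Roughly what consuming the log looks like with psycopg2's replication support, assuming wal_level=logical is configured and the wal2json output plugin is installed (both of those, and the slot name, are assumptions on my part):

    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect(
        "dbname=app",
        connection_factory=psycopg2.extras.LogicalReplicationConnection)
    cur = conn.cursor()

    try:
        cur.create_replication_slot('es_indexer', output_plugin='wal2json')
    except psycopg2.ProgrammingError:
        pass  # the slot already exists from a previous run

    cur.start_replication(slot_name='es_indexer', decode=True)

    def consume(msg):
        # msg.payload describes the committed change (JSON with wal2json); the
        # client has to decide which tables/columns matter and turn them into
        # Elasticsearch operations.
        print(msg.payload)
        # Acknowledge, so Postgres can recycle WAL up to this point.
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(consume)

Because the slot remembers how far the consumer got, a crashed indexer picks up where it left off instead of losing changes.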
Using NOTIFY puts the data into a message queue, so messages just sit there until your client listens again. Eventually the notification queue fills up (8GB by default), at which point NOTIFY will fail.
I don't think it will? When the client restarts, it will create a new session with the postgres server. This new session will not receive messages that are queued for the old session.
Ah, yes, I think you're right. I suppose you would want to represent the message queue as a table and have your daemon process remove things from the table as it adds them to Elasticsearch. That way, if the daemon crashes, the events just continue to buffer.
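Something along these lines, assuming a hypothetical "index_queue" table: DELETE ... RETURNING hands the daemon a batch and removes it in the same transaction, so if the push to Elasticsearch fails, the transaction rolls back and the rows stay queued.

    import psycopg2

    def push_to_elasticsearch(doc_ids):
        print("would bulk-index", doc_ids)  # placeholder

    def drain_batch(conn, batch_size=500):
        with conn:  # commits on success, rolls back on exception
            with conn.cursor() as cur:
                cur.execute(
                    """
                    DELETE FROM index_queue
                    WHERE id IN (
                        SELECT id FROM index_queue
                        ORDER BY id
                        LIMIT %s
                        FOR UPDATE SKIP LOCKED  -- 9.5+; lets several daemons share the queue
                    )
                    RETURNING doc_id
                    """,
                    (batch_size,))
                doc_ids = [row[0] for row in cur.fetchall()]
                if doc_ids:
                    push_to_elasticsearch(doc_ids)
        return len(doc_ids)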
Same way you handle redundancy in general. You'll need multiple clients listening, and some sort of consensus algorithm to get them back into sync when one restarts.
One simple way (which involves doing double work) would be to have two clients, both of which listen to Postgres and push to Elasticsearch. Each sync will happen twice, but a sync will only be missed if both go down.
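The double work is harmless as long as each document is indexed under a deterministic _id, since the second write just overwrites the first with the same content. A sketch with the official Python client (index and type names are made up, 1.x/2.x-era API):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    def index_document(doc_id, body):
        # Both daemons write the same _id, so the duplicate work is an overwrite,
        # not a duplicate document. If the two writers can race with different
        # versions of the row, external versioning (version_type="external") is
        # one way to stop an older write from clobbering a newer one.
        es.index(index="documents", doc_type="document", id=doc_id, body=body)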
https://www.chrisstucchio.com/blog/2013/postgres_external_tr...