I've done two large search projects in the last year: one with Postgres search and one with Elasticsearch. The Elasticsearch version was trivial to configure, and was embarrassingly superior to the point where I wish I had never considered Postgres.
If you're using a secondary data store like Elasticsearch, presumably you're combining it with a traditional relational database like Postgres or MySQL. And in most use cases you're going to need additional constraints on top of full text, like restricting to a specific customer or topic. This means data duplication. Any experienced developer should be breaking out in hives at this point, because it means:

1. keeping two data stores in sync;
2. adding another service to the deployment process;
3. adding another service to development environments;
4. adding another service to debug when problems crop up;
5. large-scale schema changes and migrations are complicated by having two data stores;
6. operational overhead from maintaining Elasticsearch: separate backup scripts, deployment scripts, and sharding config (across production, QA, and dev);
7. queries are potentially complicated because you may have to aggregate information from two data stores (pagination and delayed updates make this even more hellish);
8. if your company is strict about devs accessing production data, ES just introduces another layer of requesting permissions, and every time the schema changes you have to ask again.
And maybe the advantages of Elasticsearch far outweigh these negatives, but this is a conversation you're going to have with your devs because the cost of using ES is not zero. There are tradeoffs, and maybe just using Postgres full text search is adequate for your use case given sufficient additional constraints.
Data duplication virtually always happens as the organization gets larger. Even if you stick to RDBMS-only, you're going to get it once you set up master/slave replication, batch jobs, data science, tape backups, etc.
It's not a serious problem as long as you designate one data source as primary and every other data source as derived. That way, if you ever get data corruption issues, you can re-create the derived data and blow away the old copy. You also don't face complex data-sync issues, because all data-transfer is unidirectional. You write into Postgres or your other source-of-truth, and then you write from Postgres to ElasticSearch. If there's a difference between them, the Postgres version is authoritative, so copy its data over ElasticSearch.
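A minimal sketch of that one-way flow, assuming a hypothetical `documents` table and index, using psycopg2 and the official Python client. If the derived ES copy ever looks wrong, you just run it again:

```python
import psycopg2
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

pg = psycopg2.connect("dbname=app")          # source of truth
es = Elasticsearch("http://localhost:9200")  # derived store

def docs():
    # named (server-side) cursor so rows are streamed rather than loaded at once
    with pg.cursor(name="reindex") as cur:
        cur.execute("SELECT id, title, body FROM documents")
        for doc_id, title, body in cur:
            yield {"_index": "documents", "_id": doc_id,
                   "_source": {"title": title, "body": body}}

# Postgres wins any disagreement: just overwrite the ES copy from it
bulk(es, docs())
```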
Most of your other points are common to adding any new major feature, and you should always weigh the user benefits against the maintenance costs of a large new dependency. Presumably, if you're looking at ElasticSearch, it's because full-text search in your RDBMS is not good enough (which won't always be the case, but has been my experience and apparently the experience of several other people here).
The devops issues aren't a valid counterpoint; tuning an RDBMS is an art too.
It's much easier to scale an ephemeral search service than a transactional database. Likely even one search node will support a higher query throughput.
Add haystack to INSTALLED_APPS, add the Elasticsearch backend, and set the endpoint.
Use a Haystack index class that is prebuilt to hook into all model change signals.
Done. No weird configuration; everything was vanilla.
Maybe I am missing a step, but it took me less than an hour to get everything going and it hasn't needed any maintenance since. The whole setup is roughly as sketched below.
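A sketch with django-haystack, assuming a hypothetical `Article` model; names are illustrative and the exact backend module depends on your haystack/ES versions:

```python
# settings.py
INSTALLED_APPS += ["haystack"]

HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine",
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "myapp",
    },
}
# hook into model save/delete signals so the index stays up to date
HAYSTACK_SIGNAL_PROCESSOR = "haystack.signals.RealtimeSignalProcessor"

# search_indexes.py
from haystack import indexes
from myapp.models import Article  # hypothetical model

class ArticleIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr="author")

    def get_model(self):
        return Article

    def index_queryset(self, using=None):
        return self.get_model().objects.all()
```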
When you are adding ngram support and all the indexes and views in Postgres to replicate that behavior, ES looks less complicated. At the very least I don't see a reduction in complexity doing it the Postgres way; you just have one less dependency to worry about.
With this solution, it looks like Django is keeping the Elasticsearch index in sync with the SQL database. What happens if the Django process crashes after having committed data to the SQL database but just before having updated the Elasticsearch index? How do you reconcile the index with the database after the fact?
Three options, depending on how demanding your users are:
1.) Don't. So what? That entry will never show up in search results, which is probably exactly what would've happened if you'd used a search engine with poor ranking, and exactly what happens if you don't provide search at all.
2.) Blow away your index periodically and re-create it at off-peak times, or upon crash. Works as long as your data set is small enough to read it all off-peak.
3.) When you read your search results back, check them against the source-of-truth and re-index anything that's inconsistent. Relatively easy if there's a 1:1 correspondence between ElasticSearch documents and RDBMS tables; gets more difficult with complex joins.
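A rough sketch of option 3, assuming a hypothetical `documents` table with an `updated_at` column that is also stored in each ES document (the reconciliation happens at read time):

```python
def search_with_repair(es, pg_cur, query, reindex_queue):
    hits = es.search(index="documents",
                     body={"query": {"match": {"body": query}}})["hits"]["hits"]
    ids = [int(h["_id"]) for h in hits]
    # psycopg2 adapts Python lists to arrays, so ANY(%s) works directly
    pg_cur.execute("SELECT id, updated_at FROM documents WHERE id = ANY(%s)", (ids,))
    truth = dict(pg_cur.fetchall())

    results = []
    for hit in hits:
        doc_id = int(hit["_id"])
        if doc_id not in truth:
            reindex_queue.append(doc_id)   # deleted in Postgres: drop the stale hit
            continue
        if str(truth[doc_id]) != hit["_source"].get("updated_at"):
            reindex_queue.append(doc_id)   # out of date: queue it for re-indexing
        results.append(hit)
    return results
```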
ElasticSearch has a very good introductory guide included in the documentation. Go through the first half of the Getting Started section and you should be well on your way, with most of the following sections being just for fine-tuning.
Depending on your problem domain, you either do periodic dumps of your data into ElasticSearch (oftentimes done nightly, when server load is low), or you do a dual-write layer in your application.
Search is one area where user expectations around freshness are somewhat relaxed. If your search results don't include up-to-the-minute results, in most domains users will forgive you. It's better than having poor relevance.
If up-to-the-minute results are critical, then update your DB abstraction layer to write into ElasticSearch as well as Postgres. It's a pain, but manageable. Consistency, again, is not as critical for search as for a source-of-truth store, because a missing or stale record is the same as a poor ranking algorithm that didn't retrieve it in the first place. And if your application detects a bad search result, it can always go back to the source of truth and re-index it.
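A sketch of what that dual-write layer might look like (hypothetical `documents` table/index; the point is the ordering: commit to the source of truth first, treat the ES write as best-effort):

```python
import logging

log = logging.getLogger(__name__)

class DocumentStore:
    """Thin write layer: Postgres is the source of truth, ES is best-effort."""

    def __init__(self, pg_conn, es):
        self.pg = pg_conn
        self.es = es

    def save(self, doc_id, title, body):
        # 1. Commit to the source of truth first
        with self.pg, self.pg.cursor() as cur:
            cur.execute("""INSERT INTO documents (id, title, body)
                           VALUES (%s, %s, %s)
                           ON CONFLICT (id) DO UPDATE
                              SET title = EXCLUDED.title, body = EXCLUDED.body""",
                        (doc_id, title, body))
        # 2. Then the search index; a failure here only means a stale result
        #    until the next reconciliation or re-index pass
        try:
            self.es.index(index="documents", id=doc_id,
                          body={"title": title, "body": body})
        except Exception:
            log.warning("ES update failed for %s; will be fixed on re-index", doc_id)
```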
"oftentimes done nightly, when server load is low" what does this mean? Unless you are running a highly localized website, websites don't have a nighttime. We see pretty much constant usage 24/7 (though a bit higher when it is daytime in Asia). This probably would have made more sense thousands of years ago when the world was flat and the sky barge would pull the sun over the horizon into the underworld to travel back to the other side during the night, but nowadays? Not so much.
I have also used ElasticSearch on a number of projects and it is extremely simple to install, configure, use and manage. Most of the functions are trivially available over simple REST HTTP calls.
Unless you only need basic search and you desperately want to keep everything in the same database, I would split it out. Architecturally it makes more sense to split it out just to allow for different scaling needs.
With Elasticsearch and Solr, you can easily customize analysis and scoring. There are several scoring algorithms built into them [1] such as BM25 (the default now) which is considered the state of the art for keyword relevance. For analysis, you can remove stopwords, stem, apply synonyms, etc [2]. Elasticsearch is specifically designed to scale across multiple machines which is necessary for TB datasets. There are also things like "more like this" queries and context-aware spell checking. Some of that you can do with PG, but not all of it. If PG can do it, it is usually harder to set up.
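For example, a custom analyzer with stopword removal, stemming and synonyms is just index settings. A sketch against the Python client; the field name and synonym entries are made up, and the exact request shape varies a bit between ES versions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(index="articles", body={
    "settings": {
        "analysis": {
            "filter": {
                "english_stop":    {"type": "stop", "stopwords": "_english_"},
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "my_synonyms":     {"type": "synonym",
                                    "synonyms": ["laptop, notebook", "tv, television"]},
            },
            "analyzer": {
                "english_custom": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # synonyms before stemming so the synonym entries match raw tokens
                    "filter": ["lowercase", "english_stop",
                               "my_synonyms", "english_stemmer"],
                },
            },
        },
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "english_custom"},
        },
    },
})
```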
Rich library of tokenizers and analyzers, and a well-tested analysis model and pipeline. For full-text search, different scoring modes are supported beyond the trivial tf-idf model mentioned in the article; how are you going to do field-centric ranking in Postgres?
As far as I am concerned, it is far superior to the goodies mentioned in this article.
For many, the source of truth is still Postgres, but in many cases you may want some of the flexibility and search power that come along with ElasticSearch. An interesting project to look at may be ZomboDB, which lets you maintain ElasticSearch indexes directly from within Postgres and query them from within PG directly. (https://github.com/zombodb/zombodb)
Postgres was so much easier for me. In fact I cannot see the reason Elasticsearch is so popular. For our website search I first researched Elasticsearch. Each of the short pages of the so-called Getting Started guide (https://www.elastic.co/guide/en/elasticsearch/guide/current/...) felt like walking into either (a) a brick wall or (b) an empty room:
p. 1: sales pitch
p. 2: sales pitch
p. 3: "The only requirement for installing Elasticsearch is a recent version of Java." Ugh.
pp. 4-10: A new JSON API to learn. This is just the part about getting your stuff into Elasticsearch and, at last, doing a query that is no better than "select * from table where last_name like '%Smith%';"
p. 11: "More complicated searches," an apt title. Here is the Elasticsearch query that's equivalent to "select * from table where last_name like '%smith%' and age > 30":
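From memory, the example at that point looks something like this (reconstructed, so treat the field names and exact shape as approximate):

```python
# ~10 lines of query DSL vs. one line of SQL
query = {
    "query": {
        "bool": {
            "must":   {"match": {"last_name": "smith"}},
            "filter": {"range": {"age": {"gt": 30}}},
        }
    }
}
```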
It only gets more complicated from there. There is talk about ranking, but it was still a black box even by the end of the Getting Started guide, and I was going to want to tweak it. It just seemed more obvious how to do so with Postgres (https://www.postgresql.org/docs/9.6/static/textsearch-contro...).
How is Postgres merely "good enough"? How are other products better? What is their draw?
Is it ease of set-up? For me Elasticsearch was harder to install than Postgres and made me learn a bunch of new syntax, beyond the SQL I have to know anyway to write web applications in the first place. Postgres handles stemming, stopwords, ranking, snippets, and highlighting.
Is it better ranking? Postgres can weigh the search words' frequency, their proximity to each other, the length of the document, the number of unique words in the document, and the "mean harmonic distance between extents." I was able to further tweak the results by using other columns and joining other tables.
Is it faster search results? Postgres searches tens of thousands of pages in a split second.
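For concreteness, the stemming, stopwords, weighting and snippets mentioned above are all built in. A sketch over a hypothetical `pages` table (in production you'd keep a precomputed tsvector column with a GIN index instead of computing it per query):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()
q = "smith"  # user's search terms

cur.execute("""
    SELECT id, title,
           ts_rank(setweight(to_tsvector('english', title), 'A') ||
                   setweight(to_tsvector('english', body),  'B'),
                   plainto_tsquery('english', %(q)s)) AS rank,
           ts_headline('english', body,
                       plainto_tsquery('english', %(q)s)) AS snippet
      FROM pages
     WHERE to_tsvector('english', title || ' ' || body)
           @@ plainto_tsquery('english', %(q)s)
     ORDER BY rank DESC
     LIMIT 20
""", {"q": q})

for row in cur.fetchall():
    print(row)
```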
Last week I migrated 500M documents (100GB of data) from MongoDB to ES. It's quite simple and straightforward. I used the bulk API. Just make sure to have a good machine (16 vCPUs), use parallelization, and wait: after 24 hours, it just works.
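Something along these lines, using the Python client's bulk helpers (collection and index names are made up):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk
from pymongo import MongoClient

mongo = MongoClient()["app"]["documents"]    # assumed source collection
es = Elasticsearch("http://localhost:9200")

def actions():
    for doc in mongo.find({}, no_cursor_timeout=True):
        yield {
            "_index": "documents",
            "_id": str(doc.pop("_id")),
            "_source": doc,
        }

# parallel_bulk fans the generator out across several worker threads
for ok, info in parallel_bulk(es, actions(), thread_count=8, chunk_size=1000):
    if not ok:
        print("failed:", info)
```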
Elasticsearch can be installed in seconds with most package managers, and syncing is trivial depending on what you want to index. You can use a JDBC river for bulk indexing or rely on triggers.
Eh, that's more for logs specifically. The "syncing story" is something that depends on your technology platform and specific situation. I adopted ES when it had rivers, but never even considered using them for syncing.
> syncing is trivial depending on what you want to index
Can you explain how you keep your Elasticsearch index consistent with your main database in the face of a system failure (for example a network partition or a machine hard reboot)? This is usually the non-trivial part.
Plus it's in your database, which means it is trivial to limit the search results by any JOINable criteria, e.g. permissions, roles, groups, users, organisations or anything else you have in your database.
Also because it's your authoritative datastore you don't need to worry about keeping your secondary indexing system (solr/elasticsearch etc) in-sync.
Using Postgres for search just from a simplicity point of view can save many engineering hours especially if you don't have an expert in Solr/Elasticsearch available. This is even more true these days with SolrCloud needing a Zookeeper ensemble as well.
It's not that hard to keep your secondary indexes up to date with Postgres. You can set up external triggers with LISTEN/NOTIFY; i.e. every time a row is updated, Postgres will inform the index refreshing service that an update is needed.
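A sketch of that pattern with psycopg2 (table, channel and `reindex_document` helper are hypothetical; `EXECUTE FUNCTION` needs PG 11+, use `EXECUTE PROCEDURE` on older versions):

```python
import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=app")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()

# One-time setup: a trigger that notifies on every change
cur.execute("""
    CREATE OR REPLACE FUNCTION notify_document_change() RETURNS trigger AS $$
    BEGIN
        PERFORM pg_notify('documents_changed', NEW.id::text);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS documents_changed ON documents;
    CREATE TRIGGER documents_changed
        AFTER INSERT OR UPDATE ON documents
        FOR EACH ROW EXECUTE FUNCTION notify_document_change();
""")

cur.execute("LISTEN documents_changed;")
while True:
    if select.select([conn], [], [], 60) == ([], [], []):
        continue                                 # timeout, loop again
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        reindex_document(int(note.payload))      # hypothetical: push that row to ES
```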
For one, listen/notify in Postgres wakes up all listeners on each notify. You will want a single stream of indexing updates, yet you will also want it to be fault-tolerant; if one indexer isn't running, another must take over its work.
Listen/notify also doesn't work well if you want to distinguish between batch updates (large imports, for example) and realtime ones.
Thirdly, if you miss events, you will want the syncing to catch up from where it last left off.
The better design is to maintain state about indexing. If each table has a sequential ID and an "updated_at" column, you can exploit those to build deltas. Another option is to maintain a "changes" table that simply tracks the ID of each modified row, with a sequence number per change. This can be super useful for other purposes (auditing, debugging, support), and makes it easy to index from an earlier point in time.
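A sketch of that changes-table approach (table name and the checkpoint/re-index helpers are made up):

```python
import time

# DDL (run once):
#   CREATE TABLE document_changes (
#       seq    bigserial PRIMARY KEY,
#       doc_id bigint NOT NULL
#   );
# Rows are inserted by a trigger on documents, or directly by application code.

def index_deltas(pg, es, batch=1000):
    """Single indexer loop: remembers the last sequence number it applied, so it
    can catch up after downtime or be re-pointed at an earlier seq to re-index."""
    last_seq = load_checkpoint()                  # hypothetical helper
    while True:
        with pg, pg.cursor() as cur:
            cur.execute("""SELECT seq, doc_id FROM document_changes
                            WHERE seq > %s ORDER BY seq LIMIT %s""",
                        (last_seq, batch))
            rows = cur.fetchall()
        for seq, doc_id in rows:
            reindex_document(es, doc_id)          # hypothetical: fetch row, push to ES
            last_seq = seq
        save_checkpoint(last_seq)                 # hypothetical helper
        if len(rows) < batch:
            time.sleep(5)                         # caught up; poll again shortly
```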
Such a system will also let you more easily recover from hard failures in ES where entire shards (and their replicas) are lost (I have had it happen with 1.x, should be much less rare in 2.x).
Keeping one data store (ES) in sync with another (Postgres, or anything else) is an interesting problem in general.
> If each table has a sequential ID and an "updated_at" column, you can exploit those to build deltas.
It's hard because concurrent transactions can commit sequential IDs or timestamps out of order. There can be a lag between the moment when the sequential ID or the timestamp was generated, and the moment when it was committed.
> Another option is to maintain a "changes" table that simply tracks the ID of each modified row, with a sequence number per change.
Another, more recent method that's bound to be more foolproof is to track the actual transaction log. Postgres exposes an API for "logical decoding" of said log. The downside is that it's fairly new, Postgres-specific, and I also suspect it puts more burden on the client, which has to interpret the entries and determine if/how they should be indexed. But in theory it gives you a completely data-loss-proof, application-independent way of streaming changes from Postgres.
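A sketch of consuming the log with psycopg2's replication support and the built-in test_decoding plugin (wal2json is a common alternative); the slot name is made up, and the consumer still has to parse each change and decide whether it needs indexing:

```python
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=app",
    connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()

cur.create_replication_slot("es_sync", output_plugin="test_decoding")
cur.start_replication(slot_name="es_sync", decode=True)

def consume(msg):
    # msg.payload is a textual description of the change (table, operation, values)
    print(msg.payload)
    # acknowledge, so Postgres can discard WAL up to this point
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(consume)   # blocks, invoking consume() for every change
```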
Using NOTIFY puts the data into a message queue, so messages will just hold there until your client listens again. Eventually the channel queue will fill up (8GB by default), at which point NOTIFY will fail.
I don't think it will? When the client restarts, it will create a new session with the postgres server. This new session will not receive messages that are queued for the old session.
Ah, yes, I think you're right. I suppose you would want to represent a message queue using a table and have your daemon process remove things from the table as it adds things to elastic search. That way if the daemon crashes the events will just continue to buffer.
Same way you handle redundancy in general. You'll need multiple clients listening, and some sort of consensus algorithm to get them back into sync when one restarts.
One simple way (which involves doing double work) would be to have 2 clients, both of which listen to postgres and push to elastic search. Each sync will happen twice, but the sync will only fail to occur if both go down.
It's a little sad that people seem to forget that Sphinx Search exists. For quick full-text search implementation Sphinx is much easier than trying to spin up and understand Solr or ElasticSearch.
For a large number of sites, search could be improved greatly by simply setting up Sphinx with an SQL query for slurping up data from your backend database.
As somebody that has used both quite a bit in the last year, I have to disagree. Compared to Elastic Search, Sphinx is harder to install and configure, harder to scale, harder to work with,
offers fewer features and is improving more slowly. It might have been a competitor a few years ago, but now I can't see any reason to choose it.
Yes, but how do you make sure your Sphinx index is always synchronized with your SQL database, within a short window (a few seconds), and without too much pressure on the database?
Why is that necessary? One of the things which makes maintaining a search index easier is that typically your search datastore has much weaker consistency and durability requirements than your primary one.
Because in our application, users can add new documents, and will be confused if the document they just added is missing from the search results (which by default show the most recent documents). Considering that the Elasticsearch website and documentation emphasize "real-time" features, I think it's not a weird requirement :-)
That's a fair point, but most other search solutions have the same issue. If you can't deal with search being a bit behind your actual content, and you want something better than the built in Postgresql full text search, you can live update a sphinx search index. It's a bit more work, but very easy to do.
I was mostly thinking of Sphinx as a solution for smallish sites that just need the search box to work, even though it works just as well for sites like Craigslist (though I don't think they use the SQL data source).
TBH if you're looking to do something serious & consumer facing you may want to look into paying for Google's Custom Search API, or, if your data set differs dramatically in shape from webpages, hiring a really good data scientist and rolling your own.
I'm an ex-Googler who's now working fairly extensively with ElasticSearch. While I've been impressed at how easy it is to get started, all the published academic ranking algorithms are utter crap compared to what Google has. Okapi BM25 is mentioned as the state-of-the-art up-thread; it was developed in the 80s, long before Google was born, and most of the advances in search since the 90s have happened behind closed corporate doors inside Google, Yahoo, and Bing.
This is solid advice. At my last company we had SOLR. I think we would have even qualified for what most people would call a "fair" use case of a search stack. Millions of items in the catalog, and millions of users to go along with it. Lucene is light years faster than a SQL engine with such requirements for search.
What is it not better at, though? Pretty much everything else: GIS, cross joins, parent/child indexing, etc. Postgres can do it all too with a little digging.
The thing neither of them has? Semantic search. If I ever hit that scale again at another company, the jump won't be from SQL to Lucene, but from SQL to Google.
You can do semantic search with Solr and ES, it just isn't built in. But they have support for stemming and synonyms, which are the building blocks. You have to create the synonyms file yourself with word2vec or a similar algorithm.
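A rough sketch of generating such a file with gensim (the corpus, terms and similarity cutoff are made up; the output uses the comma-separated synonym format Solr/ES expect):

```python
from gensim.models import Word2Vec

# `sentences` is assumed to exist: an iterable of tokenized documents from your corpus.
# gensim 4.x uses vector_size; 3.x called the same parameter `size`.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

with open("synonyms.txt", "w") as f:
    for term in ["laptop", "phone", "couch"]:          # terms you care about
        similar = [word for word, score in model.wv.most_similar(term, topn=5)
                   if score > 0.7]                     # arbitrary cutoff
        if similar:
            f.write(", ".join([term] + similar) + "\n")
```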
Has Google and/or Microsoft made significant advances in keyword relevance algorithms specifically? Google is a combination of keyword relevance + reputation (PageRank) + semantics (word2vec) + AI + other stuff, which combined make it the best search engine in the world. Is the keyword relevance part really that much better than BM25?
This is a great suggestion. If the site requires SEO, why bother rolling another search engine? Everyone is just going to complain that it isn't Google anyway.
Can you describe a use case where Postgres performs badly, or why you prefer XYZ?
In my experience, on commercial projects, a Postgres 9.4+ GIN tag/keyword search index is very fast over data on the order of 30 to 100 million records, provided you have a host with SSD disks.
I call this size data "MID" data, not "BIG" data, and it seems applicable to a lot of startups before they get to a million customers, and for a lot of real world customer data use cases.
Also, a lot of startups need to handle GPS geo-coordinates. Having used them, I have to say that the primitives that come with PostGIS are pretty superb, and easily integrated.
Outside startup-land, there are a lot of companies using horrible data systems, dying to get to their data quickly, who could be much better served with postgres via a web front end. ( Lowering the cost/time to build that is something I'm working on )
So if:
* you want fast tag search
* your data has locations or nearby-x
* you'll have < 1bn rows for some time
* your data is worth money
then imo, postgres+postGIS+GIN+SSD is pretty compelling.
That covers a lot of startup and business surface area.
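Concretely, the kind of query I mean (a sketch over a hypothetical `items` table with a `tags text[]` column and a PostGIS `geography` column named `location`):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

# GIN index for array containment, GiST index for the PostGIS geography column
cur.execute("""
    CREATE INDEX IF NOT EXISTS items_tags_gin  ON items USING gin  (tags);
    CREATE INDEX IF NOT EXISTS items_geom_gist ON items USING gist (location);
""")

# tag search + "nearby" filter in one query
cur.execute("""
    SELECT id, name
      FROM items
     WHERE tags @> %s::text[]                           -- has all of these tags
       AND ST_DWithin(location,
                      ST_MakePoint(%s, %s)::geography,  -- lon, lat
                      %s)                               -- radius in metres
     LIMIT 50
""", (["coffee", "wifi"], -122.42, 37.77, 2000))
```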
I don't think it's the number of users necessarily, it's about the expectations. People seem to expect Google quality search behind every search box on the web. The search needs to find the relevant information, first time, every time.
So it's more about being able to find what people are looking for, not the number of users you have. And to be fair, almost all search solutions currently available will scale beyond what most of us need in terms of number of users or the size of our datasets.
Saying Postgres is "good enough" acknowledges that Elasticsearch is "better" but, based on minimum viable product requirements, may not be necessary. It's a fair point that is mostly valid when approaching a deadline on a fixed-bid contract.
Given enough sprints, the search engine will eventually be run by an external service. +1 for the Google Custom Search suggestion. If SEO matters to the problem domain, might as well leverage those carefully indexed documents.
I usually choose in favor of using Postgres for RDBMS and search, but I like to write a thin abstraction layer in case I later need to change my mind. The overhead (coding wise) for doing this is fairly small and this approach can save having to change a lot of application code later.
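A sketch of the kind of thin layer I mean, so the rest of the app only sees one small interface (names and schema are made up; the Postgres variant assumes a precomputed tsvector column maintained by a trigger):

```python
from typing import List, Protocol

class SearchBackend(Protocol):
    def index(self, doc_id: int, fields: dict) -> None: ...
    def search(self, query: str, limit: int = 20) -> List[int]: ...

class PostgresSearch:
    def __init__(self, conn):
        self.conn = conn

    def index(self, doc_id, fields):
        pass  # nothing to do: the trigger keeps the tsv column up to date on write

    def search(self, query, limit=20):
        with self.conn.cursor() as cur:
            cur.execute("""SELECT id FROM documents
                            WHERE tsv @@ plainto_tsquery('english', %s)
                            ORDER BY ts_rank(tsv, plainto_tsquery('english', %s)) DESC
                            LIMIT %s""", (query, query, limit))
            return [row[0] for row in cur.fetchall()]

class ElasticsearchSearch:
    def __init__(self, es):
        self.es = es

    def index(self, doc_id, fields):
        self.es.index(index="documents", id=doc_id, body=fields)

    def search(self, query, limit=20):
        hits = self.es.search(index="documents", size=limit,
                              body={"query": {"match": {"body": query}}})["hits"]["hits"]
        return [int(h["_id"]) for h in hits]
```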
I was going to post the same thing. As far as I can tell, there are no term or document frequencies in Postgres. Then again, I have seen several Lucene implementations with either or both disabled. Can you even get an explain back from Postgres? The main difference between a database and a search engine (a subtype) is the ability to score and rank. Taking away the ability to score effectively diminishes the value.
BM25 is simply TFIDF with a couple of tunable parameters. These parameters will help with precisely the issues which force people to disable TF (saturating the value early, ...). From what I can see, Elasticsearch is mainly used as an analytics/logging engine and not search. Many NLP heavy organizations are using Solr (or straight Lucene).
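For reference, the BM25 score of a document D for a query Q is just TF-IDF with term-frequency saturation and length normalization:

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t)\,
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

where f(t, D) is the frequency of term t in D, |D| is the document length, avgdl is the average document length, and k1 and b (commonly around 1.2 and 0.75) are exactly the two tunable parameters mentioned above.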