I just yanked elasticsearch out of an app and replaced it with PG's full text search.
It's like shedding a 500# pack from my shoulders. Everything is so much simpler now -- dev is easier, testing is easier, deployment is easier. Infrastructure costs are down.
Elasticsearch is a pretty neat tool, but what I was using it for was heavy-handed. I only chose it because it was what everyone else was using.
I did the exact same thing with Solr a few years back. PG performed equally for speed, the results were more accurate, and the data inconsistencies vanished. Solr was such a black box to debug that it couldn't be trusted.
After I left the dev hired to replace me put Solr back.
We had about 14 million records, and the full text search covered three varchar(255) columns and a text column that could get very long depending on the user. These were product auction listings.
Thanks. That's not a small dataset. It's significant you got good enough performance on this. How many docs matched a typical query? And what was a typical response time?
At the time, when we removed Solr and replaced it with an assembled query, the performance was at least equal and our data issues vanished.
We were having serious issues with keeping the Solr dataset in sync with changes, invalidating old ones and getting new ones to appear. A user pays to create a listing, they expect to see the listing in the search...and sometimes that wasn't happening. When a listing expired, sometimes it wasn't going away. Being able to just keep this stuff up to date with a simple PG trigger completely eliminated the problems and made it easy to tune.
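For reference, a minimal sketch of that kind of trigger, against a hypothetical listings table (the real schema was different):

    -- Precomputed tsvector column, kept current by a trigger
    ALTER TABLE listings ADD COLUMN tsv tsvector;

    CREATE FUNCTION listings_tsv_update() RETURNS trigger AS $$
    BEGIN
      NEW.tsv := setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A')
              || setweight(to_tsvector('english', coalesce(NEW.description, '')), 'B');
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER listings_tsv_trigger
      BEFORE INSERT OR UPDATE ON listings
      FOR EACH ROW EXECUTE PROCEDURE listings_tsv_update();

    CREATE INDEX listings_tsv_idx ON listings USING gin(tsv);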
The search used a combination of full text, categories, geographic distance, user names and a couple of other filters. Being able to simply construct each of those pieces in a query, sometimes filtering parts with subqueries, etc. was really effective. There were a few indexes involved.
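Roughly the shape of it, with purely illustrative table and column names (the geo part assumes the cube/earthdistance extensions):

    SELECT l.id, l.title, ts_rank(l.tsv, q) AS rank
    FROM listings l,
         to_tsquery('english', 'vintage & camera') q
    WHERE l.tsv @@ q
      AND l.category_id = 42
      AND l.expires_at > now()
      AND earth_distance(ll_to_earth(l.lat, l.lng),
                         ll_to_earth(45.5, -122.6)) < 50000  -- within 50 km
    ORDER BY rank DESC
    LIMIT 25;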
Keep in mind, this is also before I knew anything about Postgres partitioning with PG Partman or how to create partial indexes. I could have made it a lot more efficient than it was, in hindsight.
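For example, a partial index covering only live listings would have kept the index much smaller (assuming a boolean active flag; a now()-based predicate wouldn't work here, since partial index predicates must be immutable):

    -- Expired listings never show up in search, so don't index them
    CREATE INDEX listings_active_tsv_idx ON listings USING gin(tsv)
      WHERE active;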
The experience was good enough, though, that I don't even begin to think about pulling in a 3rd party search tool unless I have project requirements that make it unavoidable. Elasticsearch is really good for use cases with constant streaming data ingestion, for example. If I just need search for data that's already in the database, though... it's really hard to justify.
I did the same thing to one project a few years ago. It was a major win. Even with decent library support, keeping data synced and indexed from PG <-> ES was such a burden.
Yeah, if you have a simple-ish app (or even a not-so-simple app) you can probably benefit from playing around with how PG does search, without needing ES/Solr/etc. But it really depends what you do, obviously.
The biggest issue with Postgres search is the inability to use TF-IDF or BM25 (the latter being the current default and state of the art in Elasticsearch). The built-in ranking system does not produce very relevant results.
Anyone who is familiar with PG internals: is there something in the internal data structures that prevents BM25- or TF-IDF-style rank generation?
I have a theory that if they incorporate these algorithms, it makes postgres potent enough that a lot of people may choose not to use elasticsearch/lucene.
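For context, what Postgres ships instead is ts_rank/ts_rank_cd, which score on term frequency and proximity within a document but know nothing about corpus-wide statistics like inverse document frequency. A minimal example, with a hypothetical documents table:

    -- "tsv" is an assumed precomputed tsvector column
    SELECT title, ts_rank_cd(tsv, query) AS rank
    FROM documents, to_tsquery('english', 'postgres & search') query
    WHERE tsv @@ query
    ORDER BY rank DESC
    LIMIT 10;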
But I would still use Elasticsearch since (a) everyone else does so lots of support, (b) ridiculous amount of plugins, drivers and rivers, (c) PostgreSQL has a far weaker scalability story, (d) Kibana is a great UI to help with debugging, (e) you can use ELK for your monitoring as well as search.
I could definitely come up with a few more, but Elasticsearch is a product that does search really well.
The first solution uses a materialised view. It essentially precomputes what the query can possibly return, and uses up additional disk space, plus WAL traffic unless it is configured as unlogged. There has been no mention of how long it takes to rebuild such a materialised view: you will still have full table scans then, just at build time, against the primary read instance.
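For reference, a minimal version of that first approach (names are illustrative):

    CREATE MATERIALIZED VIEW search_index AS
    SELECT id, title,
           to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, '')) AS tsv
    FROM posts;

    -- The unique index is required for REFRESH ... CONCURRENTLY
    CREATE UNIQUE INDEX ON search_index (id);
    CREATE INDEX ON search_index USING gin(tsv);

    -- The rebuild re-runs the full query against the base table;
    -- CONCURRENTLY avoids locking out readers while it happens
    REFRESH MATERIALIZED VIEW CONCURRENTLY search_index;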
The second solution is better, but please read the documentation [1] before replicating it. Also, to_tsvector takes an optional regconfig parameter [2], and here is what I get right now:
    $ psql
    psql (9.6.2)
    Type "help" for help.

    evadne=# \dF
                   List of text search configurations
       Schema   |    Name    |              Description
    ------------+------------+---------------------------------------
     pg_catalog | danish     | configuration for danish language
     pg_catalog | dutch      | configuration for dutch language
     pg_catalog | english    | configuration for english language
     pg_catalog | finnish    | configuration for finnish language
     pg_catalog | french     | configuration for french language
     pg_catalog | german     | configuration for german language
     pg_catalog | hungarian  | configuration for hungarian language
     pg_catalog | italian    | configuration for italian language
     pg_catalog | norwegian  | configuration for norwegian language
     pg_catalog | portuguese | configuration for portuguese language
     pg_catalog | romanian   | configuration for romanian language
     pg_catalog | russian    | configuration for russian language
     pg_catalog | simple     | simple configuration
     pg_catalog | spanish    | configuration for spanish language
     pg_catalog | swedish    | configuration for swedish language
     pg_catalog | turkish    | configuration for turkish language
    (16 rows)
It's important to note that a simple to_tsquery(...) will not handle phrases or prefixes, or even normalize the input. There are plainto_tsquery(...) and phraseto_tsquery(...) functions, but these don't have the same flexibility and also won't do prefixes.
If you want some normalization and prefix abilities, you can actually call to_tsvector(...) on your query input, then stitch that together into a better tsquery input.
    -- Turns 'testing 123' into the prefix query 'test:* & 123:*'
    SELECT STRING_AGG(lexeme || ':*', ' & ')
    FROM UNNEST(to_tsvector('testing 123'));
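The output ('test:* & 123:*' for this input) can then be handed back to to_tsquery, giving you normalized prefix matching. A sketch, assuming a hypothetical documents table with a tsvector column:

    WITH q AS (
      SELECT to_tsquery(STRING_AGG(lexeme || ':*', ' & ')) AS query
      FROM UNNEST(to_tsvector('testing 123'))
    )
    SELECT d.title
    FROM documents d, q
    WHERE d.tsv @@ q.query;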
We've always used pg_trgm with GiST indexes, with the same sort of materialized view described here, containing unaccented search terms. One of the big benefits here is we can then do really fast '10 best matches' by text similarity. Obviously this is a slightly different use case than full-text over largish documents, but it's great when you've got a database of foreign names that people only vaguely know how to spell.
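A minimal sketch of that pattern, with a hypothetical names table (in our setup the materialized view holds the already-unaccented terms):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE INDEX names_trgm_idx ON names USING gist (name gist_trgm_ops);

    -- "10 best matches": <-> is the trigram distance operator, and it's
    -- KNN-capable, so the GiST index can satisfy the ORDER BY directly
    SELECT name, similarity(name, 'jhon smeeth') AS sim
    FROM names
    ORDER BY name <-> 'jhon smeeth'
    LIMIT 10;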
For some production use cases it is possible; for others it is not, or is extremely difficult. There are a few reasons for this.
The quality of search is a combination of precision (ordering by relevance) and recall (finding the relevant documents in a corpus). PG FTS is primarily a recall engine, since it does not have BM25, TF/IDF, etc. built in. Out of the box, it is more like an easier-to-use regex than a full search engine. If you have a dataset with small N or relatively unique records, recall-based algorithms will still give good precision, simply because there aren't that many results for a given query. With larger N or lots of similar records, though, precision becomes harder and you start needing BM25, machine learning, etc.
It is possible to incorporate other relevance signals (e.g. recency, popularity) to improve precision in PG, but depending on the specific signal, it can be more difficult than in ES or Solr. Things like autocomplete, supporting multiple languages, spell check, and scaling are also harder to implement in PG at the same quality and performance as ES/Solr.
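As a rough sketch, blending a simple recency decay into the text rank might look like this (table names and weights purely illustrative):

    SELECT id, title
    FROM documents, to_tsquery('english', 'postgres') q
    WHERE tsv @@ q
    -- divide the text rank by (1 + age in days) to favor newer docs
    ORDER BY ts_rank(tsv, q)
             / (1 + extract(epoch FROM now() - published_at) / 86400.0) DESC
    LIMIT 20;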
Syncing the database and search server is a real problem, so I usually recommend sticking with PG if you have a recall-based search problem. But if precision, or one of those other features I mentioned, is important, then I recommend ES/Solr; just deal with the syncing problem, because overall it will be easier.
The upside: only one service to maintain (PostgreSQL) instead of two (PostgreSQL + ElasticSearch) and not having to synchronize PostgreSQL and ElasticSearch (which is more difficult than it sounds).
The downside: ranking can be slow when your keywords match a large number of documents (have a look at github.com/postgrespro/rum for a solution), and faceting is more difficult to implement than in ElasticSearch.
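For reference, rum stores ranking information in the index itself, so the ORDER BY can come straight off the index instead of recomputing the rank for every matching row. A sketch with illustrative names, after installing the extension:

    CREATE EXTENSION rum;

    CREATE INDEX docs_rum_idx ON documents USING rum (tsv rum_tsvector_ops);

    -- <=> is rum's rank-distance operator
    SELECT id, title
    FROM documents
    WHERE tsv @@ to_tsquery('english', 'postgres & search')
    ORDER BY tsv <=> to_tsquery('english', 'postgres & search')
    LIMIT 10;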
Solr and Elasticsearch are not relational database systems and have none of the optimizations for relational data that Postgres does. They are built on top of Lucene and target document retrieval via an inverted index [1] rather than dealing with relational data.
It's not just relational features that matter but the decades of work on reliability, performance, flexibility, security, and general usability of these databases that makes them so great.
ElasticSearch doesn't have ACID or very good OLTP manners. More importantly, it still continues to have data loss issues so it's not a good fit as a primary datastore.
It's not about results, the data itself within ElasticSearch is not guaranteed safe as a primary datastore, so you should have a reliable database for the primary data that is then synced to ES.
That's good for you, but it sounds like you're extrapolating from just your own experience rather than research? Perhaps you've been lucky? Do you really have the information to objectively state it as "simply not true"?
Read the Jepsen tests (old, but they show core problems that aren't fixed):
After reading the links you gave, I’d like to clarify that I meant that edge cases that can cause data loss are not common enough, in my experience, for one to be forced to treat ElasticSearch as an unreliable data store. Anecdotally, I’ve never seen it happen. You’re right, maybe I’m lucky; I would like to see a case study of some project where these issues were significant enough to treat ElasticSearch as an unreliable data store.
I've never built something that did not need a relational database to model the domain, and search is such a common UX element that it's taken for granted in everything you use.
I can't even think of something that could use search that's not relational. Can you?
> I've never built something that did not need a relational database to model the domain
Yeah, so perhaps you don't have much experience from outside the relational model. Makes perfect sense to see a single truth then.
> I can't even think of something that could use search that's not relational. Can you?
Not sure how wide a blanket you're throwing by "not relational", but here are a couple examples that might fit: Google's search engine, most filesystems and their files, hierarchical and flat data structures in general, twitter.
Elasticsearch stores data in a JSON document store style format, so in any situation where you need O(1) lookup for all data related to a particular entity, you can have nested structures within the JSON that facilitate this. In an RDBMS it is O(n), since you have to do a costly join for each embedded structure.
A document store is not a replacement for a relational database, and there are few cases where it's truly a better fit.
More importantly, ES is not a document store. It is a search system that can be used as a similarity index between all kinds of data, most commonly json docs but can be anything like images or audio waveforms.
ES can return none of the source data or not even store it in the first place. In addition, it lacks ACID and transactions and generally has an overall poor reputation for data integrity.
If you just want a json blob, you can already use a relational database column for it and get the key/value performance semantics, while still retaining all the reliability and manageability properties.
If you want to save JSON documents, you can pick Postgres [1] or MySQL and still get a full RDBMS for free on top.
Elasticsearch does that too, but its access certainly isn't O(1) - it has a complexity of O(log n), just like Postgres or MySQL would have, because all of them use tree-like index data structures. (Please note that ElasticSearch is not meant as a data storage layer)
Now, if you just want to have the JSON, you're done. O(log n) for ElasticSearch, Postgres or MySQL. That's it.
But normally, you want to do something with that data. You have a user and want all the products. Or you have a book and want all the authors, because you have a real app to write.
Most data in our world is relational: a student can participate in courses, a store sells products, and so on.
Querying your database using JOINs will make use of the indexes in each of the tables that you have previously set up.
These indexes will prevent the theoretical worst-case (the "Full Table Scan", where your complexity is O(n) for the table). The query planner will decide which index is better to use for the optimal response time.
You can, however, choose not to leverage this feature of an RDBMS. What do you do now if you need all the courses of a student? Do you get all the JSONs and compare their IDs? Do you parse the JSON keys in your own app? Having your app iterate over all the JSON keys already has O(n) complexity. So you end up recreating tree-like indexes anyway, somewhere in your stack.
But all of that is already provided for you, with decades of debugging and performance optimization - for free.
To sum up: Use Postgres JSONB type if you only need to write JSON documents. You might later want to query the nested JSON data structure and join it. I highly recommend "Mastering PostgreSQL in Application Development" [2]
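A small, hypothetical example of both halves of that: store documents now, and still query them relationally later:

    -- JSONB column with a GIN index for containment queries
    CREATE TABLE events (id bigserial PRIMARY KEY, payload jsonb);
    CREATE INDEX events_payload_idx ON events USING gin (payload);

    -- Document-style lookup, served by the GIN index
    SELECT * FROM events WHERE payload @> '{"type": "signup"}';

    -- ...and the nested data can still be joined against a relational table
    SELECT u.name, e.payload->>'ip' AS ip
    FROM events e
    JOIN users u ON u.id = (e.payload->>'user_id')::bigint;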
The comparison was between ElasticSearch and a relational store. PostgreSQL's JSON type is not relational. And if you are fetching an entity by primary key, it should be O(1). The point was that document stores offer you the ability to nest data structures while relational forces you to join. And so if that is your query/data pattern, then a document store can be orders of magnitude faster than a purely relational store, e.g. a 360-degree customer view.
Primary key fetching will always be O(log n) due to how indexes are implemented. Access time complexity will never be independent of the item count in your data store, relational or not.
There's not many use cases I come across where the 'nested' data is the only instance of that data. A student record might have some nested class information, but the class information itself lives as a master someplace else. If you nest things, you now have data living in multiple places, no? Updates need to update multiple locations, which would potentially introduce lots of errors.
I've got a handful of situations where nested structures are making sense, but it doesn't seem to be the norm (at least in the data worlds I work in).
>Is it possible to have production-quality search using just postgresql?
It depends on your search use case. If you are doing fairly simple searches such as trying to find, say, usernames, you can certainly have production grade search using Postgres.
However, if you have to handle searches that are not as targeted, Postgres will not be enough. One example is searching with synonyms - if you want a record containing "automobile" to match queries containing the term "car", ElasticSearch will serve you better.
Postgres is a relational data store and ElasticSearch is an information retrieval system; each focuses on its intended use case. The fact that Postgres supplies some elementary search capabilities does not mean that it can serve a bona fide information retrieval use case well.
Depends on what you call production-quality, but definitely possible.
Simplicity of having it all in 1 system is great, however the search quality itself is usually average and configurability is limited. A real search engine will give you much better relevance rankings, facets, fuzzy search, and other fancy things you can do with a general "similarity index" like recommendations.