I just yanked elasticsearch out of an app and replaced it with PG's full text search.
It's like shedding a 500# pack from my shoulders. Everything is so much simpler now -- dev is easier, testing is easier, deployment is easier. Infrastructure costs are down.
Elasticsearch is a pretty neat tool, but what I was using it for was heavy-handed. I only chose it because it was what everyone else was using.
I did the exact same thing with Solr a few years back. PG performed equally for speed, the results were more accurate, and the data inconsistencies vanished. Solr was such a black box to debug that it couldn't be trusted.
After I left the dev hired to replace me put Solr back.
We had about 14 million records, and the full text search covered three varchar(255) columns and a text column that could get very long depending on the user. These were product auction listings.
Thanks. That's not a small dataset. It's significant you got good enough performance on this. How many docs matched a typical query? And what was a typical response time?
At the time, when we removed Solr and replaced it with an assembled query, the performance was at least equal and our data issues vanished.
We were having serious issues with keeping the Solr dataset in sync with changes, invalidating old ones and getting new ones to appear. A user pays to create a listing, they expect to see the listing in the search...and sometimes that wasn't happening. When a listing expired, sometimes it wasn't going away. Being able to just keep this stuff up to date with a simple PG trigger completely eliminated the problems and made it easy to tune.
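For reference, a minimal sketch of that kind of trigger, against a hypothetical listings table (the real schema was different):

    -- Precomputed tsvector column, kept current by a trigger
    ALTER TABLE listings ADD COLUMN tsv tsvector;

    CREATE FUNCTION listings_tsv_update() RETURNS trigger AS $$
    BEGIN
      NEW.tsv := setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A')
              || setweight(to_tsvector('english', coalesce(NEW.description, '')), 'B');
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER listings_tsv_trigger
      BEFORE INSERT OR UPDATE ON listings
      FOR EACH ROW EXECUTE PROCEDURE listings_tsv_update();

    CREATE INDEX listings_tsv_idx ON listings USING gin(tsv);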
The search used a combination of full text, categories, geographic distance, user names and a couple of other filters. Being able to simply construct each of those pieces in a query, sometimes filtering parts with subqueries, etc. was really effective. There were a few indexes involved.
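Roughly the shape of it, with purely illustrative table and column names (the geo part assumes the cube/earthdistance extensions):

    SELECT l.id, l.title, ts_rank(l.tsv, q) AS rank
    FROM listings l,
         to_tsquery('english', 'vintage & camera') q
    WHERE l.tsv @@ q
      AND l.category_id = 42
      AND l.expires_at > now()
      AND earth_distance(ll_to_earth(l.lat, l.lng),
                         ll_to_earth(45.5, -122.6)) < 50000  -- within 50 km
    ORDER BY rank DESC
    LIMIT 25;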
Keep in mind, this is also before I knew anything about Postgres partitioning with PG Partman or how to create partial indexes. I could have made it a lot more efficient than it was, in hindsight.
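For example, a partial index covering only live listings would have kept the index much smaller (assuming a boolean active flag; a now()-based predicate wouldn't work here, since partial index predicates must be immutable):

    -- Expired listings never show up in search, so don't index them
    CREATE INDEX listings_active_tsv_idx ON listings USING gin(tsv)
      WHERE active;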
The experience was good enough, though, that I don't even begin to think about pulling in a 3rd party search tool unless I have project requirements that make it unavoidable. Elasticsearch is really good for use cases with constant streaming data ingestion, for example. If I just need search for data that's already in the database, though... it's really hard to justify.
I did the same thing to one project a few years ago. It was a major win. Even with decent library support, keeping data synced and indexed from PG <-> ES was such a burden.
Yeah, if you have a simple-ish app (or even a not-so-simple app) you can probably benefit from playing around with how PG does search, without needing ES/Solr/etc. But it really depends what you do, obviously.
The biggest issue with Postgres search is the inability to use TF-IDF or BM25 (the latter being the current default and state of the art in Elasticsearch). The built-in ranking system does not produce very relevant results.
Anyone who is familiar with PG internals: is there something in the internal data structures that prevents BM25- or TF-IDF-style rank generation?
I have a theory that if they incorporate these algorithms, it makes postgres potent enough that a lot of people may choose not to use elasticsearch/lucene.
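For context, what Postgres ships instead is ts_rank/ts_rank_cd, which score on term frequency and proximity within a document but know nothing about corpus-wide statistics like inverse document frequency. A minimal example, with a hypothetical documents table:

    -- "tsv" is an assumed precomputed tsvector column
    SELECT title, ts_rank_cd(tsv, query) AS rank
    FROM documents, to_tsquery('english', 'postgres & search') query
    WHERE tsv @@ query
    ORDER BY rank DESC
    LIMIT 10;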
But I would still use Elasticsearch since (a) everyone else does so lots of support, (b) ridiculous amount of plugins, drivers and rivers, (c) PostgreSQL has a far weaker scalability story, (d) Kibana is a great UI to help with debugging, (e) you can use ELK for your monitoring as well as search.
I could definitely come up with a few more, but Elasticsearch is a product that does search really well.
The first solution uses a materialised view. It essentially precomputes what the query can possibly return, and uses up additional disk space, plus WAL traffic unless it is configured as unlogged. There has been no mention of how long it takes to rebuild such a materialised view: you will still have full table scans then, just at build time, against the primary read instance.
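For reference, a minimal version of that first approach (names are illustrative):

    CREATE MATERIALIZED VIEW search_index AS
    SELECT id, title,
           to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, '')) AS tsv
    FROM posts;

    -- The unique index is required for REFRESH ... CONCURRENTLY
    CREATE UNIQUE INDEX ON search_index (id);
    CREATE INDEX ON search_index USING gin(tsv);

    -- The rebuild re-runs the full query against the base table;
    -- CONCURRENTLY avoids locking out readers while it happens
    REFRESH MATERIALIZED VIEW CONCURRENTLY search_index;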
The second solution is better, but please read the documentation [1] before replicating it. Also, to_tsvector takes an optional regconfig parameter [2], and here is what I get right now:
    $ psql
    psql (9.6.2)
    Type "help" for help.

    evadne=# \dF
                   List of text search configurations
       Schema   |    Name    |              Description
    ------------+------------+---------------------------------------
     pg_catalog | danish     | configuration for danish language
     pg_catalog | dutch      | configuration for dutch language
     pg_catalog | english    | configuration for english language
     pg_catalog | finnish    | configuration for finnish language
     pg_catalog | french     | configuration for french language
     pg_catalog | german     | configuration for german language
     pg_catalog | hungarian  | configuration for hungarian language
     pg_catalog | italian    | configuration for italian language
     pg_catalog | norwegian  | configuration for norwegian language
     pg_catalog | portuguese | configuration for portuguese language
     pg_catalog | romanian   | configuration for romanian language
     pg_catalog | russian    | configuration for russian language
     pg_catalog | simple     | simple configuration
     pg_catalog | spanish    | configuration for spanish language
     pg_catalog | swedish    | configuration for swedish language
     pg_catalog | turkish    | configuration for turkish language
    (16 rows)
It's important to note that a simple to_tsquery(...) will not handle phrases or prefixes, or even normalize the input. There are plainto_tsquery(...) and phraseto_tsquery(...) functions, but these don't have the same flexibility and also won't do prefixes.
If you want some normalization and prefix abilities, you can actually call to_tsvector(...) on your query input, then stitch that together into a better tsquery input.
    -- Turns 'testing 123' into the prefix query 'test:* & 123:*'
    SELECT STRING_AGG(lexeme || ':*', ' & ')
    FROM UNNEST(to_tsvector('testing 123'));
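The output ('test:* & 123:*' for this input) can then be handed back to to_tsquery, giving you normalized prefix matching. A sketch, assuming a hypothetical documents table with a tsvector column:

    WITH q AS (
      SELECT to_tsquery(STRING_AGG(lexeme || ':*', ' & ')) AS query
      FROM UNNEST(to_tsvector('testing 123'))
    )
    SELECT d.title
    FROM documents d, q
    WHERE d.tsv @@ q.query;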
We've always used pg_trgm with GiST indexes, with the same sort of materialized view described here, containing unaccented search terms. One of the big benefits here is we can then do really fast '10 best matches' by text similarity. Obviously this is a slightly different use case than full-text over largish documents, but it's great when you've got a database of foreign names that people only vaguely know how to spell.
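A minimal sketch of that pattern, with a hypothetical names table (in our setup the materialized view holds the already-unaccented terms):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE INDEX names_trgm_idx ON names USING gist (name gist_trgm_ops);

    -- "10 best matches": <-> is the trigram distance operator, and it's
    -- KNN-capable, so the GiST index can satisfy the ORDER BY directly
    SELECT name, similarity(name, 'jhon smeeth') AS sim
    FROM names
    ORDER BY name <-> 'jhon smeeth'
    LIMIT 10;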
For some production use cases it is possible; for others it is not, or is extremely difficult. There are a few reasons for this.
The quality of search is a combination of precision (ordering by relevance) and recall (finding the relevant documents in a corpus). PG FTS is primarily a recall engine, since it does not have BM25, TF/IDF, etc. built in. Out of the box, it is more like an easier-to-use regex than a full search engine. If you have a dataset with small N or relatively unique records, recall-based algorithms will still give good precision, simply because there aren't that many results for a given query. With larger N or lots of similar records, though, precision becomes harder and you start needing BM25, machine learning, etc.
It is possible to incorporate other relevance signals (e.g. recency, popularity) to improve precision in PG, but depending on the specific signal, it can be more difficult than in ES or Solr. Things like autocomplete, supporting multiple languages, spell check, and scaling are also harder to implement in PG at the same quality and performance as ES/Solr.
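As a rough sketch, blending a simple recency decay into the text rank might look like this (table names and weights purely illustrative):

    SELECT id, title
    FROM documents, to_tsquery('english', 'postgres') q
    WHERE tsv @@ q
    -- divide the text rank by (1 + age in days) to favor newer docs
    ORDER BY ts_rank(tsv, q)
             / (1 + extract(epoch FROM now() - published_at) / 86400.0) DESC
    LIMIT 20;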
Syncing the database and search server is a real problem, so I usually recommend sticking with PG if you have a recall-based search problem. But if precision, or one of those other features I mentioned, is important, then I recommend ES/Solr; just deal with the syncing problem, because overall it will be easier.
The upside: only one service to maintain (PostgreSQL) instead of two (PostgreSQL + ElasticSearch) and not having to synchronize PostgreSQL and ElasticSearch (which is more difficult than it sounds).
The downside: ranking can be slow when your keywords match a large number of documents (have a look at github.com/postgrespro/rum for a solution), and faceting is more difficult to implement than in ElasticSearch.
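For reference, rum stores ranking information in the index itself, so the ORDER BY can come straight off the index instead of recomputing the rank for every matching row. A sketch with illustrative names, after installing the extension:

    CREATE EXTENSION rum;

    CREATE INDEX docs_rum_idx ON documents USING rum (tsv rum_tsvector_ops);

    -- <=> is rum's rank-distance operator
    SELECT id, title
    FROM documents
    WHERE tsv @@ to_tsquery('english', 'postgres & search')
    ORDER BY tsv <=> to_tsquery('english', 'postgres & search')
    LIMIT 10;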
Solr and Elasticsearch are not relational database systems and have none of the optimizations for relational data that Postgres does. They are built on top of Lucene and target document retrieval via an inverted index [1] rather than dealing with relational data.
It's not just relational features that matter but the decades of work on reliability, performance, flexibility, security, and general usability of these databases that makes them so great.
ElasticSearch doesn't have ACID or very good OLTP manners. More importantly, it still continues to have data loss issues so it's not a good fit as a primary datastore.
It's not about results, the data itself within ElasticSearch is not guaranteed safe as a primary datastore, so you should have a reliable database for the primary data that is then synced to ES.
That's good for you, but it sounds like you're extrapolating from just your own experience rather than research? Perhaps you've been lucky? Do you really have the information to objectively state it as "simply not true"?
Read the Jepsen tests (old, but they show core problems that aren't fixed):
After reading the links you gave, I’d like to clarify that I meant that edge cases that can cause data loss are not common enough, in my experience, for one to be forced to treat ElasticSearch as an unreliable data store. Anecdotally, I’ve never seen it happen. You’re right, maybe I’m lucky; I would like to see a case study of some project where these issues were significant enough to treat ElasticSearch as an unreliable data store.
I've never built something that did not need a relational database to model the domain, and search is such a common UX element that it's taken for granted in everything you use.
I can't even think of something that could use search that's not relational. Can you?
> I've never built something that did not need a relational database to model the domain
Yeah, so perhaps you don't have much experience from outside the relational model. Makes perfect sense to see a single truth then.
> I can't even think of something that could use search that's not relational. Can you?
Not sure how wide a blanket you're throwing by "not relational", but here are a couple examples that might fit: Google's search engine, most filesystems and their files, hierarchical and flat data structures in general, twitter.
Elasticsearch stores data in a JSON document store style format, so in any situation where you need O(1) lookup for all data related to a particular entity, you can have nested structures within the JSON that facilitate this. In an RDBMS it is O(n), since you have to do a costly join for each embedded structure.
A document store is not a replacement for a relational database, and there are few cases where it's truly a better fit.
More importantly, ES is not a document store. It is a search system that can be used as a similarity index between all kinds of data, most commonly json docs but can be anything like images or audio waveforms.
ES can return none of the source data or not even store it in the first place. In addition, it lacks ACID and transactions and generally has an overall poor reputation for data integrity.
If you just want a json blob, you can already use a relational database column for it and get the key/value performance semantics, while still retaining all the reliability and manageability properties.
If you want to save JSON documents, you can pick Postgres [1] or MySQL and still get a full RDBMS for free on top.
Elasticsearch does that too, but its access certainly isn't O(1) - it has a complexity of O(log n), just like Postgres or MySQL would have, because all of them use tree-like index data structures. (Please note that ElasticSearch is not meant as a data storage layer)
Now, if you just want to have the JSON, you're done. O(log n) for ElasticSearch, Postgres or MySQL. That's it.
But normally, you want to do something with that data. You have a user and want all the products. Or you have a book and want all the authors, because you have a real app to write.
Most data in our world is relational: a student can participate in courses, a store sells products, and so on.
Querying your database using JOINs will make use of the indexes in each of the tables that you have previously set up.
These indexes will prevent the theoretical worst-case (the "Full Table Scan", where your complexity is O(n) for the table). The query planner will decide which index is better to use for the optimal response time.
You can, however, choose not to leverage this feature of an RDBMS. What do you do now if you need all the courses of a student? Do you get all the JSONs and compare their IDs? Do you parse the JSON keys in your own app? Having your app iterate over all the JSON keys already has O(n) complexity. So you end up recreating tree-like indexes anyway, somewhere in your stack.
But all of that is already provided for you, with decades of debugging and performance optimization - for free.
To sum up: Use Postgres JSONB type if you only need to write JSON documents. You might later want to query the nested JSON data structure and join it. I highly recommend "Mastering PostgreSQL in Application Development" [2]
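A small, hypothetical example of both halves of that: store documents now, and still query them relationally later:

    -- JSONB column with a GIN index for containment queries
    CREATE TABLE events (id bigserial PRIMARY KEY, payload jsonb);
    CREATE INDEX events_payload_idx ON events USING gin (payload);

    -- Document-style lookup, served by the GIN index
    SELECT * FROM events WHERE payload @> '{"type": "signup"}';

    -- ...and the nested data can still be joined against a relational table
    SELECT u.name, e.payload->>'ip' AS ip
    FROM events e
    JOIN users u ON u.id = (e.payload->>'user_id')::bigint;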
The comparison was between ElasticSearch and a relational store. PostgreSQL's JSON type is not relational. And if you are fetching an entity by primary key, it should be O(1). The point was that document stores offer you the ability to nest data structures while relational forces you to join. And so if that is your query/data pattern, then a document store can be orders of magnitude faster than a purely relational store, e.g. a 360-degree customer view.
Primary key fetching will always be O(log n) due to how indexes are implemented. Access time complexity will never be independent of the item count in your data store, relational or not.
There's not many use cases I come across where the 'nested' data is the only instance of that data. A student record might have some nested class information, but the class information itself lives as a master someplace else. If you nest things, you now have data living in multiple places, no? Updates need to update multiple locations, which would potentially introduce lots of errors.
I've got a handful of situations where nested structures are making sense, but it doesn't seem to be the norm (at least in the data worlds I work in).
>Is it possible to have production-quality search using just postgresql?
It depends on your search use case. If you are doing fairly simple searches such as trying to find, say, usernames, you can certainly have production grade search using Postgres.
However, if you have to handle searches that are not as targeted, Postgres will not be enough. One example is searching with synonyms - if you want a record containing "automobile" to match queries containing the term "car", ElasticSearch will serve you better.
Postgres is a relational data store and ElasticSearch is an information retrieval system; each focuses on its intended use case. The fact that Postgres supplies some elementary search capabilities does not mean that it can serve a bona fide information retrieval use case well.
Depends on what you call production-quality, but definitely possible.
Simplicity of having it all in 1 system is great, however the search quality itself is usually average and configurability is limited. A real search engine will give you much better relevance rankings, facets, fuzzy search, and other fancy things you can do with a general "similarity index" like recommendations.