Super Fuzzy Searching on PostgreSQL

aw2611 · on March 22, 2015

For people interested in implementing full-text search with PostgreSQL this could also help : "Postgres full-text search is Good Enough!" [0]

[0] http://blog.lostpropertyhq.com/postgres-full-text-search-is-...

mark_l_watson · on March 22, 2015

That is a good article. I referenced it last week when I was converting a prototype Clojure app from using an embedded database and also Lucene to using just Postgres. Postgres is like a Swiss army knife, lots of functionality :-)

chdir · on March 22, 2015

Past discussion on the above blog with author's comments : https://news.ycombinator.com/item?id=8381748

radiowave · on March 22, 2015

One thing that can help simplify the creation of the wordlist table is to instead use a materialized view (with a gin or gist trigram index), and just refresh it on some schedule.

Of course, this means that your fuzzy search capability will lag somewhat behind live data in the case where a new distinct word has just been added to the corpus, but in many cases this is acceptable: people looking for recently created records will most likely have some exact reference for them, and will only resort to using fuzzy search to find older records for which their reference is rather more, well, fuzzy.

elchief · on March 22, 2015

And indexes are faster in 9.4 .

See also the smlar extension (by one of the text search authors). Let's you do TFIDF

I'd kill for BM25 and Solr style filters though...

radiowave · on March 22, 2015

It appears the author was only looking at gist indexes (which AFAICT haven't got faster in 9.4), rather than gin indexes (which have).

Gin indexes can certainly be used with the trigram extension, in exactly the same way as the gist ones, and you would generally tend to expect gin to be faster for queries, but slower for updates and inserts, compared to gist, though I've never done a serious comparison on trigram data.

pgaddict · on March 22, 2015

In my experience the wisdom "GIN is slower to update than GiST" is mostly obsolete nowadays. At least that's what my benchmark results suggest http://blog.pgaddict.com/posts/performance-since-postgresql-...

It's also a bit shortsighted to look at the index maintenance in isolation - when your workload is 99% reads, and queries with GIN indexes are way faster, it's probably more efficient to use GIN indexes.

Moreover, there are often more expensive parts when inserting/updating data (either inside the database, or in the application), so the index maintenance is not really the main problem.

I'm sure there are workloads where it's still true, though.

radiowave · on March 22, 2015

Right. I've had one case where index rebuild times kind of mattered, and gist was definitely a lot faster at rebuilding, though not enough to make it worth taking the hit on read performance. But this was pre-9.4, so it may be rather different now.

The nature of data being indexed is probably the biggest determinant: gin is really happy when you have a small number of distinct values across a large dataset; I guess at the opposite end of the scale gist might come into it's own.

frik · on March 22, 2015

> I'd kill for BM25 and Solr style filters though...

In the meantime, I can recommend Sphinx Search: http://en.wikipedia.org/wiki/Sphinx_(search_engine)

(I am not related to it, just a happy user)

MetaCosm · on March 22, 2015

Indeed, <10 millisecond search results on 14TB is amazing (absolutely wrecked what we managed to get out of any other indexer - order of magnitude faster). Honestly, Sphinx is a little ugly to setup -- but it is an absolutely amazing piece of technology.

We are currently in the process of shifting everything to RT indexes and have been absolutely delighted with the new beta features.