Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Super Fuzzy Searching on PostgreSQL (bartlettpublishing.com)
123 points by dhruvbhatia on March 22, 2015 | hide | past | favorite | 10 comments


For people interested in implementing full-text search with PostgreSQL this could also help : "Postgres full-text search is Good Enough!" [0]

[0] http://blog.lostpropertyhq.com/postgres-full-text-search-is-...


That is a good article. I referenced it last week when I was converting a prototype Clojure app from using an embedded database and also Lucene to using just Postgres. Postgres is like a Swiss army knife, lots of functionality :-)


Past discussion on the above blog with author's comments : https://news.ycombinator.com/item?id=8381748


One thing that can help simplify the creation of the wordlist table is to instead use a materialized view (with a gin or gist trigram index), and just refresh it on some schedule.

Of course, this means that your fuzzy search capability will lag somewhat behind live data in the case where a new distinct word has just been added to the corpus, but in many cases this is acceptable: people looking for recently created records will most likely have some exact reference for them, and will only resort to using fuzzy search to find older records for which their reference is rather more, well, fuzzy.


And indexes are faster in 9.4 .

See also the smlar extension (by one of the text search authors). Let's you do TFIDF

I'd kill for BM25 and Solr style filters though...


It appears the author was only looking at gist indexes (which AFAICT haven't got faster in 9.4), rather than gin indexes (which have).

Gin indexes can certainly be used with the trigram extension, in exactly the same way as the gist ones, and you would generally tend to expect gin to be faster for queries, but slower for updates and inserts, compared to gist, though I've never done a serious comparison on trigram data.


In my experience the wisdom "GIN is slower to update than GiST" is mostly obsolete nowadays. At least that's what my benchmark results suggest http://blog.pgaddict.com/posts/performance-since-postgresql-...

It's also a bit shortsighted to look at the index maintenance in isolation - when your workload is 99% reads, and queries with GIN indexes are way faster, it's probably more efficient to use GIN indexes.

Moreover, there are often more expensive parts when inserting/updating data (either inside the database, or in the application), so the index maintenance is not really the main problem.

I'm sure there are workloads where it's still true, though.


Right. I've had one case where index rebuild times kind of mattered, and gist was definitely a lot faster at rebuilding, though not enough to make it worth taking the hit on read performance. But this was pre-9.4, so it may be rather different now.

The nature of data being indexed is probably the biggest determinant: gin is really happy when you have a small number of distinct values across a large dataset; I guess at the opposite end of the scale gist might come into it's own.


> I'd kill for BM25 and Solr style filters though...

In the meantime, I can recommend Sphinx Search: http://en.wikipedia.org/wiki/Sphinx_(search_engine)

(I am not related to it, just a happy user)


Indeed, <10 millisecond search results on 14TB is amazing (absolutely wrecked what we managed to get out of any other indexer - order of magnitude faster). Honestly, Sphinx is a little ugly to setup -- but it is an absolutely amazing piece of technology.

We are currently in the process of shifting everything to RT indexes and have been absolutely delighted with the new beta features.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: