Out of curiosity, how is the 92% recall calculated? For a given query, is recall measured against the true top-k of all 100B vectors, or is it measured at each of the N shards against the top-k of that respective shard?
(author here) The 92% mentioned in this post is recall@10 across all 100B vectors, calculated by comparing against the global top_k.
turbopuffer will also continuously monitor production recall at the per-shard level (or on-demand with https://turbopuffer.com/docs/recall). Perhaps counterintuitively, global recall will actually be better than per-shard recall if each shard is asked for its own local top_k: the merge considers N * k candidates for k final results, and the true global top-k items sit near the very top of their shard's local ranking, where an approximate index is least likely to miss them.
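To make the calculation concrete, here's a small self-contained sketch of both measurements. This is a toy model of my own, not turbopuffer's implementation: the sizes and the noisy ann_topk stand-in for an ANN index are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    K, N_SHARDS = 10, 4

    # Toy setup: one query's similarity scores against 40k vectors,
    # split evenly across 4 shards.
    scores = rng.exponential(1.0, 40_000)
    shards = np.array_split(np.arange(scores.size), N_SHARDS)

    def exact_topk(ids, k=K):
        # True top-k among `ids`, ranked by real score.
        return ids[np.argsort(scores[ids])[-k:]]

    def ann_topk(ids, k=K, noise=0.3):
        # Stand-in for an ANN index: ranks by perturbed scores, so items
        # near the top-k boundary (small margins) are the ones it misses.
        noisy = scores[ids] + rng.normal(0.0, noise, ids.size)
        return ids[np.argsort(noisy)[-k:]]

    def recall(found, truth):
        return len(set(found) & set(truth)) / len(truth)

    per_shard = [ann_topk(s) for s in shards]

    # Per-shard recall: each shard's ANN result vs. that shard's exact top-k.
    shard_recall = np.mean([recall(r, exact_topk(s))
                            for r, s in zip(per_shard, shards)])

    # Global recall@10: merge the N*k candidates, re-rank by real score,
    # keep the best k, compare against the exact top-k over all vectors.
    merged = exact_topk(np.concatenate(per_shard))
    print(f"mean per-shard recall: {shard_recall:.2f}")
    print(f"global recall@{K}: {recall(merged, exact_topk(np.arange(scores.size))):.2f}")

With this error model the global number consistently comes out higher than the per-shard one, for the reason above.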
I've seen a decent amount of production use of pgvector HNSW from our customers on GCP, but as the author noted, it's not without flaws, and deployments typically stay in the smallish range (0-10M vectors) due to the system characteristics he pointed out, i.e. build times and memory use. The tradeoffs to consider are whether you want to ETL data into yet another system and deal with operational overhead, eventual consistency, and application-level logic to join vector search results with the rest of your operational data. Whether those tradeoffs are worth it really depends on your business requirements.
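For reference, the knobs involved look roughly like this with pgvector. A minimal sketch, assuming pgvector >= 0.5 (HNSW support) plus the psycopg and pgvector Python packages; the DSN, table, and column names are made up:

    import numpy as np
    import psycopg                      # pip install "psycopg[binary]" pgvector
    from pgvector.psycopg import register_vector

    conn = psycopg.connect("dbname=app", autocommit=True)  # hypothetical DSN
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)

    conn.execute("""CREATE TABLE IF NOT EXISTS items
                    (id bigserial PRIMARY KEY, embedding vector(768))""")

    # The pain points at larger scales: the HNSW graph is built within
    # maintenance_work_mem, and build time grows with m/ef_construction.
    conn.execute("SET maintenance_work_mem = '2GB'")
    conn.execute("""CREATE INDEX IF NOT EXISTS items_embedding_idx
                    ON items USING hnsw (embedding vector_cosine_ops)
                    WITH (m = 16, ef_construction = 64)""")

    # Query-time recall/latency knob: higher ef_search = better recall, slower.
    conn.execute("SET hnsw.ef_search = 100")
    q = np.random.rand(768).astype(np.float32)
    top10 = conn.execute(
        "SELECT id FROM items ORDER BY embedding <=> %s LIMIT 10", (q,)
    ).fetchall()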
And if one needs transactional/consistency semantics, hybrid/filtered search, low latencies, etc., consider a SOTA Postgres system like AlloyDB with AlloyDB ScaNN, which has better scaling/performance (1B+ vectors), enhanced query optimization (adaptive pre-/post-/in-filtering), and improved index operations.
Full disclosure: I founded ScaNN in GCP databases and currently lead AlloyDB Semantic Search. And all these opinions are my own.
The alternative is to find solutions that can reasonably support different requirements, because business needs change all the time, especially in the current state of our industry. From what I've seen, OSS Postgres/pgvector can adequately support a wide variety of requirements for millions to low tens of millions of vectors: low latencies, hybrid search, filtered search, the ability to serve out of memory and disk, and strong-consistency/transactional semantics with operational data. For further scaling/performance (1B+ vectors and even lower latencies), consider a SOTA Postgres system like AlloyDB with AlloyDB ScaNN.
Full disclosure: I founded ScaNN in GCP databases and am the lead for AlloyDB Semantic Search. And all these opinions are my own.
One clarification question - the blog post lists "lack of ACID transactions and MVCC can lead to data inconsistencies and loss, while its lack of relational properties and real-time consistency makes many database queries challenging" as the bad for Elasticsearch. What is pg_bm25's consistency model? It had previously been described as offering "weak consistency" [0], which I interpret to mean it has the same problems with transactions, MVCC, etc.?
Strong consistency adds a small overhead to transactions. The exact number depends on how much you're inserting and indexing, but it's in the milliseconds.
We have adopted strong consistency because we've observed that most of our customers run ParadeDB pg_search in a separate instance from their primary Postgres, so they can tune the instance differently and pick more optimized hardware. Data is sent from the primary Postgres to the ParadeDB instance via logical replication, which makes it Zero-ETL. In this model, no application transactions execute against the ParadeDB instance, only search queries, so the overhead is a non-issue.
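The wiring here is plain Postgres logical replication. A minimal sketch of the setup described above; the hostnames, table, and publication/subscription names are all hypothetical:

    import psycopg

    # On the primary (requires wal_level = logical): publish the tables
    # that should be searchable.
    with psycopg.connect("host=primary dbname=app") as primary:
        primary.execute("CREATE PUBLICATION search_pub FOR TABLE documents")

    # On the ParadeDB instance: the table schema must already exist there
    # (logical replication does not copy DDL). CREATE SUBSCRIPTION cannot
    # run inside a transaction block, hence autocommit.
    paradedb = psycopg.connect("host=paradedb dbname=app", autocommit=True)
    paradedb.execute("""CREATE SUBSCRIPTION search_sub
                        CONNECTION 'host=primary dbname=app user=replicator'
                        PUBLICATION search_pub""")
    # From here, writes on the primary stream to the ParadeDB copy, where a
    # pg_search BM25 index serves the search queries.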
We had previously adopted weak consistency because we expected our customers to run pg_search inside their primary Postgres database, but we observed that this is not what they wanted, specifically among our mid-market/enterprise customers. We may eventually bring back weak consistency as an optional feature, since it would enable faster ingestion.