Just curious what the state of the art around filtered vector search results is? I took a quick look at the SPFresh paper and didn't see it specifically address filtering.
In any API service, it's better to handle this via dependency injection IMO.
Instantiate all of your metadata once, and then send that logger down, so that anybody who uses that logger is guaranteed to have the right metadata... the time to add logging is not when you are debugging.
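To make that concrete, here is a minimal sketch using Python's standard `logging.LoggerAdapter`. The names `build_logger`, `UserService`, `handle_request`, and the `service`/`request_id` fields are made up for illustration; the point is only that the metadata is bound once and the logger is passed in, so callers cannot forget it.

```python
import io
import logging


def build_logger(stream, **metadata):
    """Create the logger once, with the metadata bound to every record."""
    base = logging.getLogger("api")
    base.setLevel(logging.INFO)
    base.handlers.clear()  # keep the sketch idempotent across calls
    base.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(
            "%(levelname)s service=%(service)s request_id=%(request_id)s %(message)s"
        )
    )
    base.addHandler(handler)
    # LoggerAdapter injects `metadata` as `extra` on every log call.
    return logging.LoggerAdapter(base, metadata)


class UserService:
    """Hypothetical service; it receives its logger instead of creating one."""

    def __init__(self, logger):
        self._log = logger

    def fetch(self, user_id):
        self._log.info("fetching user %s", user_id)
        return {"id": user_id}


def handle_request(stream):
    # Assumed metadata fields; in a real service these come from the request.
    logger = build_logger(stream, service="users", request_id="req-123")
    return UserService(logger).fetch(42)


if __name__ == "__main__":
    buffer = io.StringIO()
    handle_request(buffer)
    print(buffer.getvalue(), end="")
```

Anything constructed with that logger emits the same fields without having to know about them.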
I don't disagree that rock solid is a good choice, but there is a ton of innovation necessary for data stores.
Especially in the context of embedding search, which this article is also trying to do. We need databases that can efficiently store and query high-dimensional embeddings, and that also handle the nuances of real-world applications, such as filtered ANN. There is a ton of innovation in this space and it's crucial to powering the next generation of architectures at just about every company out there. At this point, data stores are becoming a bottleneck for serving embedding search, and I cannot overstate how important advancements in this area are for enabling these solutions. This is why there is an explosion of vector databases right now.
This article is a great example of where the actual data-providers are not providing the solutions companies need right now, and there is so much room for improvement in this space.
I do not think data stores are a bottleneck for serving embedding search. I think the raft of new-fangled vector db services (or pgvector or whatever) can be a bottleneck because they are mostly optimized around the long tail of pretty small data. Real internet-scale search systems like ES or Vespa won’t struggle with serving embedding search assuming you have the necessary scale and time/money to invest in them.
* Filterable ANN certainly decomposes into pre- and post-filtering, and there is definitely a lot of interesting innovation occurring around filterable ANN. But large-scale search systems currently do a pretty good job with pre-filtering, falling back to brute force search in the case of restrictive filters.
* You'd have to be a bit more exact re: dynamic updates/versioning for me to understand the challenges you're facing.
* Building graph indices can be slow, but in my experience (billions of embeddings) it is possible to build HNSW indices in tens of minutes.
* How is this any different to combining traditional keyword search with, say, recency boosting?
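For anyone following along, the "pre-filter, then fall back to brute force when the filter is restrictive" idea from the first bullet can be sketched roughly like this. It is a toy under stated assumptions, not how ES or Vespa implement it: the 5% threshold, the function names, and the `ann_search` callable are all hypothetical, and the demo plugs in an exact scan where a real system would use a graph index.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force(query, vectors, allowed, k):
    """Exact scan over only the ids that passed the filter."""
    ranked = sorted(allowed, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]


def filtered_search(query, vectors, metadata, predicate, k, ann_search,
                    brute_force_below=0.05):
    """Pre-filter first, then choose a search strategy by filter selectivity."""
    allowed = [i for i, meta in enumerate(metadata) if predicate(meta)]
    selectivity = len(allowed) / len(vectors) if vectors else 0.0
    if selectivity < brute_force_below:
        # Restrictive filter: few candidates, so an exact scan is cheap.
        return "brute_force", brute_force(query, vectors, allowed, k)
    # Permissive filter: hand the allow-list to the (hypothetical) ANN index.
    return "ann", ann_search(query, set(allowed), k)


def exact_stand_in_for_ann(vectors):
    """Demo-only placeholder; a real system would query an ANN index here."""
    def search(query, allowed, k):
        return brute_force(query, vectors, sorted(allowed), k)
    return search


def make_demo_corpus():
    # 100 toy vectors; a lower id is closer to the query (1.0, 0.0).
    vectors = [(1.0, i / 100) for i in range(100)]
    metadata = [{"lang": "fr" if i in (10, 90) else "en"} for i in range(100)]
    return vectors, metadata
```

With this corpus, a `lang == "fr"` filter matches 2% of the documents and takes the brute-force path, while `lang == "en"` matches 98% and goes to the ANN callable.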
You might be missing my argument here - I stated that there are workable solutions to this, as you have pointed out.
But ANN search is still a sledgehammer, and building out hybrid solutions that help bridge the gap between it and traditional data stores still has room for innovation.
Fair enough - agreed there's lots of interesting innovations here - but my point is that semantic search and its associated issues don't really differ that much from other types of search problems at scale, and I therefore don't think that the current crop of vector database products add a lot of value from a technical perspective (perhaps they do from an ease-of-use perspective; or they work great at small scale, etc. etc.)
Oh, then you must have the secret sauce that allows scaling ES vector search beyond 10,000 results without requiring infinite RAM. I know their forums would welcome it, because that question comes up a lot.
Or I guess that's why you included the qualifier about money to invest
Would you mind putting aside the snark? I have a couple questions. How large is the corpus? I am also curious about the use-case for top-k ANN, k > 10000?
Not the person you asked, but at work (we are a CRM platform) we allow our clients to arbitrarily query their userbase to find matching users for marketing campaigns (email, SMS, WhatsApp). These campaigns can sometimes target a few hundred thousand people. We are on a really ancient version of ES, but it sucks at this job in terms of throughput. Some experimenting with BigQuery indicates it is so much better at mass exporting.
Fair; my question was mostly in the context of ANN, since that was the discussion point - I have to assume ES (as a search engine) would not necessarily be the right tool for data warehousing types of workloads.
A fascinating memoir by a philosopher turned brain surgeon, facing a terminal cancer diagnosis. A person who spent their entire life pondering the morality of life being faced with their own ultimatum.
I reread it once a year, at minimum. A deeply moving book.
If you are looking for just pure vector similarity search, there are many alternatives. But Vespa's tensor support, multi-vector indexing, and the ability to express models like ColBERT (1) or cross-encoders make it stand out if you need to move beyond pure vector search support.
Plus, for RAG use cases, it's a full-blown text search engine as well, allowing hybrid ranking combinations. Also, with many pure vector databases like Pinecone, you cannot describe an object with more than one vector; if you have different vector models for the object, you need different indexes, and then you have to duplicate metadata across those indexes (if you need filtering + vector search).
Yeah, my assumption was that something in some layer of their application isn’t well optimized when asked to return posts from a subreddit that has “gone dark” in whatever fashion the mods chose to do that.
For example, maybe it causes reads from the database to take a lot longer than they normally would, locking up the database or causing the process to crash (again, that’s just pure speculation).
One I've been wondering about is user overview pages. People use those a lot (it's actually my bookmark for getting onto reddit) and yesterday I noticed that a post I made wasn't in my overview, and it's because that sub had gone dark early.
What happens when a user has 99% of their posting in subs that are now hidden, and the API is programmed to produce a fixed 30 comments of history on the overview page? The answer is extremely deep database pulls... you might pull a year of comment history to get 30 comments that aren't hidden. And depending on how they do that, it may actually pull the whole comment history for that timespan, since most of the time posts aren't hidden like this.
I worked on a backend team with some very overburdened legacy tables in mongo, and this is the kind of thing we'd think about. Yeah, you can use an index, but then you have to maintain the index for every record, and change it every time a sub goes private/public (and we literally were hitting practical limits on how many indexes we could keep; we finally instituted a 1-in-1-out rule). And how often does that happen? Even deleted comments are overall probably a minority such that indexes don't matter, but this is relational data: you have to know which subreddits are closed before you can filter their results, and mongo sucks at joins. And the mongo instance can become a hotspot, so you just filter it in the application instead for those "rare" instances. Even if they are doing it in mongo, the index/collection they're joining may suddenly be 100x the size, which could blow stuff up anyway.
edit: for me, one overview page is now taking me back one month in comment history. And I comment a lot on subs that are currently closed, so it could easily be throwing away 5-10 comments for every comment it displays.
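To put rough numbers on the "deep pull" scenario, here's a toy simulation. It assumes the overview is built by reading comment history newest-first in fixed batches and dropping hidden subs in the application, which is my guess and not anything known about reddit's actual code. `fetch_history`, `build_overview`, the batch size of 100, and the sub names are all made up.

```python
def fetch_history(comments, batch_size):
    """Hypothetical stand-in for a paginated, newest-first database read."""
    for start in range(0, len(comments), batch_size):
        yield comments[start:start + batch_size]


def build_overview(comments, hidden_subs, page_size=30, batch_size=100):
    """Filter in the application and keep reading until the page is full."""
    visible = []
    rows_read = 0
    for batch in fetch_history(comments, batch_size):
        rows_read += len(batch)
        visible.extend(c for c in batch if c["sub"] not in hidden_subs)
        if len(visible) >= page_size:
            break
    return visible[:page_size], rows_read


def make_history(total=10_000):
    # 1 comment in 100 is in a sub that stays open; the rest are in one that goes dark.
    return [
        {"id": i, "sub": "open_sub" if i % 100 == 0 else "dark_sub"}
        for i in range(total)
    ]


if __name__ == "__main__":
    history = make_history()
    _, normal = build_overview(history, hidden_subs=set())
    _, blackout = build_overview(history, hidden_subs={"dark_sub"})
    print(f"rows read normally: {normal}, during blackout: {blackout}")
```

With 99% of the history hidden, filling one page of 30 reads 3,000 rows instead of 100, and that is per page view, per user.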
I'm guessing a hit on an open subreddit mostly goes directly out of the caching layer, while a hit on a private one incurs a DB hit to check whether the user belongs there.