Our top_k search in our own product is in Go and it's ~25 lines of code. Super simple stuff: it does the job of finding the top_k in 1ms, no index, no framework, nothing. We benefit from the corpus we embed being relatively static, so we can just store it in Redis; it's not a hard problem to solve. Anyway, you often don't need a turnkey solution for this stuff.
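For reference, a minimal sketch of that kind of scan in Go (not the parent's actual code; assumes the query and the stored embeddings are pre-normalized, so a plain dot product is the cosine similarity):

    package search

    import "sort"

    type hit struct {
        id    int
        score float32
    }

    // topK scores every vector against the query with a dot product and
    // returns the k best matches. With pre-normalized vectors the dot
    // product equals the cosine similarity.
    func topK(query []float32, corpus [][]float32, k int) []hit {
        hits := make([]hit, 0, len(corpus))
        for i, v := range corpus {
            var s float32
            for j := range v {
                s += query[j] * v[j]
            }
            hits = append(hits, hit{id: i, score: s})
        }
        sort.Slice(hits, func(a, b int) bool { return hits[a].score > hits[b].score })
        if k > len(hits) {
            k = len(hits)
        }
        return hits[:k]
    }

Score, sort, slice: for a corpus that fits comfortably in memory (or, as above, in Redis), that's the whole search path.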
Same here. I just asked ChatGPT to generate Go code to do cosine similarity, and I was surprised that the code it generated worked; the worst performance I get is like 10ms.
Yep. If you parallelize with goroutines it'll be faster, but we're talking about shaving a few ms. The effectiveness of a for loop crunching floats in a single thread is not to be underestimated.
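For what it's worth, sharding the scan across goroutines is only a few more lines. A sketch, reusing the hit type and topK function from the sibling sketch above (needs "sort" and "sync"):

    // topKParallel splits the corpus into per-worker shards, computes a
    // partial top-k for each shard concurrently, then merges the partials
    // and keeps the global top k.
    func topKParallel(query []float32, corpus [][]float32, k, workers int) []hit {
        chunk := (len(corpus) + workers - 1) / workers
        partials := make([][]hit, workers)
        var wg sync.WaitGroup
        for w := 0; w < workers; w++ {
            lo, hi := w*chunk, (w+1)*chunk
            if lo >= len(corpus) {
                break
            }
            if hi > len(corpus) {
                hi = len(corpus)
            }
            wg.Add(1)
            go func(w, lo, hi int) {
                defer wg.Done()
                part := topK(query, corpus[lo:hi], k)
                for i := range part {
                    part[i].id += lo // shift shard-local ids back to corpus ids
                }
                partials[w] = part
            }(w, lo, hi)
        }
        wg.Wait()
        var all []hit
        for _, p := range partials {
            all = append(all, p...)
        }
        sort.Slice(all, func(a, b int) bool { return all[a].score > all[b].score })
        if k > len(all) {
            k = len(all)
        }
        return all[:k]
    }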
Size can be up to ~140MB of vectors, so not exactly enormous. But it's nowhere near enough to stress a single node running a for loop on the data and doing a dot product, then sorting by relevancy.
It shouldn't be too hard to extend this to a fairly efficient exact k-nearest neighbor search (at least for Euclidean distance).
The classic approach is to keep your k candidates in a max-heap ordered by distance, replacing the farthest candidate whenever the heap is full and you find a nearer item in the leaves. After visiting the near side of the splitting hyperplane, you visit the far side only if the distance to the hyperplane is less than the distance to the farthest item in the heap, or if you don't yet have k candidates.
A way to visualize it is that your k candidates define a ball (with radius defined by the worst candidate) around the search point that shrinks as you find better candidates. Then you recursively visit all the nodes in the tree that the ball still touches. Any nodes outside the ball can't possibly improve on the candidates within it and so can be excluded.
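A rough Go sketch of that traversal, assuming the kd-tree is already built (the Node layout and names are illustrative, not from any particular library; distances are kept squared so no square roots are needed):

    package knn

    import "container/heap"

    // Node is one kd-tree node: the stored point, the axis it splits on,
    // and its two subtrees.
    type Node struct {
        Point       []float64
        Axis        int
        Left, Right *Node
    }

    type candidate struct {
        point []float64
        dist  float64 // squared Euclidean distance to the query
    }

    // maxHeap keeps the current candidates with the *farthest* one on top,
    // so it can be evicted cheaply when a closer point turns up.
    type maxHeap []candidate

    func (h maxHeap) Len() int           { return len(h) }
    func (h maxHeap) Less(i, j int) bool { return h[i].dist > h[j].dist }
    func (h maxHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
    func (h *maxHeap) Push(x any)        { *h = append(*h, x.(candidate)) }
    func (h *maxHeap) Pop() any {
        old := *h
        x := old[len(old)-1]
        *h = old[:len(old)-1]
        return x
    }

    func sqDist(a, b []float64) float64 {
        var s float64
        for i := range a {
            d := a[i] - b[i]
            s += d * d
        }
        return s
    }

    // Search returns the exact k nearest neighbors of query, nearest first.
    func Search(root *Node, query []float64, k int) [][]float64 {
        h := &maxHeap{}
        visit(root, query, k, h)
        out := make([][]float64, h.Len())
        for i := h.Len() - 1; i >= 0; i-- {
            out[i] = heap.Pop(h).(candidate).point
        }
        return out
    }

    func visit(n *Node, query []float64, k int, h *maxHeap) {
        if n == nil {
            return
        }
        // Consider the point stored at this node.
        if d := sqDist(query, n.Point); h.Len() < k {
            heap.Push(h, candidate{n.Point, d})
        } else if d < (*h)[0].dist {
            heap.Pop(h)
            heap.Push(h, candidate{n.Point, d})
        }
        // Go down the near side first, then decide whether the far side
        // can be pruned.
        diff := query[n.Axis] - n.Point[n.Axis]
        near, far := n.Left, n.Right
        if diff > 0 {
            near, far = n.Right, n.Left
        }
        visit(near, query, k, h)
        // The far side can only help if the ball around the query (radius =
        // distance to the worst candidate) still crosses the splitting
        // hyperplane, or if we don't yet have k candidates.
        if h.Len() < k || diff*diff < (*h)[0].dist {
            visit(far, query, k, h)
        }
    }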
In graphics rendering, this sort of thing was the approach used by photon mapping [0] to find the closest photons for density estimation.
Does that scale to high dimensions? IIUC the photon mapping procedure typically uses a kd-tree to find closer/farther candidates, and if you consider something like ada embeddings (1536 dim), that would be very computationally intense. Photon mapping feels like a 3D problem.
This random hyperplane algorithm is actually pretty elegant. It has the following amazing property: as your dimension increases, the quality of the random hyperplanes actually gets better!
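For illustration, the signature step is tiny; a sketch assuming hyperplanes through the origin and at most 64 bits per signature (needs "math/rand"):

    // randomPlanes draws nBits random hyperplane normals. Gaussian
    // components give uniformly random orientations.
    func randomPlanes(nBits, dim int) [][]float64 {
        planes := make([][]float64, nBits)
        for i := range planes {
            planes[i] = make([]float64, dim)
            for j := range planes[i] {
                planes[i][j] = rand.NormFloat64()
            }
        }
        return planes
    }

    // signature sets bit i if v falls on the positive side of hyperplane i.
    // Vectors with a small angle between them agree on most bits, so the
    // Hamming distance between signatures approximates the angle.
    func signature(v []float64, planes [][]float64) uint64 {
        var sig uint64
        for i, p := range planes {
            var dot float64
            for j := range v {
                dot += v[j] * p[j]
            }
            if dot >= 0 {
                sig |= uint64(1) << uint(i)
            }
        }
        return sig
    }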
It is inefficient to build (unless on GPU), but good for search. Moreover, the search naturally benefits from data locality after ordering - this allows using MergeTree tables in ClickHouse. See https://news.ycombinator.com/item?id=36341754
That's the thing though, it's very easy to perform approximate nearest neighbor search in an efficient way. There are plenty of simple solutions for that. The real challenge is to design an algorithm that can scale & that remains robust even when there are additions and deletions to the set.
> The approach we use here is based on a family of algorithms called "Locality Sensitive Hashing" used in the popular library annoy. This article's goal isn't to introduce a new fancy algorithm/library but to describe how vector search works using real code snippets.
Always a pleasure to see short implementations, but I'd still take HNSW over such an approach. Its core implementation can also be quite compact. In our case with USearch [1] it's about 2k LOC, even with a custom implementation of priority queues and custom numeric types like uint40_t for indexes beyond 4B points.
Obviously, together with bindings for Rust, Python, JavaScript, Java, Objective-C, Swift, and Wolfram it gets slightly larger, but on the bright side, you can pass JIT-compiled functions and store/search elements of different dimensionality, which allows much broader applicability [2].
Why? Sorry, but your comment is not very insightful without more explanation, and the rest of your comment is just an advertisement for your (admittedly impressive looking) work.
Because (A) it allows similarity search between auxiliary objects, not just equidimensional vectors, and (B) can be implemented in a similarly short fashion, while providing great performance.
FWIW exhaustive search is still probably good enough for most use cases. IMO the exhaustive search should use a heap (https://github.com/fennel-ai/fann/blob/main/src/main.rs#L12); since you're only looking for the top k, it reduces time and memory complexity greatly. On a relatively unoptimized Golang implementation (though much beefier hardware) I get ~100ms to process 1M vectors of 300 dim. Still quite a bit slower than approximate, of course, but in absolute terms probably good enough for most use cases, especially because many use cases don't have 1M vectors.
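To sketch the heap variant in Go (illustrative, not the linked Rust code; needs "container/heap"): instead of scoring everything and sorting, keep a size-k min-heap of the best scores seen so far, which is O(n log k) time and O(k) extra memory rather than O(n log n) and O(n):

    type scored struct {
        id    int
        score float32
    }

    // minHeap holds the k best hits seen so far, with the *worst* of them
    // on top so it can be evicted in O(log k).
    type minHeap []scored

    func (h minHeap) Len() int           { return len(h) }
    func (h minHeap) Less(i, j int) bool { return h[i].score < h[j].score }
    func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
    func (h *minHeap) Push(x any)        { *h = append(*h, x.(scored)) }
    func (h *minHeap) Pop() any {
        old := *h
        x := old[len(old)-1]
        *h = old[:len(old)-1]
        return x
    }

    func topKHeap(query []float32, corpus [][]float32, k int) []scored {
        h := &minHeap{}
        for i, v := range corpus {
            var s float32
            for j := range v {
                s += query[j] * v[j]
            }
            if h.Len() < k {
                heap.Push(h, scored{i, s})
            } else if s > (*h)[0].score {
                heap.Pop(h)
                heap.Push(h, scored{i, s})
            }
        }
        // Pop worst-first and fill from the back to return best-first.
        out := make([]scored, h.Len())
        for i := h.Len() - 1; i >= 0; i-- {
            out[i] = heap.Pop(h).(scored)
        }
        return out
    }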
I recently presented this approach, exhaustive search with a heap, then iteratively added simple indices, using ClickHouse. It works pretty well given how simple it is to use.
Nice. If I'm reading this right, you do random projection as a first-stage filter, then exact search over the filtered subset? That's clever. Did you test precision/recall at all?
Yes, this is the idea. I set the threshold for approximate search with a margin so that on a few test queries, I get 100% recall, and then it is filtered by precise search.
But I didn't evaluate it thoroughly. The threshold for Hamming distance over bit masks was set almost arbitrarily, and it will be interesting to test further.
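For illustration, that kind of first-stage filter can be as small as this (hypothetical names; needs "math/bits"), with the exact scan then run only over the ids that survive:

    // prefilter keeps the ids whose bit signature is within `threshold`
    // bits of the query's signature. The threshold is the margin the
    // parent describes tuning so that recall stays at 100% on test queries.
    func prefilter(querySig uint64, sigs []uint64, threshold int) []int {
        var ids []int
        for i, s := range sigs {
            // Hamming distance = number of differing bits.
            if bits.OnesCount64(querySig^s) <= threshold {
                ids = append(ids, i)
            }
        }
        return ids
    }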
The hyperplanes are randomized to ensure they are probabilistically independent and to cope with high dimensionality, so the data are less likely to be clustered into the same bucket.
Nice, this is a cool version of ANN search. I like that at the end there is commentary on what's needed for production as well: things like parallelization, RAM considerations, and balancing the trees. It's really the production-level considerations that would steer you toward a vector database like Milvus/Zilliz.
No, this is only true if that hyperplane contains the origin. Imagine the hyperplanes that contain A and B; there are infinitely many of them for dimensions higher than 2. Now imagine the same for hyperplanes connecting O and C. Most of those AB hyperplanes are not orthogonal to those OC hyperplanes; it's only a coincidence if they are, though for dimensions higher than 2 you can always find a point C that happens to lie along the orthogonal line from the AB plane to the origin.
No, it's more restricted than that. Think of 2D planes in 3-space: there are infinitely many planes you can construct through a line from A to C (they're all rotations of a plane extending out from that line).
Okay I'm trying to map this visually in my brain. So, in 3 space, say I extrude a random plane out from a line segment. I keep going, now I imagine I've extruded every possible plane approaching infinity. This eventually forms... a cylinder, no? So a hyperplane is what, a cylinder? Or is it shorthand for "every possible plane..."
In an n-dimensional space, an infinite (n-1)-dimensional thing is a hyperplane. Or put more simply: it's something that divides the space in two. In 1 dimension that's a point, in 2 dimensions it's a line, in 3 dimensions it's a plane (hence the name), for more dimensions we just call it a hyperplane.
Now there are various ways to define which hyperplane you speak about. The easiest to reason about is to give a couple points that are on the hyperplane (1 point defines a point, a line can be defined by any two points on that line, a plane by three points on that plane etc.). A slightly weirder but equally useful way is to say "it's orthogonal to this line, and intersects this point". That's easy to reason about in three dimensions, but apparently works in higher dimensions as well.
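Concretely, the "orthogonal to this line, intersects this point" definition is also the cheapest one to compute with, which is why the hyperplane splits in LSH/kd-tree code boil down to a single dot product. A tiny sketch (illustrative names):

    // side reports which half-space x falls in, for the hyperplane that is
    // orthogonal to `normal` and passes through the point `p`.
    // Positive: one side; negative: the other; zero: on the hyperplane.
    // Works in any number of dimensions.
    func side(normal, p, x []float64) float64 {
        var s float64
        for i := range normal {
            s += normal[i] * (x[i] - p[i])
        }
        return s
    }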
I couldn't find this anywhere in the article: how long does it take to add all 10,000 embeddings to the index? That seems quite important for many applications.
We did some benchmarking and found that an index of 1M vectors using 15 trees can be built up in ~115 seconds (8 CPU, 8 GB RAM). https://github.com/fennel-ai/fann/pull/8.
Given this, around a second to build an index of 10K items sounds about right.
The article isn't meant to present a new library competing with Annoy or other state of the art approaches, but instead, describe how vector search works using short code snippets.
I've now updated the article to clearly mention this and also cite annoy.
My favorite kind of pedantry is pedantry that's naive - on like day 1 of topology 101 you learn that the "dot" product is an inner product and every inner product naturally induces a metric so every inner product space is a metric space (with the induced metric).
It's not pedantry. Just because an inner product induces a metric doesn't make it the same thing as that metric. The performance of similarity searches can depend very much on whether you use a dot product or a distance.
A distance is a function d(x, y) with d(x, y) = 0 whenever x = y (among other properties). If we let d(x, y) = x*y, then for x = y with x != 0 we get d(x, y) = |x|^2 != 0, so the dot product is certainly not a distance. Your comment isn't relevant, but thanks for calling me a pedant. ;-)
Given normalized vectors, cosine similarity is also a monotone (decreasing) function of the squared Euclidean distance: ||u - v||^2 = 2 - 2*(u*v). Meaning for ranking (and thus k-nearest-neighbors) it doesn't matter whether you use Euclidean distance or cosine similarity; you get the same result.
Unit vectors have a normalized length, so the only difference between two unit vectors is the "heading". The angle between them can be interpreted as the "difference", or distance, between them.
Since you can compose any vector in the space out of unit vectors, you can extend the concept. See the comment above: inner product -> metric space.
A proper distance function would be (1 - |a*b|). That's proportional to the distance between those points as projected onto the plane spanned along the circumference of the circle/sphere/hypersphere.
The dot product of two unit vectors is the cosine of the angle between them. cos^-1(u*v) is that angle, i.e. the geodesic distance between u and v regarded as points on the sphere, and that is a distance. But the dot product itself is not.
If u = v, then u*v = 1, but the distance from a point to itself should always be zero. Remember also that u*v = 0 if u and v are perpendicular; if neither u nor v is 0, then for them to be perpendicular they certainly can't be equal. The dot product does not behave like a distance.
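To make the distinction concrete, a small sketch of the geodesic (angular) distance mentioned above (needs "math"); unlike the raw dot product, it is zero exactly when the two unit vectors are equal:

    // angularDist returns the geodesic distance between two unit vectors,
    // i.e. the angle between them: arccos(u . v).
    func angularDist(u, v []float64) float64 {
        var dot float64
        for i := range u {
            dot += u[i] * v[i]
        }
        // Clamp against floating-point drift outside [-1, 1].
        if dot > 1 {
            dot = 1
        } else if dot < -1 {
            dot = -1
        }
        return math.Acos(dot)
    }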