FANN: Vector Search in 200 Lines of Rust (fennel.ai)
163 points by ingve on June 15, 2023 | 71 comments



Our top_k search in our own product is in Go and it's ~25 lines of code. Super simple stuff; it does the job of finding the top_k in 1ms: no index, no framework, nothing. We benefit from the corpus we embed being relatively static, so we can just store the vectors in Redis; it's not a hard problem to solve. Anyway, you often don't need a turnkey solution for this stuff.


Same here. I just asked ChatGPT to generate Go code to do cosine similarity, and I was surprised that the generated code worked; the worst performance I see is about 10ms.


Yep. If you parallelize with goroutines it'll be faster, but we're talking about shaving a few ms. The effectiveness of a for loop crunching floats in a single thread is not to be underestimated.
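For flavor, a minimal single-threaded sketch of that kind of brute-force top_k in Rust (the article's language; the function name and shapes are illustrative, not anyone's production code):

```rust
// Brute-force top-k: score every corpus vector against the query with a
// dot product, sort descending, keep the k best.
fn top_k(query: &[f32], corpus: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = corpus
        .iter()
        .enumerate()
        .map(|(i, v)| {
            // Dot product; equals cosine similarity when vectors are normalized.
            let score: f32 = query.iter().zip(v).map(|(a, b)| a * b).sum();
            (i, score)
        })
        .collect();
    // Sort by score, highest first (scores assumed non-NaN).
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}
```

A single pass plus a sort over a modest corpus is exactly the "for loop crunching floats" being described.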


I have gotten 10x speedups with SIMD on modern hardware - goes from fast to blazing. https://github.com/stillmatic/gollum/blob/main/vectorstore.g...

FWIW - Weaviate's pure Go implementation of cosine similarity is probably as fast as you can possibly get in pure go, without SIMD -- https://github.com/weaviate/weaviate/blob/2f44dbee52196656bd...


Weaviate actually does use SIMD for AMD64 with Go assembly (cosine distance is just normalized dot product) https://github.com/weaviate/weaviate/blob/master/adapters/re...


Yep, we'd go with what they did, except OpenAI embedding vectors are already normalized, so you can just do the dot product.


Do you need anything special to run a Rust program with SIMD enabled?


What is the size and dimensionality of your corpus?


Size can be up to ~140MB of vectors, so not exactly enormous. But it's nowhere near enough to stress a single node running a for loop on the data and doing a dot product, then sorting by relevancy.


It shouldn't be too hard to extend this to a fairly efficient exact k-nearest neighbor search (at least for Euclidean distance).

The classic approach is to keep your k candidates in a max heap ordered by distance (replacing the farthest candidate whenever the heap is full and you find nearer items in the leaves), and to visit the far side of the hyperplane after the near side if either the distance to the hyperplane is less than the distance to the farthest item in the heap, or you don't yet have k candidates.

A way to visualize it is that your k candidates define a ball (with radius defined by the worst candidate) around the search point that shrinks as you find better candidates. Then you recursively visit all the nodes in the tree that the ball still touches. Any nodes outside the ball can't possibly improve on the candidates within it and so can be excluded.

In graphics rendering, this sort of thing was the approach used by photon mapping [0] to find the closest photons for density estimation.

[0] https://en.wikipedia.org/wiki/Photon_mapping
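A sketch of that bounded candidate heap in Rust (type and method names are illustrative, not from the article):

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Heap entry ordered by distance, so the worst (farthest) candidate is on top.
struct Candidate {
    dist: f32,
    id: usize,
}
impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.dist == other.dist
    }
}
impl Eq for Candidate {}
impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        self.dist.partial_cmp(&other.dist)
    }
}
impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        // Distances are assumed non-NaN.
        self.partial_cmp(other).unwrap()
    }
}

/// Keeps the k nearest candidates seen so far in a max heap.
struct TopK {
    k: usize,
    heap: BinaryHeap<Candidate>,
}

impl TopK {
    fn new(k: usize) -> Self {
        TopK { k, heap: BinaryHeap::new() }
    }
    /// Radius of the "ball" around the query: the worst kept distance,
    /// or infinity while we have fewer than k candidates.
    fn radius(&self) -> f32 {
        if self.heap.len() < self.k {
            f32::INFINITY
        } else {
            self.heap.peek().unwrap().dist
        }
    }
    fn offer(&mut self, id: usize, dist: f32) {
        if dist < self.radius() {
            self.heap.push(Candidate { dist, id });
            if self.heap.len() > self.k {
                self.heap.pop(); // evict the farthest candidate
            }
        }
    }
}
```

During traversal you visit the far side of a split only while the distance from the query to the splitting hyperplane is less than `radius()`, i.e. only while the shrinking ball still crosses the split.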


One of the authors of the post here. I've never heard of this approach but it sounds really cool. Will definitely try it out someday!


Does that scale to high dimensions? IIUC the photon mapping procedure typically uses a kd-tree to find closer/farther candidates, and if you consider something like ada embeddings (1536 dimensions), that would be very computationally intense. Photon mapping feels like a 3D problem.


This random hyperplane algorithm is actually pretty elegant. It has the following amazing property: as your dimension increases, the quality of the random hyperplanes actually gets better!
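The hashing step itself is tiny; a sketch in Rust (illustrative, not the article's exact code), where each random hyperplane through the origin contributes one bit via the sign of the dot product with its normal:

```rust
// One hash bit per random hyperplane: which side of the plane (i.e. the sign
// of the dot product with the plane's normal) the vector falls on.
fn hash_bits(v: &[f32], normals: &[Vec<f32>]) -> Vec<bool> {
    normals
        .iter()
        .map(|n| v.iter().zip(n).map(|(a, b)| a * b).sum::<f32>() >= 0.0)
        .collect()
}
```

Nearby vectors tend to land on the same side of most planes, so they agree on most bits.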


What exactly is elegant about this algorithm? The approach seems computationally quite inefficient to me.


It is inefficient to build (unless on GPU), but good for search. Moreover, the search naturally benefits from data locality after ordering - this allows using MergeTree tables in ClickHouse. See https://news.ycombinator.com/item?id=36341754

And it is around 10 lines of code in SQL: https://presentations.clickhouse.com/meetup74/ai/#35


That's the thing though, it's very easy to perform approximate nearest neighbor search in an efficient way. There are plenty of simple solutions for that. The real challenge is to design an algorithm that can scale & that remains robust even when there are additions and deletions to the set.


That presentation was pretty cool, thanks for posting.


How did they manage to write this article without once mentioning the pioneering LSH algorithm from 1999?

https://en.wikipedia.org/wiki/Locality-sensitive_hashing


> The approach we use here is based on a family of algorithms called "Locality Sensitive Hashing" used in the popular library annoy. This article's goal isn't to introduce a new fancy algorithm/library but to describe how vector search works using real code snippets.

Right at the top...


The article was edited in response to this thread: https://news.ycombinator.com/item?id=36344870


Yes, it wasn't initially there mostly due to an oversight but we later edited the article.


Oh ok, cool you're checking feedback and updating. Might put in an 'updated at' or something :P. But thanks, enjoyed the post!


My bad! That's some fast updating


Always a pleasure to see short implementations, but I'd still take HNSW over such an approach. Its core implementation can also be quite compact. In our case with USearch [1] it's about 2k LOC, even with a custom implementation of priority queues and custom numeric types like uint40_t for indexes beyond 4B points.

Obviously, together with bindings for Rust, Python, JavaScript, Java, Objective-C, Swift, and Wolfram it gets slightly larger, but on the bright side, you can pass JIT-compiled functions and store/search elements of different dimensionality, which allows much broader applicability [2].

[1]: https://github.com/unum-cloud/usearch [2]: https://ashvardanian.com/posts/abusing-vector-search


> I'd still take HNSW over such an approach

Why? Sorry, but your comment is not very insightful without more explanation, and the rest of your comment is just an advertisement for your (admittedly impressive looking) work.


Because (A) it allows similarity search between auxiliary objects, not just equidimensional vectors, and (B) can be implemented in a similarly short fashion, while providing great performance.


I'd be curious how this stacks up on ANN-benchmarks (https://ann-benchmarks.com/).

FWIW exhaustive search is still probably good enough for most use cases. IMO the exhaustive search should use a heap (https://github.com/fennel-ai/fann/blob/main/src/main.rs#L12) since you're only looking for the top-k; it greatly reduces time and memory complexity. On a relatively unoptimized Golang implementation (though much beefier hardware) I get ~100ms to process 1M vectors of 300 dimensions. Still quite a bit slower than approximate search, of course, but in absolute terms probably good enough for most use cases, especially because many use cases don't have 1M vectors.


I've recently presented this approach (exhaustive search with a heap, then iteratively adding simple indices) using ClickHouse. It works pretty well despite its overall simplicity.

Video: https://www.youtube.com/watch?v=hGRNcftpqAk

Presentation: https://presentations.clickhouse.com/meetup74/ai/

PS. It should work at the same speed as "bruteforce-blas" in the ANN benchmarks.


nice, if I'm reading this right, you do random projection as a first stage filter, then exact search over the filtered subset? that's clever, did you test precision/recall at all?


Yes, this is the idea. I set the threshold for approximate search with a margin so that on a few test queries, I get 100% recall, and then it is filtered by precise search.

But I didn't evaluate it thoroughly. The threshold for Hamming distance over bit masks was set almost arbitrarily, and it will be interesting to test further.
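The Hamming-distance-over-bit-masks step is a one-liner; a sketch in Rust, assuming the masks are packed into u64 words:

```rust
// Hamming distance between two equal-length bit masks packed into u64 words:
// XOR each pair of words and count the differing bits.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```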


Isn't this kd-trees with arbitrarily oriented (not axis-aligned) hyperplanes?


The hyperplanes are randomized so that they are probabilistically independent and to cope with high dimensionality, making the data less likely to be clustered into the same bucket.

https://en.wikipedia.org/wiki/Random_projection


It's a binary space partitioning (BSP) [0]. A kd-tree is a special case of a BSP with axis-aligned hyperplanes.

[0] https://en.wikipedia.org/wiki/Binary_space_partitioning


https://www.researchgate.net/profile/Vivek-Kumar-Bagaria-2/p...

This idea improves the scaling from linear in the dimension to logarithmic, with very little loss, using ideas from multi-armed bandits.


The actual arXiv link for the curious: https://arxiv.org/abs/1805.08321


Nice, this is a cool version of ANN search. I like that at the end there is commentary on what's needed for production as well: things like parallelization, RAM considerations, and balancing the trees. It's really the production-level considerations that would steer you toward a vector database like Milvus/Zilliz.


"Build a hyperplane (analog of a "line" in higher dimensions) that passes through C and is perpendicular to the line segment connecting A and B."

Isn't that just a fancy way of saying a line from O (a tuple of zeroes) to C to infinity?


No, this is only true if that hyperplane contains the origin. Imagine all the hyperplanes that contain A and B; there are infinitely many of them for dimensions higher than 2. Now imagine the same, but connecting O and C. Most of those AB hyperplanes are not orthogonal to those OC hyperplanes; it's only coincidence if they are, though for dimensions higher than 2 you can always find a point C that happens to lie along the orthogonal line from the AB plane to the origin.


No, it's more restricted than that. Think of 2d planes in 3 space: There are an infinite number of planes that you can construct on a line from A to C (they're all rotations of a plane extending out from that line).


Okay I'm trying to map this visually in my brain. So, in 3 space, say I extrude a random plane out from a line segment. I keep going, now I imagine I've extruded every possible plane approaching infinity. This eventually forms... a cylinder, no? So a hyperplane is what, a cylinder? Or is it shorthand for "every possible plane..."


In an n-dimensional space, an infinite (n-1)-dimensional thing is a hyperplane. Or put more simply: it's something that divides the space in two. In 1 dimension that's a point, in 2 dimensions it's a line, in 3 dimensions it's a plane (hence the name); for more dimensions we just call it a hyperplane.

Now there are various ways to define which hyperplane you speak about. The easiest to reason about is to give a couple points that are on the hyperplane (1 point defines a point, a line can be defined by any two points on that line, a plane by three points on that plane etc.). A slightly weirder but equally useful way is to say "it's orthogonal to this line, and intersects this point". That's easy to reason about in three dimensions, but apparently works in higher dimensions as well.
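That "orthogonal to this line, intersects this point" definition translates directly into the side test used when descending the tree; a sketch in Rust (hypothetical helper, not from the article):

```rust
// A hyperplane given by a normal vector n and a point p on the plane.
// Which side a point x falls on is the sign of n . (x - p).
fn side(n: &[f32], p: &[f32], x: &[f32]) -> bool {
    n.iter()
        .zip(x.iter().zip(p))
        .map(|(ni, (xi, pi))| ni * (xi - pi))
        .sum::<f32>()
        >= 0.0
}
```

This works unchanged in any number of dimensions, which is why the normal-plus-point definition is the convenient one for vector search.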


No, rotate by 90 degrees. The plane is normal to the line, so the line doesn’t lie in the plane.


This is great! See also Voy, A WASM vector similarity search written in Rust:

https://github.com/tantaraio/voy


One of the authors of the post here. Happy to take any questions!


Wondering how much details you have considered on using Johnson-Lindenstrauss transform for dimensionality reduction on ANN search?

e.g. https://www.cs.princeton.edu/~chazelle/pubs/FJLT-sicomp09.pd...


We hadn't considered that at all, will definitely check it out and explore further. Thanks!


I couldn't find anywhere in the article, how long does it take to add all 10,000 embeddings to the index? That seems quite important for many applications


It should take less than one second on CPU.

For example, see my comment above: https://news.ycombinator.com/item?id=36341754


We did some benchmarking and found that an index of 1M vectors using 15 trees can be built up in ~115 seconds (8 CPU, 8 GB RAM). https://github.com/fennel-ai/fann/pull/8.

Given this, around a second to build an index of 10K items sounds about right.


The code for the Hyperplane type is unfortunately omitted. It’s the most important code in the whole thing.



Yes, but it should be in the article if everything else is.


Apologies, that got omitted by mistake. I've now updated the post to include that piece of code as well.


How is this different than ANNOY?


It is cited, but buried half-way into the article. There is a link in it that leads to https://erikbern.com/2015/10/01/nearest-neighbors-and-vector...


The article isn't meant to present a new library competing with Annoy or other state of the art approaches, but instead, describe how vector search works using short code snippets.

I've now updated the article to clearly mention this and also cite annoy.


The dot product isn’t a distance


My favorite kind of pedantry is pedantry that's naive - on like day 1 of topology 101 you learn that the "dot" product is an inner product and every inner product naturally induces a metric so every inner product space is a metric space (with the induced metric).


It's not pedantry. Just because an inner product induces a metric doesn't make the inner product the same thing as that metric. The performance of similarity searches can depend very much on whether you use a dot product or a distance.


A distance is a function d(x, y) with d(x, y) = 0 if and only if x = y (among other properties). If we let d(x, y) = x*y, then for x = y != 0 we get d(x, x) = ||x||^2 != 0, so the dot product is certainly not a distance. Your comment isn't relevant, but thanks for calling me a pedant. ;-)


`is != induces`


Dot product is equivalent to cosine similarity for normalized vectors


Given normalized vectors, it's also a monotone function of the squared Euclidean distance. Meaning for ranking (and thus k-nearest-neighbors) it doesn't matter if you use Euclidean distance or cosine similarity; you get the same result.
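For unit vectors this follows from the identity ||x - y||^2 = 2 - 2(x . y); a quick numeric check in Rust (helper names are illustrative):

```rust
// Inner product of two vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
// Squared Euclidean distance between two vectors.
fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}
```

So a larger dot product means a smaller Euclidean distance, and the k-nearest lists coincide.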


It isn't equivalent to it. See my other comment.


Dude.

cos(x, y) = (x . y) / (||x|| ||y||)

If x and y are normalized:

cos(x, y) = x . y


(x . y) / (||x|| ||y||) is a very different function from x . y

It also isn't a distance...


It is for unit vectors.


My math is weak. I thought dot product measures alignment. Help me grok what "distance" means here?


Unit vectors have normalized length, so the only difference between two unit vectors is the "heading". The angle between them can be interpreted as the "difference", or distance, between them.

Since you can compose any vector in the space out of unit vectors, you can extend the concept. See above comment inner product -> metric space.


Hmm, I think I got my math wrong too.

A proper distance function is (1 - |a*b|). That's proportional to the distance between those points as the circumference of the circle/sphere/hypersphere is projected onto the plane they span.


The dot product of two unit vectors is the cosine of the angle between them. cos^-1(u*v) is that angle, i.e. the geodesic distance between u and v regarded as points on the sphere, and that is a distance. But the dot product itself is not.

If u = v, then u*v = 1, but the distance between a point and itself should always be zero. Remember that u*v = 0 if u and v are perpendicular; if neither u nor v is 0, then for them to be perpendicular they must certainly not be equal. The dot product does not behave like a distance.



