Thanks for writing this one, Simon. I read it some time ago and just wanted to say thanks and recommend it to folks browsing the comments; it's really good!
Nicely written and very approachable. Might add a paragraph on why to use cosine similarity, as it gives a chance to illustrate how the n-dimensional vector embedding is used.
I still don't understand why cosine similarity is used (other than maybe being cheap to compute). I've read a bunch of hand-wavey articles but none of them add up for me.
If all your vectors have the same 2-norm, the cosine similarity of two vectors is a monotonic function of their Euclidean distance (i.e. the minimum in one measure will be the minimum in the other).
If all your vectors don't have the same norm, where's the sense in projecting them onto the unit sphere and measuring the angle between them? You're throwing away important information about the embedding.
I have to say, neither of those answers is very satisfying. Neither is wrong, but I don't come away with a clear explanation for why cosine similarity is the obvious metric to use.
For Euclidean distance, we have ||q - v||^2 = ||q||^2 + ||v||^2 - 2(q·v). The ||q||^2 and ||v||^2 terms don't tell us anything about the relationship between q and v, so they're irrelevant to search. So really we are mostly trying to maximize the inner product when we search.
Additionally, the scale of q (the query) is constant across any given search, so might as well normalize q; the results won't change.
Then the only question is whether the scale of v matters... It, again, doesn't tell us anything about the relationship between q and v. So, might as well normalize... And then we are just doing cosine similarity.
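To make this concrete, here's a rough NumPy sketch (illustrative only): once everything is normalized, ranking by cosine similarity, inner product, and Euclidean distance all agree.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=128)            # query
vs = rng.normal(size=(1000, 128))   # database vectors

# Normalize everything to unit length.
qn = q / np.linalg.norm(q)
vn = vs / np.linalg.norm(vs, axis=1, keepdims=True)

cosine = vn @ qn                          # cosine similarity (= inner product on unit vectors)
euclid = np.linalg.norm(vn - qn, axis=1)  # Euclidean distance on the unit sphere

# On the unit sphere, ||q - v||^2 = 2 - 2*cos(q, v), so both agree on the winner.
assert np.argmax(cosine) == np.argmin(euclid)
```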
Cosine similarity is not affected by the magnitude of the vectors, only by the angle between them. This means that vectors with large or small values will have the same cosine similarity as long as they point in the same direction.
In semantic search the magnitude of vectors isn’t relevant - if you needed a measure that took magnitude into account, then Euclidean distance would make sense. E.g. image embeddings based on pixel intensity.
It's not obvious to me that vector size isn't relevant in semantic search. What is it about the training process for semantic search that makes that the case?
Think of partitioning the space by taking random hyperplanes. You can tell which side of a plane a given point is on by computing the dot product of the plane's normal with that point.
For two points that are very close to each other in Euclidean distance, the chance that a random hyperplane cuts right between them is very low. What matters is, of course, the angle between them as measured from the origin, but that angle gets very small the closer the two points are to each other in Euclidean norm, and the further they are from the origin. By bisecting the space randomly and seeing which side points fall on, we get a decent hashing function, binning similar points together.
Now, a point could be very close to the origin and have a small angle to another point very far from the origin, but in some sense there needs to be a conspiracy for this to happen. If you assume the points are drawn roughly uniformly from a D-dimensional hypercube or hypersphere, the chance that a point lands very near the origin goes down, as does the chance that it lies in a thin cone around some other point.
I believe the basic assumption here is that independent features will be randomly distributed, whereas correlated features will be clumped together (in Euclidean distance). The randomness pushes unrelated points away from the origin and makes it increasingly unlikely that the angle between them is small.
If the points aren't random enough (say, many of them sit very close to the origin while others don't, or there's some other dependence that causes problems), this could foil the hyperplane trick.
I believe I heard this basic argument from a talk on locality-sensitive hashing [0]. I think I have the basic intuition, but maybe that talk will be more enlightening.
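Here's a rough sketch of that random-hyperplane trick in NumPy (illustrative only): hash each vector by the signs of its dot products with random normals, and vectors with a small angle between them tend to land in the same bucket.

```python
import numpy as np

def lsh_signature(x, normals):
    # Which side of each random hyperplane (through the origin) is x on?
    return tuple((normals @ x) > 0)

rng = np.random.default_rng(1)
dim, n_planes = 64, 16
normals = rng.normal(size=(n_planes, dim))   # normals of the random hyperplanes

a = rng.normal(size=dim)
b = a + 0.01 * rng.normal(size=dim)          # nearly the same direction as a
c = rng.normal(size=dim)                     # unrelated vector

sig_a, sig_b, sig_c = (lsh_signature(v, normals) for v in (a, b, c))
print(sum(x != y for x, y in zip(sig_a, sig_b)))  # usually 0 or 1 differing bits
print(sum(x != y for x, y in zip(sig_a, sig_c)))  # usually around half the bits
```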
The "Do you need a Dedicated Vector Database?" slide is quite interesting, but doesn't really answer its own question! This is something I've been wondering myself, if anyone has any guidelines or rules of thumb I would appreciate it.
I've recently been using Simon Willison's (hi simonw) excellent `llm` tool that can help with embeddings, and it takes the simplest approach possible: just store embeddings in SQLite with a few UDFs for calculating distance etc.
The simplicity of that approach is very appealing, but presumably at some level of traffic+data an application will outgrow it and need a more specialized database. Does anyone have a good intuition for where that cutoff might be?
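For anyone curious what "SQLite plus a few UDFs" looks like in practice, here's a minimal sketch of the idea (my own illustration, not the actual `llm` implementation; embeddings stored as JSON text purely for simplicity):

```python
import json
import math
import sqlite3

def cosine_distance(a_json, b_json):
    # UDF: embeddings stored as JSON arrays of floats.
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

db = sqlite3.connect(":memory:")
db.create_function("cosine_distance", 2, cosine_distance)
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, embedding TEXT)")
# ... insert rows with their embeddings serialized as JSON ...

query_embedding = json.dumps([0.1, 0.2, 0.3])
rows = db.execute(
    "SELECT id, body FROM docs ORDER BY cosine_distance(embedding, ?) LIMIT 10",
    (query_embedding,),
).fetchall()
```

It's a full scan on every query, which is exactly why the cutoff question matters.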
Some years ago I wrote a vector index that used LSH. Search was implemented in the most trivial way: scanning everything and comparing using hamming distance (xor and popcount). I was able to scan 200k hashes and find the most similar in less than 10ms, in a single core of a 2011 MBP.
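The inner loop really is that simple; a sketch (Python, illustrative; assumes the hashes are 64-bit LSH signatures and Python 3.10+ for the built-in popcount):

```python
def hamming(a: int, b: int) -> int:
    return (a ^ b).bit_count()   # xor, then popcount

def nearest(query: int, hashes: list[int]) -> int:
    # Brute-force scan: return the most similar stored hash.
    return min(hashes, key=lambda h: hamming(query, h))
```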
I’ve had some success scaling a quick and dirty engine (that powers findsight.ai) to tens of millions of vectors, details in the talk here: https://youtu.be/elNrRU12xRc?t=1556
That’s maybe 1kLOC so I didn’t need an external one after all.
Honestly just loading all vectors in-memory (and stuff like sqlite, pgvector) is totally fine when you're dealing with O(100k) vectors, but beyond that all the workable options like pinecone get gnarly, slow, and ridiculously expensive.
The best option by far I know of is turbopuffer.com, which is like 100x cheaper than pinecone and seems to actually scale.
Since it's not listed in the suggested vector dbs section of the slides, wanted to lob it in as a solid suggestion :)
YMMV, of course, but this[1] article - posted here a few weeks back - really helped us to focus on how to think about what we actually needed: in short, a search engine.
A few months ago, I taught a class on Vector Databases for a TGE Data private client and then decided to record it into a short course for a wider audience.
The course is a mix of theory and demos discussing some of the underlying concepts of Vectors, Vector Databases, Indexing, and Similarity Search, ending with demos specifically for the Pinecone and Weaviate databases.
I already have a Udemy account, so when I went to check out and logged in, it said the promo was for new users only. But I went back to the course page and reloaded it while logged in, and then I was able to buy it with the coupon applied.
A nice introduction. However, I feel that, like many such introductions, it skips too quickly over the question of feature selection.
This is the step that introduces, rather subtly and sometimes almost ignorably, human judgement into what otherwise feels like a very automated and "it's just math" system.
Let's take audio. What features will you select from it to make your N-dimensional vector? An easy answer might appear to be "as many as possible". However, your first problem is that you may not have access to feature characterization even for just the features you can trivially name. Your second problem is that without deep knowledge of the domain, you may not even be aware of potential features that you should probably be using. Your third problem is that even with deep knowledge of the domain, you may still be unaware of potential features that you probably should be using.
A practical example may help. Perhaps you're a fan of Reichian phase-shifting minimalist music. You (indirectly) interrogate a vector database to find music similar to a canonical piece from this genre (say, Piano Phase). The database uses many audio & music related features, including but not limited to: dominant frequency, note onset intervals, volume, frequency-distribution-based timbral characteristics, apparent root notes and scales etc etc etc.
Unfortunately for you, the feature set in the database does not include "note-to-note intervals over time is constant". Your query will manage to find things that are timbrally, harmonically, melodically and rhythmically similar, but it will be by sheer good luck that any of them share the key characteristic: steady changes in the relative phase of two different (or the same) melodic lines.
This is just one example. It's not hard to cook up equivalents for visual, textual, numerical ... any data at all.
Now of course, none of this makes vector databases and feature classification useless. It does mean that when you do or do not find match patterns in a given data set, one of the early questions you need to ask yourself is whether the feature set has any strong guarantee of being complete, and if not, how it may need to be expanded.
Vector databases are oriented around search and retrieval. The usual approach to generating vectors is to fine-tune a large pretrained model and extract the inner representations. Because the dataset contains successful queries and retrieval results, all you need to do is optimize a loss function (usually a contrastive loss, or approximated versions of common ranking functions) using raw inputs on the similarity objective supported by the vector database. For common modalities like tabular/text/image/audio data, there is basically no human judgement involved in feature selection - just apply attention.
Note: current state-of-the-art text-to-vector models like E5-Mistral (note: very different from the original E5) don’t even require human curation of the dataset.
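As a rough illustration of the contrastive objective mentioned above, here's a sketch (PyTorch, in-batch negatives, InfoNCE-style; the embeddings are assumed to come from whatever pretrained model you're fine-tuning):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, doc_emb, temperature=0.05):
    # Normalize so the similarity is cosine similarity.
    q = F.normalize(query_emb, dim=-1)          # (batch, dim)
    d = F.normalize(doc_emb, dim=-1)            # (batch, dim)
    logits = q @ d.T / temperature              # (batch, batch) similarity matrix
    # The i-th query's positive is the i-th document; the rest of the batch
    # serves as negatives.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```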
This is also just eliding the work that humans still have to do.
I give you an audio file (let's say just regular old PCM wav format). You cannot do anything with this without making some decisions about what happens next to the data. For audio, you're very minimally faced with the question of whether to do a transform into the frequency domain. If you don't do that, there's a ton of feature classification that can never be done. No audio-to-vector model can make that sort of decision for itself - humans have to make that possible.
Raw inputs are suitable for some things, but essentially E5 is just one model that already has a large number of assumptions built into it, which happen to give pretty good results. Nevertheless, were you interested for some weird reason in a very strange metric of text similarity, nothing prebuilt, even E5, is going to give that to you. Let's look at what E5 does:
> The primary purpose of embedding models is to convert discrete symbols, such as words, into continuous-valued vectors. These vectors are designed in such a way that similar words or entities have vectors that are close to each other in the vector space, reflecting their semantic similarity.
This is great, but useful for only a particular type of textual similarity consideration.
And oh, what's this:
> However, the innovation doesn’t stop there. To further elevate the model’s performance, supervised fine-tuning was introduced. This involved training the E5 embeddings with labeled data, effectively incorporating human knowledge into the learning process. The outcome was a consistent improvement in performance, making E5 a promising approach for advancing the field of embeddings and natural language understanding.
Hmmm ....
Anyway, my point still stands: choosing how to transform raw data into "features" is a human activity, even if the actual transformation itself is automated.
I agree with your point at the highest (pretrained model architect) level, but tokenization/encoding things into the frequency domain are decisions that typically aren’t made (or thought of) by the model consumer. They’re also not strictly theoretically necessary and are artifacts of current compute limitations.
Btw E5 != E5 Mistral, the latter achieves SOTA performance without any labeled data - all you need is a prompt to generate synthetic data for your particular similarity metric.
> Unlike existing methods that often depend on multi-stage intermediate pre-training with billions of weakly-supervised text pairs, followed by fine-tuning with a few labeled datasets, our method does not require building complex training pipelines or relying on manually collected datasets… We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across nearly 100 languages.
It’s true that ultimately there’s a judgement call (what does “distance” mean?), but I think the original post far overcomplicates what’s standard practice today.
Sorry, I just don't believe this generalizes in any meaningful sense for arbitrary data.
You cannot determine frequencies directly from raw audio PCM data. If you want to build a vector database of audio, with frequency/frequencies as one of the features, at the very least you will have to arrange for a transform to the frequency domain. Unless you claim that a system is somehow capable of discovering Fourier's theorem and implementing the transform for itself, this is a necessary precursor to any system being able to embed using a vector that includes frequency considerations.
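For concreteness, the kind of transform being described is just this (NumPy sketch; assumes mono PCM samples already loaded into `samples` at `sample_rate` Hz):

```python
import numpy as np

def magnitude_spectrum(samples: np.ndarray, sample_rate: int):
    spectrum = np.abs(np.fft.rfft(samples))                       # magnitude per frequency bin
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)    # bin -> Hz
    return freqs, spectrum

# Only after this step do "dominant frequency"-style features become available
# to whatever builds the embedding.
```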
But ... that's a human decision, because humans think that frequencies are important to their experience of music. A person who is totally deaf, however, and thus has extremely limited frequency perception, can (often) still detect rhythmic structure through bone conduction. Such a person interested in similarity analysis of audio would have no reason to perform a domain transform, and would be more interested in timing correlations, which probably could be fully automated into various models as long as someone remembers to ensure that the system is time-aware, which is, again, just another particular human judgement about which aspects of the audio have significance.
I just read the E5 Mistral paper. I don't see anything that contradicts my point, which wasn't about the need for human labelling, but about the need for human identification of significant features. In the case of text, because of the way languages evolve, we know that a semantic-free character-based analysis will likely bump into lots of interesting syntactic and semantic features. Doing that for arbitrary data (images, sound, air pressure, temperature) lacks any such pre-existing reason to treat the data in any particular way.
Consider, for example, if the "true meaning" of text were encoded in a somewhat Kabbalah-esque scheme, in which distant words and even phonemes create tangled loops of reference and meaning. Even a system like E5 Mistral isn't going to find that, because that's absolutely not how we consider language to work, and thus it's not part of the fundamentals of how even E5 Mistral operates.
Understanding audio with inputs in the frequency domain isn’t required for understanding frequencies in audio.
A large enough system with sufficient training data would definitely be able to come up with a Fourier transform (or something resembling one), if encoding it helped the loss go down.
> In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.
Today’s diffusion models learn representations from raw pixels, without even the concept of convolutions.
Ditto for language - as long as the architecture is 1) capable of modeling long range dependencies and 2) can be scaled reasonably, whether you pass in tokens, individual characters, or raw ASCII bytes is irrelevant. Character based models perform just as well (or better than) token/word level models at a given parameter count/training corpus size - the main reason they aren’t common (yet) is due to memory limitations, not anything fundamental.
Great comment about not knowing what you don't know. Also wonderful to see another Steve Reich aficionado. What are some pieces similar to Music for 18 Musicians, which I adore?
However, I do love several of the pieces by Marc Mellits, check out "Real Quiet plays the music of Marc Mellits". All rather short, but some totally lovely.
You may enjoy an 80's album by Andrew Poppy, in particular the track "The Object is a Hungry Wolf".
Do be sure to check "Shift" by Chris Hughes, which is a set of re-instrumentations of several early Reich pieces, including Piano Phase.
It's quite different in many ways from Mf18M, and you may have heard it, but I adore "The Chairman Dances" by John Adams. It couples the "chugging along" of Mf18M with much less minimalist melodic and harmonic structures. YMMV.
I'm sure I am going to remember a bunch more after I send this, so check back here in a couple of days, and I'll reply again with any extras.
Great overview, but the final section doesn't address the obvious question about how to decide between using a "vector store", like Postgres+pgvector vs a "vector database", like Pinecone. I'd love to see another presentation which discusses the various tradeoffs, like query speed, insertion/index-building speed, ease-of-use, and others to help guide people trying to decide which of the options is best for their application.
Since digitaloceanspaces.com is an S3-style hosting provider, it would be neat if Hacker News could special-case it to display something like tge-data-web.nyc3.digitaloceanspaces.com as the domain instead of just digitaloceanspaces.com
From looking at this, I think it's a very risky starting point for an engineer.
Claims like the data being clustered by meaning and optimized for analytics are questionable.
The clustering depends on the embedding you calculate. If you think the embedding is a good semantic approximation of the data, then maybe this is a fine way of thinking about it. But it’s not hard to imagine embeddings that violate this — e.g. if I run an audio file and a text file that are identical in meaning through the same embedding process, unless it is multimodal they will likely be distant in the embedding vector space.
I fully expect to see embeddings that put things close together in the vector space based on utilization rather than semantic similarity. If I’m creating a recommender system, I don’t want to group different varieties of one-off purchases closely. For instance, the most semantically similar flight is going to be another flight to the same destination at a different time, or a flight to a nearby airport. But I would want to group hotels often purchased by people who have previously bought the flight.
Vector databases also allow you to provide extra dimensionality into the data, like time awareness. Nothing is forcing you to use a vector that encodes semantic meaning.
And from this, you can see that these systems are optimized for lookups or searches based on an input vector. This is not analogous to OLAP queries; it's more akin to Elasticsearch than Snowflake. If you are using a vector database thinking it’s going to give you reporting or large-scale analytics on the vector space, afaik there isn’t a readily available offering.
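A toy sketch of the "extra dimensionality" idea above (my own illustration in NumPy; the recency feature and its weight are made up for the example):

```python
import numpy as np

def index_vector(semantic_emb: np.ndarray, age_days: float, weight: float = 0.5) -> np.ndarray:
    # Append a non-semantic dimension (recency) alongside the semantic embedding.
    recency = np.exp(-age_days / 30.0)                    # decays over roughly a month
    vec = np.concatenate([semantic_emb, [weight * recency]])
    return vec / np.linalg.norm(vec)                      # renormalize for cosine search
```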
Calculating the embeddings is still a mystery to me. I get going from a picture of an apple to a vector representing "appleness" and then comparing that vector to other vectors using all the usual math. What I don't get is who/what takes the image as input and outputs the vector. Same goes for documents. Let's say I want to add a dimension (another number in the array): what part of the vector database do I modify to include this dimension in the vector calculation? Or is going from doc/image/whatever to the vector representation done outside the database in some other way?
edit: it seems like calculating embeddings would be something an ML algorithm would do but then, again, you have to train that one first. ...it's training all the way down.
Yup it happens outside of the system — but there are a number of perks to being able to store that data in a db — including easily adding metadata, updating entries, etc.
I think in 10 years we will see retail systems heavily utilizing vector DBs and many embedding-as-a-service products that take into account things like conversion. In this model you can add metadata about products to the vector DB and direct program flow, instead of querying back out to one or more databases to retrieve relevant metadata.
They’ll also start to enable things like search via image for features like “show us your favorite outfit” pulling up a customized wardrobe based on individual items extracted from the photo and run through the embedder.
Just one of many ways these products will exist outside of RAG. I think we’ll actually see a lot of the opposite — GAR.
I don't know why PQ is listed as an "indexing strategy". It's a vector compression/quantization technique, not a means of partitioning the search space. You could encode vectors with PQ when using brute-force/flat index, an IVF index, with HNSW (all of which are present in Faiss with PQ encoding as IndexPQ, IndexIVFPQ and IndexHNSWPQ respectively), or even k-D trees or ANNOY if someone wanted to do that.
"Use HNSW or Annoy for very large datasets where query speed is more important than precision": Graph-based methods have huge memory overhead and construction cost, and they aren't practical for billion-scale datasets. Also, they will usually be more accurate and faster than IVF techniques (as you would need to visit a large number of IVF cells to get comparable accuracy), though IVF can scale to trillion-sized databases without much overhead yet with reasonable speed/accuracy tradeoffs unlike other techniques. I'd say "use for medium-scale datasets where query speed is important, yet high accuracy is still desired and flat/brute-force indexing is impractical".
It turns a continuous space into a discrete space. You would first do PQ, then do KNN of the new discrete vector. This way you can compress the vocabulary to a fixed size.
The implementations that I am aware of, including the one in Faiss which I wrote (described in detail in https://arxiv.org/abs/1702.08734), do not index the vector based on its PQ encoding (e.g., in IVFPQ). The IVF cell chosen in which to put the vector is based on its pre-quantized (full precision floating point) representation. It would lose too much precision to perform all comparisons in the compressed space.
Also, distance comparisons are usually of the "ADC" (asymmetric distance comparison) form: the query vector is in full floating point format, and database vectors, if quantized/compressed, are effectively decompressed and compared in the full floating point domain. This is true even with PQ, as the distances between each query subvector and each of the 2^n PQ codes for that subspace are precomputed before comparison (so the distance computation becomes a lookup-add over the precomputed distance tables).
LSH techniques unlike PQ are more accurately described as an indexing technique, since the buckets into which vectors are placed are based on their encoding (via hashing/binarization) and are fully compared in that space.
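A rough numeric sketch of that ADC lookup-add computation (NumPy, illustrative; assumes 8-bit codes, i.e. 256 centroids per subspace):

```python
import numpy as np

def adc_distances(query, codebooks, codes):
    # query:     (d,) float query vector, split into M subvectors
    # codebooks: (M, 256, d // M) PQ centroids per subspace
    # codes:     (n, M) uint8 PQ codes for the database vectors
    M = codebooks.shape[0]
    q_sub = query.reshape(M, -1)                               # (M, d/M)
    # Precompute ||q_m - centroid||^2 for every subspace and centroid.
    tables = ((codebooks - q_sub[:, None, :]) ** 2).sum(-1)    # (M, 256)
    # Distance to each database vector is the sum of M table lookups.
    return tables[np.arange(M), codes].sum(axis=1)             # (n,)
```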
Any recommendations for an embedded embedding database (heh)? Embedded, as in sqlite. For smaller-scale problems, but hopefully more convenient than say, LMDB + FAISS.
You can take a look at txtai (https://github.com/neuml/txtai). It can run in a Python process. It has support for storing content in SQLite and embeddings vectors in local vector index formats (Faiss, HNSW, Annoy).
FWIW, Simon Willison's `llm` tool just uses SQLite plus a few UDFs. The simplicity of that approach is appealing to me but I don't have a good sense of when+why it becomes insufficient.
I found the code of Qdrant very readable (with nice comments at the top) even if you are not (quite) aware of how vector databases (exactly) work. You need to be a programmer, but with that as a given, it reads surprisingly easily for a production database (that companies like OpenAI etc use).
I find it interesting how everyone ignores EuclidesDB (https://euclidesdb.readthedocs.io), which came before Milvus and the others, in 2018; it is free and open source. Same for all presentations from major DBs.
If I read organized data from a traditional db, and add new tables based on potential similarities from the existing data/tables, with scores, would that be a vector db?
When does a simple linear search become an issue? For example, a cosine similarity function that runs over even a million vectors is surprisingly fast. We can further shard vectors by user (i.e. if they can only search the data they uploaded themselves), or even by document (for things like "chat with this PDF"), so it's likely that most searches will happen over small ranges of vectors.
I suspect the main downsides would be increased CPU usage and having to load vectors to RAM all the time. However, for small projects with low RPS you could probably get away without a special database?
Does anyone know of any benchmarks where they compare vector databases to a plain linear search? How they scale in regards to increased vector count and RPS, etc.
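As a rough brute-force baseline (NumPy sketch, illustrative numbers): exact cosine search over a million 384-dimensional float32 vectors is a single matrix-vector product over about 1.5 GB of data.

```python
import time
import numpy as np

n, d = 1_000_000, 384
vectors = np.random.random((n, d)).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # pre-normalize once

query = np.random.random(d).astype("float32")
query /= np.linalg.norm(query)

start = time.perf_counter()
scores = vectors @ query                       # cosine similarity against all n vectors
top10 = np.argpartition(-scores, 10)[:10]      # indices of the 10 best matches (unordered)
print(f"{time.perf_counter() - start:.3f}s")   # often well under 100 ms on a modern CPU
```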
Brute force search is both exact (100% accurate) and pleasantly linear – a predictable algorithm. CPUs and caches like that, so performance is much better than you might otherwise expect.
"Brute force doesn’t care about the number of neighbours, so its performance is 679ms/query regardless of “k”. When run in batch mode (issuing all 100 queries at once), average query time drops to 354ms/query."
This was 500 dimensions over a 3.7M dataset (the English Wikipedia), in 2014. So, ~700ms/search, or half that if you can batch several searches together at once. YMMV.
Do you mean using brute force (exact search) and not approximate nearest neighbor search? From my unscientific benchmarks, doing an exact search with 100k vectors takes about 1 second. Usually chatbots take much longer to generate the text, so that's still an acceptable time for doing exact search, which also means you will always find the best matches.
>Do you mean using brute force (exact search) and not approximate nearest neighbor search?
Yes.
>doing an exact search with 100k vectors takes about 1 second.
The speed can depend on the language/runtime of your choice and the number of dimensions. What language/runtime did you use, and how many dimensions per vector? (I've heard that too many dimensions for vector search is overkill.)
For recognizing features such as hair and skin color, which would do a better job? Machine learning with image classification? Or a vector database?
I've had Weaviate return a dog given a human as input when doing an image similarity search, so I was wondering if there's some way to improve the results or whether I'm barking up the wrong tree.
For those two in particular? You'd definitely get a better result with an ML model such as a convolutional neural net. In some sense, using an image similarity query is a kind of ML model - nearest neighbor - which can work in some scenarios. But for this specifically, I'd recommend a CNN.
I would think you could partially address that issue by improving your embedding space. Similarity search (as a result of some contrastive loss) definitely suffers at the tails, and out-of-distribution behavior is pretty bad. That being said, you're more likely to get higher recall than with a more classical technique.
You could consider something like BLIP2. There are multiple ways you could use it: embed images and match them against embeddings of text descriptors, train custom embedding of descriptors, or a classifier on top of the embedding (linear layer on top of the image embedding network).
The approaches increase in complexity. This also allows for dataset bootstrapping:
Let's say you want to classify cats by breed. You could start by embedding images and text descriptors and distance-matching the embedded descriptors to the images. This gives you a dataset that might be 90% correct; you can then clean it up, which would be easier to do than manually labelling it.
Based on that improved dataset, you can train a custom embedding for the labels or a classification layer on top of the image embedding network.
You don't use vector databases independently; you need to input the embedding from an ML model.
For your use-case, it should be pretty simple. You could use a CNN and train it, or use YOLO, Deepface, or other face detection algos, then within the face, find the hair and find the skin.
From there you can use a vector database to get the colors that resemble other inputs, or you can use a simple CNN to classify the hair and skin to the closest label.
Yesterday I watched a talk from PostgresML (https://www.youtube.com/watch?v=D8YMMOhAeBs) and found it very practical and to the point on how you could use Postgres to build a vector database.
When I started playing around with AI applications and learnt about RAG techniques, it took me a while to grok vector databases, and it was a bit of a pain to learn how to set one up. So I built my own little pet project, RagTag [1], which abstracts the vector database behind a simple CRUD API. I simply POST documents to RagTag and they are automatically converted into embeddings and made available for similarity-search queries.
Yes, most RAG is done with a vector database. Or something that approximates a vector database, like a traditional database with a vector-index add-on.
Keyword or text-matching-based search is likely still more popular due to how long it's been around, its simplicity, and the tooling built around it. Most companies that have internal search are most likely not using vector/semantic search, but are doing something more basic.
That doesn't make sense. SurrealDB doesn't even tout itself as having good performance. If you don't want a custom vector database, you might as well use postgres.
- A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge, https://arxiv.org/abs/2310.11703
- Survey of Vector Database Management Systems, https://arxiv.org/abs/2310.14021
- What are Embeddings, https://raw.githubusercontent.com/veekaybee/what_are_embeddi...
---
h/t: https://twitter.com/eatonphil/status/1745524630624862314 and https://twitter.com/ChristophMolnar/status/17457316026829826...