Vector support in PostgreSQL services to power AI-enabled applications (cloud.google.com)
77 points by srameshc on July 1, 2023 | 23 comments



Wow. I take back the doubts I had about Postgres vectors taking off in https://news.ycombinator.com/item?id=36481926#36484572 , because it looks like they've made a significant investment in it.


I'm new to this field, and I have questions.

- It seems like these embeddings are learned during the training phase. Presumably by backpropagating to the embedding during each epoch?

- Given that they are learned, doesn't that mean that they are completely context specific to a given trained model? Ie, they can't readily be shared on their own?

- What does "similar" mean here? Is there some emerging practice on how to determine how close is close "enough" between multiple vectors for the purpose of similarity searches? Is this too determined by the model weights somehow?

I hope I'm not missing something fundamental with these questions. Or maybe I hope I am missing something and someone points out my errors. That's good too. :)


I'm also new to this field, but I think I can answer some of these:

1. They are learned during training; I'm not sure about the second part.

2. There are two parts here. First, they are not context specific to the model that learned them. This was a problem with earlier embeddings (like Word2Vec), where embeddings were static values that depended on the model's context. Transformers (like GPT), however, generate context-aware embeddings, which means the model understands that words can have different meanings depending on their context. The second part is whether you can share them on their own, and the answer is not really: the context-aware embeddings are produced by the neural network itself, so you can't really separate the embeddings from the model, because the embeddings ARE the model.

3. 'Similar' in this case means what they call 'semantic similarity', which is a measure of how close in meaning two inputs are. It's usually calculated using cosine similarity, which lets you measure the closeness of two vectors in an arbitrary number of dimensions.
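To make that concrete, here's a minimal numpy sketch of cosine similarity. The 4-dimensional vectors are made-up placeholders; real embedding models produce hundreds or thousands of dimensions, but the math is identical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction,
    0.0 means orthogonal (unrelated), -1.0 means opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" -- purely illustrative values.
cat = np.array([0.8, 0.1, 0.3, 0.0])
kitten = np.array([0.7, 0.2, 0.4, 0.1])
truck = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine_similarity(cat, kitten))  # high -- semantically close
print(cosine_similarity(cat, truck))   # lower -- less related
```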


It's worth mentioning that there are embeddings inside the model (e.g. the vector each token corresponds to, nn.Embedding) and the "embedding" of an input, and the same word gets used for both. They might not even live in the same space (so they are not comparable): the former are learned during training, while the latter are computed at inference time.

For example, take a sentence like "The quick brown fox jumps over the lazy dog". A tokenizer turns that into "[The][<S>][qui][ck][<S>][br][own][<S>]..." which corresponds to a sequence of indices into a lookup table: "12,0,653,34,0,...". Those are then looked up in the model embedding (yielding, e.g., a 300-dimensional vector for each token). The model is run over that output, creating a 500-dimensional vector for the sentence (for example, if the model is an nn.Linear(300, 500)).
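A rough PyTorch sketch of that pipeline; the vocabulary size, token ids, and the mean-pooling step are all illustrative assumptions, not anything a specific model actually uses:

```python
import torch
import torch.nn as nn

vocab_size = 50_000                                   # made-up vocabulary size
token_ids = torch.tensor([[12, 0, 653, 34, 0, 87]])   # toy tokenizer output (indices into the vocab)

embedding = nn.Embedding(vocab_size, 300)  # learned lookup table: token id -> 300-dim vector
model = nn.Linear(300, 500)                # stand-in for the rest of the network

token_vectors = embedding(token_ids)                 # shape: (1, 6, 300), one vector per token
sentence_vector = model(token_vectors).mean(dim=1)   # pool token outputs into one 500-dim sentence vector
print(sentence_vector.shape)                         # torch.Size([1, 500])
```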

Why would we want vector databases for this? Suppose the sentence is actually a help-text question. The final 500-dimensional vector for the question and the corresponding answer text could be stored in the database. When a new user asks a new question, you can run the model to create a new 500-dimensional vector, find the nearest answered-question vector, then return the corresponding answer text.
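A toy sketch of that lookup, where random vectors stand in for real 500-dim question embeddings and the "database" is just an in-memory list:

```python
import numpy as np

# Hypothetical stored pairs of (question_vector, answer_text).
stored = [
    (np.random.rand(500), "To reset your password, click 'Forgot password'."),
    (np.random.rand(500), "Invoices can be downloaded from the Billing page."),
]

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def answer(new_question_vector):
    # Find the stored question closest to the new one and return its answer.
    _, best_answer = max(stored, key=lambda pair: cosine(pair[0], new_question_vector))
    return best_answer

print(answer(np.random.rand(500)))
```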

So to answer your questions: (1) Yes, the model embeddings (300 dim above), but not all "embeddings" (500 above). (2) Yes (in both cases). (3) Yes, people usually define a similarity measure (e.g. cosine similarity as mentioned by siblings) and train the model such that if texts X, Y are similar (X ~ Y) as defined by humans, then f(X) ~ f(Y) (where f is the neural network).


You might be thinking of fine-tuning; check out the alpaca-lora documentation.

Embeddings are just floating-point array representations of the underlying text, where 'tokens' that are often used together end up close to each other in the vector space. You can generate a vector representation of any text very easily using the OpenAI APIs, or frameworks like langchain https://github.com/openai/openai-cookbook/blob/main/examples...

Similarity means your text gets turned into an array of numbers and the vector database finds the closest matches, potentially across a huge database of text documents. Vector databases are often used in conjunction with LLMs, for example to pull out all the snippets of relevant text and then feed them into the prompt along with your question.
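Something like this minimal sketch, where embed(), search_vector_db(), and call_llm() are hypothetical placeholders rather than any real API:

```python
def build_prompt(question: str, snippets: list[str]) -> str:
    # Assemble retrieved snippets as context, with the user's question at the end.
    context = "\n\n".join(snippets)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical usage:
# snippets = search_vector_db(embed(question), top_k=3)   # nearest text chunks from the vector DB
# response = call_llm(build_prompt(question, snippets))   # hand the assembled prompt to the LLM
print(build_prompt("How do I rotate my API key?", ["Keys are rotated under Settings > Security."]))
```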


The relevant embeddings here are computed by a neural network on high-dimensional input, like an image or a string of text. For instance, https://openai.com/research/clip. So they are not learned by backpropagating to the embedding itself (which is a common approach when you have a relatively small, fixed number of possible inputs, so you can just put your embeddings in a table). The embedding network will of course be trained by backpropagation, using some form of self-supervised learning.


>Is there some emerging practice on how to determine how close is close "enough" between multiple vectors for the purpose of similarity searches?

IMO you can't avoid empirically evaluating which threshold is close enough. Two phrases or encoded entities may have 0.98 similarity with one embedding model and 0.59 with another. It is entirely determined by the weights and model you are using.


> - It seems like these embeddings are learned during the training phase. Presumably by backpropagating to the embedding during each epoch?

As far as I understand it, your first part is accurate.

> - Given that they are learned, doesn't that mean that they are completely context specific to a given trained model? Ie, they can't readily be shared on their own?

Yes, but you only need that specific model in order to generate new embeddings to add to or search the existing store. You can then use whatever is found however you want. Think of the vector store as a searchable database, where the vector is unique to the exact text that was embedded (a bit like a hash function, I believe, though I think without the possibility of collisions). You can search and compare vectors, then use a vector to determine what text was embedded (with simple relational queries, or by storing the text in the same row as the vector).

It means you can store chunks of information as embeddings in a vector database, search for content similar to a query, and get back chunks of text related to what you searched for.
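For example, a rough psycopg2 sketch of that pattern with pgvector; the connection string, table name, and placeholder embedding are all made up for illustration:

```python
import psycopg2

# Hypothetical connection string and table name -- adjust for your setup.
conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")

def to_pgvector(vec):
    # pgvector accepts the text form '[0.1,0.2,...]'
    return "[" + ",".join(str(x) for x in vec) + "]"

fake_embedding = [0.1] * 1536  # placeholder; a real embedding model would produce this

with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id bigserial PRIMARY KEY,
            content text,             -- the original text lives in the same row...
            embedding vector(1536)    -- ...as its embedding
        )
    """)
    cur.execute(
        "INSERT INTO chunks (content, embedding) VALUES (%s, %s)",
        ("Invoices can be downloaded from the Billing page.", to_pgvector(fake_embedding)),
    )
    # '<=>' is pgvector's cosine-distance operator: smallest distance = most similar.
    cur.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 3",
        (to_pgvector(fake_embedding),),
    )
    print(cur.fetchall())
```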

> - What does "similar" mean here? Is there some emerging practice on how to determine how close is close "enough" between multiple vectors for the purpose of similarity searches? Is this too determined by the model weights somehow?

Cosine similarity is the most commonly used measure recently, because it's what OpenAI recommends; there's also the dot product. Both are calculations for finding the closest vector(s) across all the dimensions of the embedding vector. A vector on a 2D plane has 2 dimensions; some models embed into 700-800 dimensions, and OpenAI's embeddings use 1536 dimensions.

Imagine comparing a dozen 2D vectors to find the three with the closest angle to a chosen one. It's fairly easy to overlay them, and if they point in the same direction they are likely related. Doing the same in 1536 dimensions can't be visualized, but there is still an angle between the vectors, computable from the dot product. It's this mathematical similarity that is used.
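One small note: if the vectors are normalized to unit length (as many embedding APIs reportedly do), cosine similarity and the dot product give the same number. A quick numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1536)
b = rng.standard_normal(1536)

# Normalize both vectors to unit length.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
print(np.isclose(cosine, dot))  # True: for unit-length vectors the two coincide
```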

> I hope I'm not missing something fundamental with these questions. Or maybe I hope I am missing something and someone points out my errors. That's good too. :)

The Postgres vector store comes into play after an embedding model has already been trained and you have a use for the embedded data: helping customers shop for similar items based on their search and returning descriptions, or taking a user question and finding context to feed to an LLM for few-shot prompting.


Hopefully DigitalOcean follows soon with pgvector support in its hosted offering.


It already has it; try loading the extension.
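Something along these lines should confirm it (the connection string is a made-up example for a managed instance):

```python
import psycopg2

# Made-up connection string for a managed DigitalOcean Postgres instance.
conn = psycopg2.connect("postgresql://doadmin:password@db-host:25060/defaultdb?sslmode=require")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")  # enables the pgvector type and operators
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector'")
    print(cur.fetchone())  # e.g. ('0.4.4',) if the extension is available
```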


Good catch! Worked for me with Postgres 15. Did not work for me with Postgres 12.


I’m surprised it’s taken this long.


Thanks again to the Supabase team, for constantly living up to their open source ethos.


I think Supabase generally does good work, but I don't think they can be given credit for pgvector, if that's what you're indicating (I might have misread).

As I understand it, Andrew Kane is the principal author of pgvector, and had worked on it for almost two years before Supabase added support for it.

See also https://github.com/pgvector/pgvector/issues/54 and https://github.com/supabase/postgres/pull/472.


It would be great to have a comparison matrix with other vector databases.


This is a marketing piece from a cloud services vendor; if they provide a comparison matrix, the matrix will support a decision to purchase one or more of this vendor's services.


The post repeatedly claims that pgvector is “efficient”.


Not sure why you're being downvoted. IIUC, pgvector uses a clustering implementation similar to the ones in FAISS. These are pretty simple and straightforward to implement but do not give the best performance. For more on the current SOTA, which is primarily graph-based algorithms like HNSW and Vamana, I would check out https://big-ann-benchmarks.com/
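For reference, a rough sketch of creating pgvector's clustering-based (ivfflat) index from Python; the connection string and table/column names are assumptions:

```python
import psycopg2

# Hypothetical connection; assumes a 'chunks' table with an 'embedding vector(1536)' column.
conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
with conn, conn.cursor() as cur:
    # ivfflat clusters the vectors into 'lists' and only scans the nearest clusters
    # at query time -- an approximate (not exact) nearest-neighbor search.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS chunks_embedding_idx
        ON chunks USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 100)
    """)
    # 'probes' trades recall for speed: more probes = more clusters scanned per query.
    cur.execute("SET ivfflat.probes = 10")
```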


I haven't actually used it yet, but it can be very efficient to compute right next to where your data is. A lot of inefficiency comes into play when you're marshalling data around between servers.

In my opinion, the use case is not to install this on your OLTP database. You want this on some side server where you could apply change data capture.


Vector?

Silicon Valley and the FDA, please get in a room and decide on only one of these:

1- Find an effective drug for weak memory and attention in the modern world.

2- Stop inventing new terms.


wat? The topic is pretty clearly vectors in the mathematical sense (1), as coined by Hamilton in 1846 (2), which predates the FDA by 60 years (3)

(1) https://github.com/pgvector/pgvector

(2) https://www.maths.tcd.ie/pub/HistMath/People/Hamilton/OnQuat...

(3) https://www.fda.gov/about-fda/changes-science-law-and-regula...


Ok sorry


Vector is an old term going back to at least the 19th century with its use by Josiah Willard Gibbs and Oliver Heaviside.

Do you have a preferred term for points in an N-space?



