At my current company, we use Postgres with pgvector, so the text is co-located with the embeddings in the same rows. At first, I was a bit apprehensive about getting so close to the nitty-gritty technical details of computing vector embeddings and doing cosine similarity matching, but it has actually been wonderful. There is something magical about working directly with embeddings, and computing, serializing and storing everything yourself is surprisingly simple. Don't let the magic scare you.
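To make that concrete, here's a minimal sketch of the whole loop in Python: compute an embedding, serialize it into a pgvector column, and query by cosine similarity. The psycopg 3 driver, the table name, and OpenAI's text-embedding-3-small model are all my illustrative assumptions; the actual setup may differ.

```python
# Sketch: embed text, store it next to the row in Postgres/pgvector,
# then retrieve the most similar rows by cosine distance.
# Assumes psycopg 3, the pgvector extension, and OpenAI embeddings
# (text-embedding-3-small, 1536 dims) -- all illustrative choices.
import psycopg
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding  # 1536 floats

def to_pgvector(vec: list[float]) -> str:
    # pgvector accepts a '[x,y,z]' text literal; the pgvector-python
    # package can register an adapter instead, but this keeps it explicit.
    return "[" + ",".join(map(str, vec)) + "]"

with psycopg.connect("dbname=mydb") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id        bigserial PRIMARY KEY,
            body      text NOT NULL,
            embedding vector(1536) NOT NULL
        )
    """)

    body = "Postgres keeps the text and its embedding on the same row."
    conn.execute(
        "INSERT INTO docs (body, embedding) VALUES (%s, %s::vector)",
        (body, to_pgvector(embed(body))),
    )

    # <=> is pgvector's cosine distance operator (1 - cosine similarity),
    # so ordering by it ascending returns the most similar rows first.
    q = to_pgvector(embed("co-locating documents with their vectors"))
    rows = conn.execute(
        """
        SELECT body, 1 - (embedding <=> %s::vector) AS cosine_similarity
        FROM docs
        ORDER BY embedding <=> %s::vector
        LIMIT 5
        """,
        (q, q),
    ).fetchall()
    for body, sim in rows:
        print(f"{sim:.3f}  {body}")
```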
Recently I've been doing hardcore stuff like taking an old hierarchical clustering library and swapping out its vector distance function for cosine distance (one minus cosine similarity), so that it clusters records by the similarity of their embeddings. It's funny reading the README of that ten-year-old library: it shows how to use it for tedious tasks like grouping together 3-dimensional color vectors, while I'm using it to cluster content by semantic similarity with vectors of over 1.5k dimensions. Somehow, I don't think the library authors saw that coming.
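The post doesn't name the library, but the same distance-function swap can be sketched with SciPy's hierarchical clustering standing in: define cosine distance yourself and hand it to the clusterer in place of the default Euclidean metric.

```python
# Sketch of the distance-function swap, with SciPy as a stand-in for
# the (unnamed) ten-year-old hierarchical clustering library.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 1 - cosine similarity: 0 for identical directions, up to 2 for opposite.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Stand-in for real embeddings: 50 records, 1536 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 1536))

# pdist accepts a callable metric, so the custom distance drops straight in
# (metric="cosine" is the built-in equivalent).
condensed = pdist(embeddings, metric=cosine_distance)

# Average-linkage hierarchical clustering over the condensed distance matrix,
# then cut the tree at a distance threshold to get flat cluster labels.
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=0.9, criterion="distance")
print(labels)
```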
How great is it to come across a library that hasn't been updated in ten years and yet is flexible and simple enough to be repurposed for a radically more advanced use case, one that would have been beyond the author's imagination at the time...
I think the most surprising aspect of the whole experience is that working with the embeddings directly makes it feel like your database is intelligent, but you know it's just a plain old dumb database and all the embeddings were pre-computed.