Very curious to follow your journey through embedding search.
If I want 100 close matches that match a filter, is it better to filter first then find vector similarity within that, or find 1000 similar vectors and then filter that subset?
I experimented with that a few months ago. Building a fresh FAISS index for a few thousand matches is really quick, so I think it's often better to filter first, build a scratch index and then use that for similarity: https://github.com/simonw/datasette-faiss/issues/3
... although, thinking about this more, I realize that a better approach may well be to just filter down to ~1,000 rows and then run a brute-force similarity score across all of them, rather than messing around with an index.
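A minimal sketch of that brute-force approach in Python, assuming the filtered rows and their embedding vectors have already been pulled out of SQLite into memory (the function and variable names here are just illustrative):

    import numpy as np

    def top_matches(query_embedding, candidates, n=100):
        # candidates: list of (id, vector) pairs - e.g. the ~1,000 rows
        # that survived the SQL filter, so no index is needed at all
        ids = [item_id for item_id, _ in candidates]
        matrix = np.array([vec for _, vec in candidates], dtype=np.float32)
        query = np.array(query_embedding, dtype=np.float32)

        # Cosine similarity: dot product of unit-normalized vectors
        matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
        query = query / np.linalg.norm(query)
        scores = matrix @ query

        # Return the n highest-scoring ids, best first
        best = np.argsort(scores)[::-1][:n]
        return [(ids[i], float(scores[i])) for i in best]

At a thousand or so vectors that matrix multiply is effectively instant, so maintaining a separate index buys very little.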
Don't miss the new llm-cluster plugin, which can both calculate clusters from embeddings and use another LLM call to generate a name for each cluster: https://github.com/simonw/llm-cluster
Example usage:
Fetch all issues, embed them and store the embeddings and content in SQLite:
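Roughly like this, following the pattern in the llm-cluster README (the repository, the llm-issues collection name and the issues.db filename are just examples, and it assumes paginate-json and jq are installed):

    paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
      | jq '[.[] | {id: .id, title: .title}]' \
      | llm embed-multi llm-issues - \
        --database issues.db --store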
Group those into 10 clusters and generate a summary for each one using a call to GPT-4:
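Something like this, again based on the llm-cluster README (the flag names are from memory and may differ slightly in the current release):

    llm cluster llm-issues 10 \
      --database issues.db \
      --summary \
      --model gpt-4

The --summary option is what triggers the extra LLM call that names and describes each cluster; without it you just get the cluster members.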