
There's a lot of stuff in this release.

Don't miss the new llm-cluster plugin, which can both calculate clusters from embeddings and use another LLM call to generate a name for each cluster: https://github.com/simonw/llm-cluster

Example usage:

Fetch all issues, embed them and store the embeddings and content in SQLite:

    paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
      | jq '[.[] | {id: .id, title: .title}]' \
      | llm embed-multi llm-issues - \
        --database issues.db \
        --model sentence-transformers/all-MiniLM-L6-v2 \
        --store

Group those into 10 clusters and generate a summary for each one using a call to GPT-4:

    llm cluster llm-issues --database issues.db 10 --summary --model gpt-4
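For intuition, the clustering step comes down to running something like k-means over the stored embedding vectors. Here's a toy numpy sketch of that idea; the 3-dimensional "embeddings" are invented for illustration, and the plugin's actual implementation may differ:

```python
import numpy as np

def kmeans(vectors, k, iterations=20, seed=0):
    """Toy k-means: group embedding vectors into k clusters."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen vectors as initial centroids
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each vector to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors
        for i in range(k):
            members = vectors[labels == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
    return labels

# Two obvious blobs of made-up 3-d "embeddings"
vectors = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                    [5.0, 5.0, 5.0], [5.1, 5.0, 5.0]])
labels = kmeans(vectors, k=2)
```

The `--summary` step then just feeds a sample of each cluster's text to the model and asks for a short name.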


Very curious to follow your journey through embedding search.

If I want 100 close matches that match a filter, is it better to filter first then find vector similarity within that, or find 1000 similar vectors and then filter that subset?


I experimented with that a few months ago. Building a fresh FAISS index for a few thousand matches is really quick, so I think it's often better to filter first, build a scratch index and then use that for similarity: https://github.com/simonw/datasette-faiss/issues/3

... although on thinking about this more I realize that a better approach may well be to just filter down to ~1,000 and then run a brute-force score across all of them, rather than messing around with an index.
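The filter-then-brute-force approach is easy to sketch: keep only the rows that pass your metadata filter, then cosine-score the query against that subset directly, no index at all. A minimal numpy version (the ids and 2-d vectors here are hypothetical stand-ins for filtered rows from the database):

```python
import numpy as np

def top_matches(query, vectors, ids, n=100):
    """Brute-force cosine similarity: score every candidate, return the best n."""
    # Normalize so a plain dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    order = np.argsort(-scores)[:n]
    return [(ids[i], float(scores[i])) for i in order]

# Hypothetical: embeddings for rows that already passed a filter
ids = ["issue-1", "issue-2", "issue-3"]
vectors = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
query = np.array([1.0, 0.1])

print(top_matches(query, vectors, ids, n=2))
```

For a ~1,000-row subset this is a single matrix-vector product, which is why skipping the scratch index is attractive.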


For 1000 points, brute force is super quick. Actually, up to 100k (on my machine), brute force takes less than 1 second.


> jq '[.[] | {id: .id, title: .title}]'

Can be simplified to:

    jq 'map({id, title})'



