
There's a lot of stuff in this release.

Don't miss the new llm-cluster plugin, which can both calculate clusters from embeddings and use another LLM call to generate a name for each cluster: https://github.com/simonw/llm-cluster

Example usage:

Fetch all issues, embed them and store the embeddings and content in SQLite:

    paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
      | jq '[.[] | {id: .id, title: .title}]' \
      | llm embed-multi llm-issues - \
        --database issues.db \
        --model sentence-transformers/all-MiniLM-L6-v2 \
        --store

Group those into 10 clusters and generate a summary for each one using a call to GPT-4:

    llm cluster llm-issues --database issues.db 10 --summary --model gpt-4
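For intuition, the clustering step comes down to running something like k-means over the stored embedding vectors. Here's a toy numpy sketch of that idea; the 3-dimensional "embeddings" are invented for illustration, and the plugin's actual implementation may differ:

```python
import numpy as np

def kmeans(vectors, k, iterations=20, seed=0):
    """Toy k-means: group embedding vectors into k clusters."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen vectors as initial centroids
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each vector to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors
        for i in range(k):
            members = vectors[labels == i]
            if len(members):
                centroids[i] = members.mean(axis=0)
    return labels

# Two obvious blobs of made-up 3-d "embeddings"
vectors = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                    [5.0, 5.0, 5.0], [5.1, 5.0, 5.0]])
labels = kmeans(vectors, k=2)
```

The `--summary` step then just feeds a sample of each cluster's text to the model and asks for a short name.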


Very curious to follow your journey through embedding search.

If I want 100 close matches that match a filter, is it better to filter first then find vector similarity within that, or find 1000 similar vectors and then filter that subset?


I experimented with that a few months ago. Building a fresh FAISS index for a few thousand matches is really quick, so I think it's often better to filter first, build a scratch index and then use that for similarity: https://github.com/simonw/datasette-faiss/issues/3

... although on thinking about this more I realize that a better approach may well be to just filter down to ~1,000 and then run a brute-force score across all of them, rather than messing around with an index.
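The filter-then-brute-force approach is easy to sketch: keep only the rows that pass your metadata filter, then cosine-score the query against that subset directly, no index at all. A minimal numpy version (the ids and 2-d vectors here are hypothetical stand-ins for filtered rows from the database):

```python
import numpy as np

def top_matches(query, vectors, ids, n=100):
    """Brute-force cosine similarity: score every candidate, return the best n."""
    # Normalize so a plain dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    order = np.argsort(-scores)[:n]
    return [(ids[i], float(scores[i])) for i in order]

# Hypothetical: embeddings for rows that already passed a filter
ids = ["issue-1", "issue-2", "issue-3"]
vectors = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
query = np.array([1.0, 0.1])

print(top_matches(query, vectors, ids, n=2))
```

For a ~1,000-row subset this is a single matrix-vector product, which is why skipping the scratch index is attractive.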


For 1000 points, brute force is super quick. Actually, up to 100k (on my machine), brute force takes less than 1 second.


> jq '[.[] | {id: .id, title: .title}]'

Can be simplified to:

    jq 'map({id, title})'



