Can you go into more detail on how it works or provide some references for artic...

jnnnthnn · on May 2, 2024

It is mostly RAG, although I suppose that doesn't say much about the system: one thing I've found is that the way you clean and process the data substantially changes the quality of the results. I'll write a little blog post sharing some of the learnings!

If you feel up for it, you should share your email in the righthand "Unhappy with your results?" widget. My plan is to manually look into the disappointing searches and follow-up with better results for folks, in addition to fixing whatever can be fixed.

Agreed re: searching comments (which it indeed currently doesn't do).

pgillian · on May 2, 2024

i am not surprised that 'the way you clean and process the data substantially changes the quality of the results.' can you share anything about your approach here?

jnnnthnn · on May 2, 2024

I'll write up a little blog post once the traffic dies down a bit!

In the meantime, one thing that comes to mind is that simply embedding the whole contents of the webpages after scraping them didn't yield very good search results. As an example, an article about Python might only mention Python by name once. I found that trimming extraneous strings (e.g. menus, share links), and then extracting key themes + embedding those directly yielded much, much better results.

jnnnthnn · on May 7, 2024

Blog post is now live: https://jnnnthnn.com/building-hacker-search.html

zarathustreal · on May 2, 2024

In our RAG pipeline we found that implementing HyDE also made a huge difference, maybe generating and embedding hypothetical user search queries (per document) would help here

pbronez · on May 2, 2024

HyDE apparently means “Hypothetical Document Embeddings”, which seems to be a kind of generative query expansion/pre-processing

https://arxiv.org/abs/2212.10496

https://github.com/texttron/hyde

From the abstract:

Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder~(e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step ground the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details.