
"Search the embedding"? Could you elaborate on this, it sounds interesting!


I think OP means to filter the user input through an LLM with “convert this question into a keyword list” and then calculating the embedding of the LLM’s output (instead of calculating the embedding of the user input directly). The “search the embedding” is the normal vector DB part.
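Roughly, something like this (just a sketch of that flow; `call_llm` is a placeholder for whatever chat/completion API you use, and the sentence-transformers model is only an example):

```python
# Sketch of "keywords first, then embed" query handling.
# call_llm() is a stand-in for your LLM client; the model name and
# prompt wording are illustrative, not prescriptive.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def embed_query(user_question: str):
    # 1. Have the LLM rewrite the question as a bare keyword list.
    keywords = call_llm(
        "Convert this question into a short, comma-separated keyword list:\n"
        + user_question
    )
    # 2. Embed the keywords (not the raw question); this vector is what
    #    goes to the vector DB ("search the embedding" as usual).
    return embedder.encode(keywords)
```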


"Query expansion"[0] has been an information retrieval technique for a while, but using LLMs to help with query expansion is fairly new and promising, e.g. "Query Expansion by Prompting Large Language Models"[1], and "Query2doc: Query Expansion with Large Language Models"[2].

[0] https://en.wikipedia.org/wiki/Query_expansion

[1] https://arxiv.org/abs/2305.03653

[2] https://arxiv.org/abs/2303.07678


Ask the LLM to summarize the question, then take an embedding of that.

I think you can do the same with the data you store: summarize it to the same number of tokens, then get an embedding of that summary to save alongside the original text.

Test! Different combinations of summarizing LLM and embedding LLM can give different results. But once you decide, you are locked into the summarizer as much as the embedding generator.

Not sure if this is what the parent meant, though.
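For the document side, a minimal sketch of what I have in mind (the `call_llm` helper and the record layout are just assumptions, not any particular library's API):

```python
# Summarize-then-embed at indexing time: the summary drives retrieval,
# the original text is what you actually return to the user.
def index_document(doc_text: str, embedder, call_llm):
    summary = call_llm(
        "Summarize the following text in roughly 100 tokens:\n" + doc_text
    )
    return {
        "embedding": embedder.encode(summary),  # searched by the vector DB
        "summary": summary,                     # kept for inspection/debugging
        "text": doc_text,                       # returned on a hit
    }
```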


I could not help but notice that the Contriever curve sits much higher on the recall axis than the other methods (figure 11 in https://arxiv.org/pdf/2307.03172.pdf).

Has anyone come across more recent experiments, results, or papers related to this? I'm acquainted with:

- the Contriever 2021 paper https://aclanthology.org/2021.eacl-main.74.pdf

- HyDE 2022 https://arxiv.org/pdf/2212.10496.pdf

My suspicion is that some pre-logic helps, such as checking whether the user's question is dense enough and, if so, using HyDE with the chat history. If anyone has more recent experience with Contriever, I would love to learn more about it!

Feel free to contact me directly on LinkedIn. https://www.linkedin.com/in/christybergman/


BTW: I think of this like asking someone to put things into their own words, which makes it easier for them to remember. Matching on your way of talking can be awkward from the LLM's point of view, so use its point of view instead!


It is two different language models. The embedding model tries to capture too many irrelevant aspects of the prompt, which ends up placing it close to seemingly random documents. Inverting the question into the LLM's blind guess and distilling that down to keywords makes the embedding very sparse and specific. A popular strategy has been to invert the documents into questions during initial embedding, but I think that is a performance hack that still suffers from sentence prompts being bad vector indexes.
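To make the query-side inversion concrete, a rough sketch (the helpers are placeholders, not any particular library's API):

```python
# Query-side inversion: embed the LLM's guessed answer (distilled to keywords),
# not the user's raw question. call_llm() is a stand-in for your LLM client.
def invert_and_embed(question: str, embedder, call_llm):
    # 1. Blind guess: answer the question without any retrieved context.
    guess = call_llm("Answer this question as best you can:\n" + question)
    # 2. Distill the guess to keywords so the embedding stays sparse and specific.
    keywords = call_llm(
        "Reduce this answer to a comma-separated keyword list:\n" + guess
    )
    # 3. This vector, not the raw question's, is sent to the vector DB.
    return embedder.encode(keywords)
```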


You can use Llama 2 for embeddings, summaries, and chat.

Turning the docs into questions is something I will test (I'm just learning and getting a feel for it).

I am intrigued... what makes a good vector index??


My heuristic is how much noise is in the closest vectors. Even if the top-k matches seem good, if the matches that follow have practically identical distance scores, it is going to fail a lot in practice. Ideally you could calculate some constant threshold so that everything closer is relevant and everything further is irrelevant.
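To make that concrete, here is a rough sketch of how I'd measure the separation (my reading of the heuristic, not an established metric; the threshold you compare it against is dataset-specific):

```python
import numpy as np

def retrieval_noise_gap(scores, k=5):
    """scores: similarity scores of the nearest neighbours.
    Returns the gap between the worst of the top-k and the best of the rest.
    A gap near zero means the 'relevant' hits are barely separated from noise.
    """
    scores = np.sort(np.asarray(scores))[::-1]  # sort descending
    if len(scores) <= k:
        return float("nan")  # not enough neighbours to judge
    return float(scores[k - 1] - scores[k])
```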


Apologies for being naive, but how do you calculate noise?



