I work for a vector database company (Pinecone) and can confirm that most of the...

siva7 · on March 23, 2023

This technique is no secret; it's officially mentioned in OpenAI's whitepapers, docs, and code samples on how to use GPT in a real-world workflow.

sjnair96 · on March 23, 2023

Not so secret, and also precisely how LangChain [1] and GPT Index (Llama Index) [2] got so popular. Here's a quick rundown:

0) You can't add new data to current LLMs, meaning you can't train them on additional data. Fine-tuning is better left for teaching the structure of the language or task.

1) To add an external corpus of data to an LLM, you need to fit it into the prompt.

2) Some documents or corpora are too large to fit into a prompt because of token limits.

3) You can obtain relevant chunks of context by creating an embedding of the query and finding the top k most similar chunk embeddings.

4) Stuff as many of the top k chunks as you can into the prompt and run the query.
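
The steps above can be sketched roughly as follows. This is a minimal illustration, not how any particular library does it: `embed` is a toy bag-of-words stand-in for a real embedding model, the token budget is approximated by counting words, and all function names are made up for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model (assumption): word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Embed every chunk once, up front."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def top_k(query, index, k):
    """Return the k chunks whose embeddings are most similar to the query's."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, index, k=3, token_budget=50):
    """Stuff top-k chunks into the prompt until the (word-count) budget runs out."""
    context, used = [], 0
    for chunk in top_k(query, index, k):
        cost = len(chunk.split())  # crude proxy for a real tokenizer
        if used + cost > token_budget:
            break
        context.append(chunk)
        used += cost
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query


chunks = [
    "Pinecone is a vector database for similarity search.",
    "Bananas are rich in potassium.",
    "Embeddings map text to vectors so similar text is close together.",
]
index = build_index(chunks)
query = "How does a vector database search embeddings?"
prompt = build_prompt(query, index, k=2)
print(prompt)
```

In a real setup you would swap `embed` for an actual embedding model and the list scan for a vector index, but the flow (embed chunks once, embed the query, rank, stuff the prompt) stays the same.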

Now, here's where it gets crazier.

1) Imagine you have an LLM with a token limit of 8k tokens.

2) Split the original document or corpus into 4k token chunks.

3) Imagine that the leaf nodes of a "chunk tree" are set to these 4k chunks.

4) You run your query by summarizing these nodes, pair-wise (two at a time), to generate the parent nodes of the leaf nodes. You now have a layer above the leaf nodes.

5) Repeat until you reach a single root node. That node is the result of tree-summarizing your document using LLMs.

This approach makes many more calls to the LLM and comes with its own tradeoffs and advantages, and it is essentially what Llama Index is about. The first approach lets you compute embeddings just once and make fewer calls to the LLM.
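
Here is a rough sketch of the tree-summarizing idea, under some assumptions: `fake_llm_summarize` is a hypothetical stand-in for a real LLM call (a real one could also include your query in its prompt), an odd node at the end of a layer is simply carried up unchanged, and each summary is assumed to stay short enough that two of them fit in the context window again. It is not Llama Index's actual implementation.

```python
def tree_summarize(chunks, summarize):
    """Merge nodes two at a time, layer by layer, until one root remains.

    Returns (root_text, number_of_summarize_calls).
    """
    if not chunks:
        raise ValueError("need at least one chunk")
    layer, calls = list(chunks), 0
    while len(layer) > 1:
        parents = []
        for i in range(0, len(layer), 2):
            pair = layer[i:i + 2]
            if len(pair) == 2:
                parents.append(summarize(pair[0], pair[1]))
                calls += 1
            else:
                parents.append(pair[0])  # odd node out: carried up unchanged
        layer = parents
    return layer[0], calls


def fake_llm_summarize(left, right):
    """Hypothetical LLM stand-in: keeps the first 3 words of each side."""
    return " ".join(left.split()[:3] + right.split()[:3])


leaves = ["a b c d", "e f g h", "i j k l", "m n o p", "q r s t"]
root, calls = tree_summarize(leaves, fake_llm_summarize)
print(root, calls)
```

Every pairwise merge reduces the node count by one, so n leaf chunks cost n - 1 LLM calls per query, versus a single call with the top k approach.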

[1] https://langchain.readthedocs.io/en/latest/

[2] https://gpt-index.readthedocs.io/en/latest/guides/index_guid...

misiti3780 · on March 23, 2023

Can you provide a link to these docs/code samples?

gk1 · on March 23, 2023

https://github.com/openai/openai-cookbook/blob/main/examples...

https://github.com/pinecone-io/examples/blob/master/generati...

https://www.pinecone.io/learn/openai-gen-qa/

https://www.youtube.com/watch?v=tBJ-CTKG2dM&t=787s&ab_channe...

There are more out there but hopefully this gets you started.

misiti3780 · on March 23, 2023

thank you

meghan_rain · on March 23, 2023

How do I calculate the embedding if I have, let's say, the LLaMA 7B weights in Hugging Face format?

I cannot use third-party APIs like OpenAI for obvious reasons.

lolol0lol0l · on March 23, 2023

You're replying to a VP of Marketing, not sure what you're expecting here. This subthread is just an ad for Pinecone if you didn't already realize that.

a5huynh · on March 23, 2023

You can calculate them yourself as well! Hugging Face has a great article on this: https://huggingface.co/blog/getting-started-with-embeddings

tl;dr, use: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...
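
For intuition about what such a model hands back: many sentence-embedding models turn their per-token vectors into one vector by averaging over the non-padding tokens (mean pooling) and then L2-normalizing. The sketch below does that on made-up token vectors. The numbers and function names are purely illustrative and nothing here calls a real model.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# Made-up per-token vectors; the third position is padding (mask == 0).
token_vectors = [[3.0, 0.0], [0.0, 8.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
embedding = l2_normalize(pooled)
print(pooled, embedding)
```

With a real model, the token vectors would come from its last hidden layer and the mask from its tokenizer.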

meghan_rain · on March 24, 2023

Thanks, but I already worked with this model and it was not good at all for my domain. That's why I wanted to fine-tune llama for my domain and then use llama for embeddings. Should I fine-tune this model then?

saurik · on March 23, 2023

(I want to focus more attention on that "tl;dr", which I would argue is carrying a lot of load in that response: the high-level answer to how one does this using the llama weights is "you don't, as that isn't the right kind of model; you need to use a different model, of which there are many".)

thejackgoode · on March 23, 2023

so based on this logic, do Google and Facebook have the biggest potential competitive advantage?

gk1 · on March 23, 2023

I'd say Microsoft. And they've been demonstrating that quite well.

thejackgoode · on March 23, 2023

I agree they seem the most active of big tech so far, but in terms of “data moat” competition they are supposed to be behind, as this is not the foundation of their business.

sebzim4500 · on March 23, 2023

What do you mean by "data moat"? I would imagine that the Bing index is not much smaller than the Google index, if that's what you mean.

alach11 · on March 23, 2023

I believe in this context "data moat" refers to data they have that other companies can't access. Microsoft has huge amounts of email and other data in Office365. And this has a clear path to monetization since they already have paying customers for Office.

Other moats IMO are Google's with Android and Chrome. And MS possibly with Windows?

kristofferR · on March 23, 2023

Not to mention GitHub for code.

fomine3 · on March 24, 2023

SharePoint!

gk1 · on March 23, 2023

I think it's a combination of data, LLM quality, embedding search quality, and creativity.