
I work for a vector database company (Pinecone) and can confirm that most of the mind-blowing built-with-ChatGPT products you see launching every eight-ish hours are using this technique that Steve describes. That is, embedding internal data using an LLM, loading it into a vector database like Pinecone, then querying the vector DB for the most relevant information to add into the context window. And since adding more context with each prompt results in higher ChatGPT costs and latencies, you really want to find the smallest and most relevant bits of context to include. In other words, search quality matters a lot.

Edit to add: This was an aside in the post but actually a big deal... With this setup you can basically use an off-the-shelf LLM (like GPT)! No fine-tuning (and therefore no data labeling shenanigans), no searching for an open-source equivalent (and therefore no model-hosting shenanigans), no messing around with any of that. That's how, in case you're wondering, Shopify and HubSpot can launch their chatbots into production in practically a week.



This technique is no secret. It's officially mentioned in OpenAI's whitepapers, docs, and code samples on how to use GPT in a real-world workflow.


Not so secret, and also precisely how LangChain [1] and GPT Index (Llama Index) [2] got so popular. Here's a quick rundown:

0) You can't add new data to current LLMs. That is, you can't train them on additional data; fine-tuning is better left for teaching the structure of the language or task.

1) To add an external corpus of data to an LLM, you need to fit it into the prompt.

2) Some documents or corpora are too large to fit into a prompt because of token limits.

3) You can obtain relevant chunks of context by creating an embedding of the query and finding the top k most similar chunk embeddings.

4) Stuff as many of the top k chunks as you can into the prompt and run the query.
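
For what it's worth, steps 3 and 4 can be sketched in a few lines of Python. This is only a toy under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `count_tokens` counts whitespace-separated words instead of using a real tokenizer, and the function names, sample chunks, and `token_budget` default are all made up for illustration. A real setup would swap in an embedding model and a vector DB for the linear scan.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def count_tokens(text):
    """Crude token estimate (whitespace words); a real tokenizer will differ."""
    return len(text.split())


def top_k_chunks(query, chunks, k):
    """Step 3: rank chunks by similarity to the query embedding, keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks, k=3, token_budget=50):
    """Step 4: stuff as many of the top k chunks as fit under the token budget."""
    selected = []
    used = count_tokens(query)
    for chunk in top_k_chunks(query, chunks, k):
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            continue  # too big; a smaller, lower-ranked chunk may still fit
        selected.append(chunk)
        used += cost
    context = "\n".join(selected)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical sample data, purely for illustration.
chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "To request a refund, open the orders page and click return.",
]
prompt = build_prompt("How do I request a refund?", chunks, k=2)
print(prompt)
```

Lowering `token_budget` drops the lower-ranked chunks first, which is the cost/latency tradeoff mentioned upthread: the better the ranking, the less context you need to pay for.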

Now, here's where it gets crazier.

1) Imagine you have an LLM with a token limit of 8k tokens.

2) Split the original document or corpus into 4k token chunks.

3) Imagine that the leaf nodes of a "chunk tree" are set to these 4k chunks.

4) You run your query by summarizing these nodes, pair-wise (two at a time), to generate the parent nodes of the leaf nodes. You now have a layer above the leaf nodes.

5) Repeat until you reach a single root node. That node is the result of tree-summarizing your document using LLMs.

This second approach makes many more calls to the LLM and comes with its own tradeoffs, and it is essentially what Llama Index is about. The first approach lets you compute embeddings once and make fewer calls to the LLM.
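
A rough sketch of the tree summarization described above, not Llama Index's actual implementation. The `summarize` callable is a hypothetical stand-in for an LLM call; a real one would send both texts (plus the query) in one prompt and would need to return something short enough, e.g. under 4k tokens, that two parent summaries still fit in the 8k window. Carrying an unpaired node up unchanged is my own assumption for odd-sized layers.

```python
from typing import Callable, List


def tree_summarize(chunks: List[str], summarize: Callable[[str, str], str]) -> str:
    """Summarize chunks pair-wise, layer by layer, until one root node remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    level = list(chunks)  # leaf nodes of the chunk tree
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                # One LLM call per pair of sibling nodes.
                parents.append(summarize(level[i], level[i + 1]))
            else:
                # Odd node out: carried up to the next layer unchanged.
                parents.append(level[i])
        level = parents
    return level[0]


# Fake "LLM" that records how often it is called and shows the tree shape.
call_log = []


def fake_summarize(left: str, right: str) -> str:
    call_log.append((left, right))
    return f"({left}+{right})"


root = tree_summarize(["a", "b", "c", "d", "e"], fake_summarize)
print(root)           # (((a+b)+(c+d))+e)
print(len(call_log))  # 4
```

For n leaf chunks this makes n - 1 summarize calls per query, versus a single LLM call (plus one embedding lookup) for the retrieval approach.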

[1] https://langchain.readthedocs.io/en/latest/

[2] https://gpt-index.readthedocs.io/en/latest/guides/index_guid...


Can you provide a link to these docs/code samples?



thank you


How do I calculate the embedding if I have, let's say, the LLaMA 7B weights in Hugging Face format?

I cannot use third-party APIs like OpenAI for obvious reasons.


You're replying to a VP of Marketing, not sure what you're expecting here. This subthread is just an ad for Pinecone if you didn't already realize that.


You can calculate them yourself as well! Hugging Face has a great article on this: https://huggingface.co/blog/getting-started-with-embeddings

tl;dr, use: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...


Thanks, but I already worked with this model and it was not good at all for my domain. Therefore, I wanted to fine-tune LLaMA for my domain and then use LLaMA for embeddings. Should I fine-tune this model then?


(I want to focus more attention on that "tl;dr", which I would argue is carrying a lot of load in that response: the high-level answer to how one does this using the llama weights is "you don't, as that isn't the right kind of model; you need to use a different model, of which there are many".)


So based on this logic, do Google and Facebook have the biggest potential competitive advantage?


I'd say Microsoft. And they've been demonstrating that quite well.


I agree they seem the most active of big tech so far, but in terms of “data moat” competition they are supposed to be behind, as this is not the foundation of their business.


What do you mean by "data moat"? I would imagine that the Bing index is not much smaller than the Google index, if that's what you mean.


I believe in this context "data moat" refers to data they have that other companies can't access. Microsoft has huge amounts of email and other data in Office365. And this has a clear path to monetization since they already have paying customers for Office.

Other moats IMO are Google's with Android and Chrome. And MS possibly with Windows?


Not to mention GitHub for code.


SharePoint!


I think it's a combination of data, LLM quality, embedding search quality, and creativity.



