The reason why Langchain is pointless is that it's trying to solve problems on top of technical foundations that just cannot support it.
The #1 learning is that there is no reusability with the current generation of LLMs. We're using GPT-4 and 3.5T exclusively.
Over the last several months, my team has been building several features using highly sophisticated LLM chains that do all manner of reasoning. The ultimate outputs are very human-like to the point where there is some private excitement that we've built an AGI.
Each feature requires very custom handwritten prompts. Each step in the chain requires handwritten prompts. The input data has to be formatted a very specific way to generate good outputs for that feature/chain step. The part around setting up a DAG orchestration to run these chains is like 5% of the work. 95% is really just in the prompt tuning and data serialization formats.
None of this stuff is reusable. Langchain is attempting to set up abstractions to reuse everything. But what we end up with is a mediocre DAG framework where all the instructions/data passing through are just garbage. The longer the chain, the more garbage you find at the output.
We briefly made our own internal Langchain. We have since torn it down. Again, it's not that our library or Langchain was bad engineering. It's just not feasible on top of the foundation models we have right now.
100% this! What is worse is that LangChain hides its prompts away. I had to read the source code and mess with private variables of nested classes just to change a single prompt in something like RetrievalQA. Not only that, the default prompt they use is actually bad. They are lucky things work because GPT-3.5 and GPT-4 are damn smart machines; with open LLMs, things break. I was hoping for good defaults, but they are not. The prompts I wrote over 6 months ago, shortly after the launch of ChatGPT, to do some of the same things work much better.
Would you have anything you can share with us about those "several features using highly sophisticated LLM chains that do all manner of reasoning"? I'm really curious about the challenges, the process and the insights there.
Can you share some insights/examples, if you can, on how you improved the prompts? One set I feel is particularly poor is the next question generation/past question condensation prompts, which are used to refine the user's input based on the history so that the query includes all the context required for the question, thereby incorporating "memory".
Yeah I never know where memory goes exactly in langchain, it's not exactly clear all the time. But sure, the main insight I remember is this, take a look at their MULTI_PROMPT_ROUTER_TEMPLATE: https://github.com/hwchase17/langchain/blob/560c4dfc98287da1...
It's a lot of instructions for an LLM. They seem to forget that an LLM is an auto-completion machine, and what data it was trained on. Using <<>> for sections is not a normal thing; it's not markdown, which is probably what the model has read far more often on the internet. Instead of open JSON comments, why not type signatures? Instead of so many rules, why not give it examples? It is an autocomplete machine!
They are relying too much on the LLM being smart, probably because they only test stuff on GPT-4 and 3.5. With GPT4All models this prompt was not working at all, so I had to rewrite it. For simple routing we don't even need JSON, and carrying the `next_inputs` here is weird if you don't need it.
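To make the "examples instead of rules" point concrete, here is a rough sketch of what an example-driven router could look like. This is not LangChain's prompt or my exact rewrite; the destinations, the example questions and the `complete` callable (anything that takes a prompt string and returns the model's text) are all made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical destinations and few-shot examples, for illustration only.
DESTINATIONS: Dict[str, str] = {
    "physics": "questions about physics",
    "math": "questions about math",
}
EXAMPLES: List[Tuple[str, str]] = [
    ("What is the speed of light?", "physics"),
    ("What is the derivative of x^2?", "math"),
    ("What is the capital of France?", "DEFAULT"),
]


def build_router_prompt(question: str) -> str:
    """Build a markdown-style prompt that ends where the model should complete."""
    lines = ["# Question routing", "", "## Destinations"]
    for name, description in DESTINATIONS.items():
        lines.append(f"- {name}: {description}")
    lines.append("- DEFAULT: anything else")
    lines += ["", "## Examples"]
    for example_question, destination in EXAMPLES:
        lines += [f"Question: {example_question}", f"Destination: {destination}", ""]
    # The prompt stops at "Destination:" so the natural completion is one word.
    lines += [f"Question: {question}", "Destination:"]
    return "\n".join(lines)


def route(question: str, complete: Callable[[str], str]) -> str:
    """Return a known destination name, or "DEFAULT" if the output is unusable."""
    raw = complete(build_router_prompt(question)).strip()
    first_word = raw.split()[0].strip(".,:;\"'") if raw else ""
    for name in DESTINATIONS:
        if first_word.lower() == name:
            return name
    return "DEFAULT"
```

The model only has to finish a single word after "Destination:", so there is no JSON to parse and no `next_inputs` to carry around. Anything unexpected falls back to DEFAULT.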
Much of why this stuff is not reusable is that eventually someone in the NLP world is going to properly migrate the prompt engineering features that the coomers over in stable-diffusion/automatic1111 land have "pioneered", such as token weighting, negative prompts, token averaging, etc. Literally all of these techniques work with regular LLMs (if you don't believe me, see here: https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...). NLP folks just haven't built the right tooling for it. Particularly sad since there's supposed to be an "Automatic1111 for LLMs" project called "Oobabooga", but it doesn't have any of the good features.
The future of LLM prompting will involve highly specialized and engineered prompts, much as is the case with most images seen on civit.ai.
We are all likely to eventually throw away a lot of our current prompts.
Automatic1111 is the domain of Jupyter - desktop experimentation. When you go into production, there are tons of additional pieces of complexity that start hitting you - like prompt routing. So the problem space is different.
We have a simple concept - Generative AI is config management. We model it on top of config management grammar that is proven to work in large production config - jsonnet.
100% agreed. I've used GPT professionally and we would try out different hosts, AI21, etc., and there were always clear quality issues with just re-using your prompt and hyperparameters. Some of that was down to other models being of lesser quality, but we'd also need to re-tune prompts when upgrading to new OpenAI models for the best effect. It turns out that LLMs aren't quite a commodity.
This is precisely why open source models will be limited. Most of the capabilities distinguishing GPT and later Gemini are emergent behaviors from the large parameter count that the open source community is saying is not needed (at least for now).
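One way to make that explicit in code, as a rough sketch rather than what we actually ran: key every prompt and its hyperparameters by (feature, model) and refuse to fall back silently. The feature name, templates and temperatures below are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PromptConfig:
    template: str       # handwritten prompt, tuned for one specific model
    temperature: float  # hyperparameters are tuned together with the prompt


# Hypothetical registry: one entry per (feature, model) pair.
PROMPTS: Dict[Tuple[str, str], PromptConfig] = {
    ("summarize", "gpt-4"): PromptConfig(
        "Summarize the text below in two sentences.\n\n{text}", 0.2
    ),
    ("summarize", "gpt-3.5-turbo"): PromptConfig(
        "Text:\n{text}\n\nTwo-sentence summary:", 0.0
    ),
}


def get_prompt(feature: str, model: str) -> PromptConfig:
    """Look up the tuned prompt; fail loudly instead of reusing another model's."""
    config = PROMPTS.get((feature, model))
    if config is None:
        raise ValueError(
            f"No tuned prompt for feature={feature!r} on model={model!r}; "
            "re-tune before switching models."
        )
    return config
```

The point of raising instead of defaulting is that a model swap shows up as a missing entry you have to tune, not as a quiet drop in quality.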
That's part of the reason why we need LLMs to run locally (on our own or rented infrastructure). Another reason is protecting the company IP. None of the medium/large corporations want their IP to be leaked to AI providers.
How do you deal with the prompt iteration phase, and how coupled is that to the DAG phase? I've only worked on a few proofs of concept in this phase, but a thing I struggled with was a strong desire to allow non-technical colleagues to mess with the prompts. It wasn't clear to me how much the prompts need to evolve in tandem with the DAG and how much they can exist separately.
There are a few increasingly harder things when it comes to prompt customization:
1. Prompts ask LLM to generate input for the next step
2. Prompts ask LLM to generate instructions for the next step
3. Prompts ask LLM to generate the next step
Doing #3 across multiple steps is the promise of Langchain, AutoGPT et al. Pretty much impossible to do with useful quality. Attempting to do #3 very often either ends up completing the chain too early, or just spinning in a loop. Not the kind of thing you can optimize iteratively to good enough quality at production scale. "Retry" as a user-facing operation is just stupid IMO. Either it works well, or we don't offer it as a feature.
So we stopped doing #3 completely. The features now have a narrow use case and a fully-defined DAG shape upfront. We feed some context on what all the steps are to every step, so it can understand the overall purpose.
#2, we tune these prompts internally within the team. It's very sensitive to specific words. Even things like newlines affect quality too much.
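A minimal sketch of that shape, assuming a linear two-step pipeline for simplicity. The step names, the prompt wording and the `complete` callable (prompt string in, model text out) are placeholders, not our production code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Step:
    name: str
    purpose: str
    prompt: str  # handwritten template using {overview} and {input}


# Hypothetical pipeline: the shape is fixed in code, the LLM never picks a step.
PIPELINE: List[Step] = [
    Step("extract", "pull the key facts out of the raw text",
         "{overview}\n\nYou are on step 'extract'.\nText:\n{input}\nFacts:"),
    Step("summarize", "turn the facts into a short summary",
         "{overview}\n\nYou are on step 'summarize'.\nFacts:\n{input}\nSummary:"),
]


def overview(steps: List[Step]) -> str:
    """Describe the whole pipeline so every step can see the overall purpose."""
    lines = ["This task runs these steps in order:"]
    lines += [f"{i}. {s.name}: {s.purpose}" for i, s in enumerate(steps, 1)]
    return "\n".join(lines)


def run_pipeline(data: str, complete: Callable[[str], str]) -> Dict[str, str]:
    """Run every step in its fixed order, feeding each output to the next step."""
    context = overview(PIPELINE)
    outputs: Dict[str, str] = {}
    current = data
    for step in PIPELINE:
        prompt = step.prompt.format(overview=context, input=current)
        current = complete(prompt).strip()
        outputs[step.name] = current
    return outputs
```

Since the order is hard-coded, the chain cannot finish early or spin in a loop. The only things left to tune are the prompt text and how `input` is serialized for each step.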
#1 - we've found it's doable for non-tech folks. In some of the features, we expose this to the user somewhat as additional context and mix that in with the pre-built instructions.
So #2 is where it's both hard to get right and still solvable. Every prompt change has to be tested with a huge number of full-chain invocations on real input data before it can be accepted and stabilized. The evaluation of quality is all human, manual work. We tried some other semi-automated approaches, but they were just not feasible.
All of this is why there is no way Langchain or anything like it is currently useful to build actually valuable user-facing features at production scale.
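The mechanical part of that loop can at least be scripted. Here is a sketch, assuming the baseline chain and the candidate chain are each wrapped as a plain function from input text to final output. The column names are made up, and the verdict column is deliberately left blank for a person to fill in.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List

FIELDS = ["case", "input", "baseline", "candidate", "changed", "human_verdict"]


def build_review_rows(
    inputs: Iterable[str],
    run_baseline: Callable[[str], str],
    run_candidate: Callable[[str], str],
) -> List[Dict[str, object]]:
    """Run both full chains on every real input and pair up the outputs."""
    rows: List[Dict[str, object]] = []
    for case, text in enumerate(inputs):
        baseline = run_baseline(text)
        candidate = run_candidate(text)
        rows.append({
            "case": case,
            "input": text,
            "baseline": baseline,
            "candidate": candidate,
            "changed": baseline != candidate,  # lets reviewers skip identical outputs
            "human_verdict": "",               # filled in manually, never by code
        })
    return rows


def to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize the rows as a CSV sheet that reviewers can open and annotate."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

This only collects the side-by-side outputs. It does not judge quality, which stays a human call.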
What if you built a scoring system for re-usable action sequences that are stored in a database, and then have the LLM generate alternate solutions and grade them according to their performance?
An action sequence of steps could be graded according to whether it was successful, its speed, efficiency, cleverness, cost, etc.
You could even introduce human feedback into the process, and pay people for proposing successful and efficient action sequences.
All action sequences would be indexed and the AI agent would be able to query the database to find effective action sequences to chain together.
The more money you throw at generating, iterating, and evolving various action sequences stored in your database, the smarter and more effective your AI agent becomes.
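As a toy sketch of what I mean, with an in-memory dict standing in for the database. The scoring formula (success rate minus a cost penalty), the weight and the task and step names are all assumptions for illustration, not a tested design.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class SequenceStats:
    steps: Tuple[str, ...]
    runs: int = 0
    successes: int = 0
    total_cost: float = 0.0

    def score(self, cost_weight: float = 0.1) -> float:
        """Success rate minus a penalty for average cost (assumed formula)."""
        if self.runs == 0:
            return 0.0
        success_rate = self.successes / self.runs
        average_cost = self.total_cost / self.runs
        return success_rate - cost_weight * average_cost


class SequenceStore:
    """Indexes action sequences by task and ranks them by observed performance."""

    def __init__(self) -> None:
        self._by_task: Dict[str, Dict[Tuple[str, ...], SequenceStats]] = {}

    def record(self, task: str, steps: Sequence[str], success: bool, cost: float) -> None:
        """Log one execution of a sequence, whether LLM-proposed or human-proposed."""
        key = tuple(steps)
        stats = self._by_task.setdefault(task, {}).setdefault(key, SequenceStats(key))
        stats.runs += 1
        stats.successes += int(success)
        stats.total_cost += cost

    def best(self, task: str, k: int = 1) -> List[SequenceStats]:
        """Return the k highest-scoring sequences recorded for a task."""
        candidates = list(self._by_task.get(task, {}).values())
        return sorted(candidates, key=lambda s: s.score(), reverse=True)[:k]
```

The agent would call `best(task)` before acting and `record(...)` after, so the ranking improves as more runs and more human feedback come in.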
Would love to see an open-source version of the internal Langchain you built and what you did differently from an architecture standpoint that made it better in your use-case.
this is precisely the problem i encountered and tried to solve with Edgechains. we think Generative AI is a config management problem (like Terraform or Kubernetes).
>None of this stuff is reusable. Langchain is attempting to set up abstractions to reuse everything. But what we end up with is a mediocre DAG framework where all the instructions/data passing through are just garbage. The longer the chain, the more garbage you find at the output.
chains X prompts X LLMs == pods X services X nodes in Kubernetes.
So we model it on top of config management grammar that is proven to work in large production config - jsonnet.
I saw your comment, got curious, and looked at a lot of your old comments. Lots of interesting insights - Thanks for sharing them.
If you don't mind me asking, what do you do? I'm a researcher at FAANG working on language models and starting a new company in the space. Would love to connect. Feel free to email me - idyllic.bilges0p@icloud.com