Are there established best practices for "engineering" prompts systematically, rather than through trial-and-error?
Editing prompts is like playing whack-a-mole: once you clear an edge case, a new problem pops up elsewhere. I'd really like to be able to say, "this new prompt performs 20% better across all our test cases".
Because I haven't found a better way, I am building https://github.com/typpo/promptfoo, a CLI that outputs a matrix view for quickly comparing outputs across multiple prompts, variables, and models. Good luck to everyone else out there tuning prompts :)
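If you want to roll your own first, the core loop is tiny. A minimal sketch in plain Python (this is not promptfoo's config format; the prompts, test cases, and pass criterion are made up, and it assumes the pre-1.0 openai SDK):

```python
import openai  # assumes the pre-1.0 openai SDK; set OPENAI_API_KEY in your env

def call_model(prompt: str, model: str) -> str:
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep runs as repeatable as possible
    )
    return resp["choices"][0]["message"]["content"]

# Two candidate prompt templates (made up for illustration).
PROMPTS = {
    "v1": "Classify the sentiment of this review as positive or negative:\n{review}",
    "v2": ("You are a strict sentiment classifier. Answer with exactly one word, "
           "positive or negative.\nReview: {review}"),
}

# A tiny test set with expected answers.
CASES = [
    {"review": "Great battery life, would buy again.", "expected": "positive"},
    {"review": "Broke after two days.", "expected": "negative"},
]

MODELS = ["gpt-3.5-turbo", "gpt-4"]

# Print a pass-rate matrix: models x prompt versions.
for model in MODELS:
    for name, template in PROMPTS.items():
        passed = sum(
            case["expected"] in call_model(template.format(review=case["review"]), model).lower()
            for case in CASES
        )
        print(f"{model:15s} {name}: {passed}/{len(CASES)} passed")
```

That gets you a "prompt B beats prompt A across the whole suite" number instead of eyeballing one output at a time.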
Seems like you would want to apply some NLP to the prompts themselves
Take the gradient of the prompt wrt adjectives, verbs, nouns, etc. I forget the technique, but they add garbage words to the prompt to effectively increase the temperature.
Isn't engineering the application of science to solve problems? (math, definitive logic, etc.)
Maybe one day we'll have instruments that let us reason about the connections between prompts and the exact state of the AI, so that we can understand the mechanics of causation, but until then, I would not think that being good at asking questions is "engineering"
Are most 10 year olds veteran "search engineers"?
Btw I'm asking this slightly tongue-in-cheek, as a discussion point. For example plenty of computer system hacks are done by way of "social engineering", so clearly that term is malleable even within the tech community.
Author of the guide here. I attempt to address this in the "Why do we need prompt engineering?"[^1] section.
> ... we used an analogy of prompts as the “source code” that a language model “interprets”. Prompt engineering is the art of writing prompts to get the language model to do what we want it to do – just like software engineering is the art of writing source code to get computers to do what we want them to do.
Borrowing from Oxford Dictionary, the definition of "engineering" is:
> the branch of science and technology concerned with the design, building, and use of engines, machines, and structures.
I think it's pretty reasonable to say that "prompt engineering" falls squarely in the realm of "technology concerned with the use of a machine".
That's extremely broad. I know how to push the buttons on my washing machine. So now I'm a laundering engineer?
Maybe the answer is "yes", but I think it's a bit silly, and detracts from the engineering work that goes into building the machine, i.e. the work of understanding the connections and flow of electricity from button press to pumps and gears turning.
The person who shoveled coal to a steam locomotive was called a railroad engineer. I guess the distinguishing feature here is how much capital is tied to the process and how much value it creates - not how intellectually demanding it is as such.
If you had a button that switched on a thousand washing machines at a time I guess one could flirt with the title "washing engineer".
> If you had a button that switched on a thousand washing machines at a time I guess one could flirt with the title "washing engineer".
Since that button probably wouldn't work correctly 100% of the time, the operator would be forced to learn about problems and solutions related to that activity. Over time, that person would accumulate experience and knowledge that would make them well suited to the ostensibly simple job of flipping the switch on and off. And at that point, they would be more of an engineer than a random person who walked up and flipped the switch.
An _anything_ engineer is just someone who does that thing and the unnamed related things necessary to do the main thing.
> The person who shoveled coal to a steam locomotive was called a railroad engineer.
Maintaining a constant head of steam while changing altitude, water level, external temperature, etc using a direct heat source is a really skilled job. Talk to anyone who works on a heritage railway and they'll explain just how hard it is.
I lived in the same city where George Stephenson invented the locomotive, and went to the same pub that the volunteers who worked on a replica went to. It's fascinating, and very much 'proper' engineering.
"Intellectually demanding" was a poor choice of word. I meant to say something along the lines "does not require to invent anything new" - which a railroad engineer does not need to do. That does not mean work they do is low-effort or trivial.
Surely we'd agree that not everyone who uses software is a software engineer, but writing software is generally agreed upon as "software engineering". If you consider my comment as a whole, and the linked document, then the application of the definition makes sense.
Large Language Models are incredibly similar to black-box non-deterministic interpreters. And prompt engineering feels very much like writing code for said interpreter.
I think here, like in a lot of places in our societies, there's a gap in language. Take the group of someone's grandma, John Carmack, a 3rd-year uni student, and a guy who writes YAML to control his home automation system.
2 words surely aren't enough. There's no way to draw one line through that group and get any real meaning (unless maybe it's above the grandma)
So, like in other parts of society, if we don't make new words, people will start using the words we already have.
I want someone to write a sci-fi book where our need for new language snowballs with increased technology and ends up being the great filter for life since it grows until no groups can understand each other.
I've never heard of non-deterministic programming languages. Is that really a thing?
(I mean - cue the JS jokes and all, but, really?)
Fwiw, to me, writing prompts feels most similar to management, directing, maybe psychotherapy (though I never worked in that, just guessing). Knowing how to adapt and use language to guide the "other" to achieve a goal, but without having insight into the exact way they go about it.
I feel like when I talk to ChatGPT I am directing it. "guide to directing AI bots" has its share of problems too, and "directing prompts" is just flat out unsexy.
> > the branch of science and technology concerned with the design, building, and use of engines, machines, and structures.
> I think it's pretty reasonable to say that "prompt engineering" falls squarely in the realm of "use of a machine".
It's not just "use of a machine", but rather "the branch of science and technology concerned with the use of machines".
That said, I don't have a stake in this battle. I just wanted to point out that it's easy to forget the earlier context when looking at lists like that!
But writing “source code” is definitively not “software engineering”. It has a name: programming. A software engineer is the person that creates a system from software mostly using software. AI-assistant engineering makes sense.
Prompt crafting is fair, informative, and accurate. And we (should) all know by now that craftsman/artisan != engineer.
Ask your favorite LLM to categorize the following statements as one of { reasonable, wishful } notions:
- writing prompts for LLMs is engineering. The science of engineering effective LLM prompts.
- writing prompts for LLMs is an art*. The art of crafting effective LLM prompts.
* almost all ‘art’ rests on fairly coherent communal ‘logic’ or ‘schools’. What distinguishes art from science is that this logic does not need to have any basis in reason. Aesthetics in the sciences, in contrast, is primarily used to select one of many contending scientific theories in ambiguous contexts (as in “the beauty of the mathematics of the unverifiable <X> theory to me implies its truth”).
As the head of the Engineering department said to the class on day one of Intro to Computer Engineering at RIT in the fall of 1999:
Engineering = Physics + Economics
The physical system here is the LLM and the computing environment that interfaces with it. Prompt engineering would be the knowledge of how to use the LLM in a programmatic manner. There are many cost-related trade-offs. More tokens, better response? Can you limit token usage while keeping the response quality basically the same?
Is it not obviously engineering and not just throwing stuff at the wall when you begin to measure the quality of responses with regards to token usage, finding which approaches work and which don't? Or how to handle security issues like prompt injections? Or techniques for using a vector database with latent space embeddings?
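On the embeddings point, the basic mechanic is small enough to sketch. A rough example (the policy snippets and question are invented, the pre-1.0 openai SDK is assumed, and a real system would use an actual vector database rather than an in-memory numpy array):

```python
import numpy as np
import openai  # pre-1.0 openai SDK assumed

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of strings with text-embedding-ada-002."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

# A toy "knowledge base" of policy snippets (invented).
docs = [
    "Travel & Expense policy: alcohol is reimbursable only at team events.",
    "Laptops must be returned within 14 days of offboarding.",
    "International per diem is $75 per day.",
]
doc_vecs = embed(docs)

question = "Can I expense wine at a team dinner?"
q_vec = embed([question])[0]

# ada-002 embeddings are unit-length, so a dot product is cosine similarity.
best_doc = docs[int(np.argmax(doc_vecs @ q_vec))]

# Stuff only the most relevant snippet into the prompt instead of the whole
# policy handbook: fewer tokens in, and the answer stays grounded.
prompt = f"Policy: {best_doc}\n\nQuestion: {question}\nAnswer using only the policy above."
print(prompt)
```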
Do 10 year olds do that when they use Google?
I think the confusion here is that people think it refers to just using the ChatGPT interface and not wiring up the API to a Python interpreter.
I agree with the tongue-in-cheek and the serious question. I was asking myself this just yesterday.
Britannica's definition is nice (the application of science to the optimum conversion of the resources of nature to the uses of humankind), but the basis of all of them is creating from science.
I'm not sure what the researched-and-validated science is behind these prompting guides. They feel like creative works and really feel like they deserve a title highlighting this.
In a related thought, I feel like we might see creative titles growing in respect over the coming years, as humans look for ways to differentiate themselves from knowledgeable machines.
If programming is alchemy, then prompt engineering is more akin to demonic conjurations (communicating with demons):
- the Demon knows almost everything, but it can/will lie and trick your bullshit filters so well
- then, there's the old trope that you, the conjurer, might be able to trick the demon and avoid paying the price for its services (prompt injections) †
(dust off your Ars Goetia, my friend)
But, in all seriousness, here's one definition I like:
engineering:
the branch of science and technology concerned with the design, building, and use of engines, machines, and *structures*.
I read structures in a more generic sense here (e.g. structures of information). Prompt engineering requires skill, e.g. understanding how transformers work at a basic level will make you better at it, but the nature of the work is still fuzzy. It's more intuition guided by some general understanding of the system.
I think prompt design would be a better term.
Words shift their meanings and language is a river, so I'm ok with engineering being used this way.
I do have a bigger issue with using AI as an umbrella term, because it seems misleading on purpose. The new meaning is a product of a bunch of people trying to make it sound magical so more money changes hands. That's active disinformation and it's making us collectively slightly more stupid.
Some people are way too butt-hurt over a choice of phrasing in a language that absolutely allows for this kind of flexibility. But mostly it's probably some dork in the back fuming away, muttering "I'm a REAL engineer! I went to school to get that title! How dare these intellectual infants abscond with MY valor!"
> Isn't engineering the application of science to solve problems? (math, definitive logic, etc.)
Most definitely not. Engineering more often than not precedes the science. You don't have to have an analytical and theoretical understanding of something in order to harness it and make practical use of it.
In reality, there's a strong feedback loop, where practical hacks guide the science, the development of which can then be used to make better practical applications, which can uncover new unknowns, which can then be incorporated into the science, and so on. The development of electricity and magnetism, or of steam engines and thermodynamics, are both great examples of this.
1. Do we start calling travel agents "aviation engineers" because they use airplanes to solve problems?
2. I sure hope the engineers who build planes aren't just winging it (ba doom pssh)
3. Context matters. Scientific knowledge increases over time, but isn't there a difference between willful and unavoidable ignorance? If an engineer from the early days of flight were transported to SpaceX today, ignored the math, and constructed rockets that we know with certainty will not fly, convinced they will work through sheer inner conviction, that's no longer engineering imho. Even if it may have been in their time, as long as they were using the known science of their day.
Design is part of engineering, but engineers will go on to build, for example. Many folks like to gatekeep the use of engineer as a word. Mostly it just comes down to the fact that a discipline hasn't matured yet and isn't taught in a rigorous and formal manner. It is still engineering. See Network Engineering, for example. Plenty of people are building complex systems at scales never before seen. Even more who just about know how traceroute works are still building networks. They are all engineers of a young field of engineering.
Did you even read the article? You are writing programs whose output will be put in hidden prompts. Nobody is claiming that users are prompt engineers. Math is a dying art, programming is a science.
Is it me or is the bot's output in the section "Give a Bot a Fish" incorrect? It states that the most recent receipt is from Mar 5th, 2023 but there are two receipts after that date. This is what worries me about using ChatGPT - the possibility of errors in financial matters, which won't go down well I fear.
Thanks very much for posting this! I haven't yet finished reading the whole thing, but even just the first section about the history of LLMs, explaining some of the basic concepts, etc., I found to be very well-written and useful, and it was really nice that it linked out to source material. So many times when you go to read about the latest AI technique or feature, it can feel like you need to do a ton of background reading just to understand what they're talking about (especially as the field moves so quickly), so having a nice simple primer at the beginning of this doc was much appreciated!
I don't believe that it is generative, but I totally understand where you're coming from, because it's an entire paragraph's worth of what amounts to simply thanking the author for the project, and it doesn't add any salient, meaningful content to the conversation.
We'll return to the age of laconic 90s snark, at least until bots start preloading with "Pretend you're an asshole teenager on the NYC subway in the 1990's, ...".
That or we kick into a world of ultra short lived memes, where the models will always lag. A rolling cipher of no-context memetic emojis to indicate someone is behind the keyboard.
I do, actually. While I personally am excited about LLMs' potential for good, I think a large swath of the world is ready and equally excited about their potential for harm / spam / fraud / etc. I'm already seeing bots popping up all over other social channels (Reddit in particular), posting overly cheery LLM-generated content designed to build up high-karma accounts, which then get bought and sold on the not-so-open market so that end users can be further spammed / defrauded. It sucks.
So I'm personally curious in fine-tuning my own "algorithm" for detecting fake content, and pointing it out is helpful for me and, I presume, others who think similarly (and I know there are others).
In this case I may absolutely have been wrong, but even the comments above added to my knowledge and helped.
So yes, I found it very constructive. I hope others did too.
The suggestion to use markdown tables was quite interesting. It makes a lot of sense, and I haven't seen it described elsewhere.
I have been getting good results by asking GPT to produce semi structured responses based on other aspects of (GitHub) markdown.
In general, I find it very helpful to find an already popular format that suits your problem. The model is probably already fluent in rendering that output format. So you spend less time trying to teach it the output syntax.
I've even had it generate SVG from Graphviz dot syntax reasonably well, including doing basic layout. It's not great (nowhere near good enough to rely on), but given the complexity of graph layout algorithms that's not surprising. That it can even start to do it and deal with the visuals (e.g. try to avoid overlap etc.) was pretty impressive.
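To make the table idea concrete, the helper can be trivial; something like this (the receipt data is invented, just to show the shape):

```python
def to_markdown_table(rows: list[dict]) -> str:
    """Render a list of dicts as a GitHub-style markdown table."""
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)

receipts = [
    {"date": "2023-03-02", "merchant": "Cafe Uno", "amount": 14.50},
    {"date": "2023-03-05", "merchant": "Lyft", "amount": 22.10},
]

prompt = (
    "Here are my receipts:\n\n"
    + to_markdown_table(receipts)
    + "\n\nReply with a markdown table of the same receipts sorted by amount, largest first."
)
```

Asking for the answer back in the same table format also makes the response trivial to parse.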
Thanks for pointing this out. That was my mistake – my brain must have swapped out "different transformer architectures" with "different model architectures".
Yeah, I think the (worrying) confusion is that Amazon calls it a seq2seq model, which was the name of a SOTA RNN from Google a while back.
Ofc now, seq2seq just means what you said (an encoder/decoder model, which is actually what a “truly vanilla” transformer would be anyway).
The fact that any serious researcher thinks any other serious researchers are using models without self attention is the real red flag here.
No one is trying to use other models anymore because they do not scale. There’s enough variety within transformers that you could argue we need a new level of taxonomy, but transformers are basically it for now.
This reflects astonishingly poorly on Brex. What customer wants to hear that Brex is using "a non-deterministic model" for "production use cases" like "staying on top of your expenses"? I don't see them acknowledge the downsides of that non-determinism anywhere, let alone hallucination, even though they mention the latter. Hallucinating an extra expense, or missing one, could have serious consequences.
This is also potentially terrible from a privacy standpoint. That "staying on top of your expenses" example suggests that you upload "a list of the entire [receipts] inbox" to the model. It _seems_ like they're using OpenAI's API, which doesn’t use customer data for training (unlike ChatGPT), but they should be crystal clear about this. Even if OpenAI doesn't retain/reuse the data, would Brex's customers be happy with this 3rd-party sharing?
The expenses example seems like sloppy engineering too—there's no reason to share expense amounts with the model if you just want it to count the number of expenses. Merchant names could be redacted too, replaced with identifiers that Brex would map back to the real data. These suggestions would save on tokens too.
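Something along these lines would do it (field names and data invented, just a sketch of the idea):

```python
# Swap merchant names for opaque IDs (and drop amounts when the task doesn't
# need them) before anything is sent to the third-party API; the mapping back
# to real values never leaves Brex's side.
def redact(expenses: list[dict]) -> tuple[list[dict], dict[str, str]]:
    mapping: dict[str, str] = {}
    redacted = []
    for i, expense in enumerate(expenses):
        token = f"MERCHANT_{i}"
        mapping[token] = expense["merchant"]
        redacted.append({"id": token, "date": expense["date"]})  # amount omitted
    return redacted, mapping

expenses = [
    {"merchant": "Acme Catering", "date": "2023-03-02", "amount": 1250.00},
    {"merchant": "Delta Air Lines", "date": "2023-03-05", "amount": 840.33},
]
safe_view, mapping = redact(expenses)
# safe_view goes into the prompt; `mapping` restores real names in the output.
```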
Despite Brex saying they're using this in production, I suspect it's mostly a recruiting exercise. It's still a very bad look for their engineering.
This seems overall well-written and well-explained, but I'm curious about the piece on fine-tuning. The article only recommends it as a last resort. That makes sense for a casual user, but if you're a company seriously using LLMs to provide services for your customers, wouldn't the cost of training data be offset by the potential gains and the edge cases you might automatically cover by fine-tuning, instead of trying to whack-a-mole predict every single way the prompt can fail?
The concern with finetuning, even for specialized use-cases, is that you are binding yourself to the underlying model. Given rapid advancements in the field, this does not seem a prudent use of engineering time.
Having a hierarchy of prompts with context stuffing allows for rapid switching across models with a few (non-trivial) surface-level prompt updates while the deeper prompts stay static.
YAML is just as effective at communicating data structure to the model while using ~50% fewer tokens. I now convert all my JSON to YAML before feeding it to the GPT APIs.
I've heard this a lot but don't understand where this idea comes from. With JSON you can strip whitespace whereas with YAML you're stuck with all these pointless whitespace tokens you can't do anything about.
I would recommend the exact opposite, JSON is just as effective while using less tokens.
This example JSON:
{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}}}}
Is 112 tokens, and the corresponding YAML (which I won't paste) is 206.
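This is easy to reproduce for any payload with tiktoken (a quick sketch; the glossary is truncated here, and PyYAML is assumed for the YAML side):

```python
import json
import tiktoken
import yaml  # PyYAML

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

data = {
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {"title": "S"},  # truncated; use the full example for real numbers
    }
}

variants = {
    "compact JSON": json.dumps(data, separators=(",", ":")),
    "pretty JSON": json.dumps(data, indent=2),
    "YAML": yaml.dump(data),
}

for label, text in variants.items():
    print(f"{label}: {len(enc.encode(text))} tokens")
```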
This is fair, typically I supply data as compact JSON but ask for responses as pretty printed JSON which is quite a large token penalty but tends to strongly reduce malformed JSON outputs.
Hey there, I'm the author of the prompt engineering guide (and run Brex's Office of the CTO) – I can pull back the curtain a little bit.
I firmly believe that the introduction of LLMs will be as essential to the future of human-computer interaction as the introduction of the mouse and keyboard were. Every technology company with sufficient resources should be exploring the implications of LLMs on their business.
Most of our research for LLMs falls into three buckets:
1) Internal processes – Any area where an employee is writing or reading some kind of communication is up for grabs.
2) Developer productivity – Whether it's improving the quality of code, reducing time to implementation, or answering questions about our services and architecture... empowering our eng team with LLMs is an area of major interest.
3) Customer Experience – Employees frequently need to write memos for expenses or have questions about Travel & Expense policies (e.g. "Can I buy alcohol at this team dinner?").
We're, of course, exploring far more interesting use cases than just what's above, but that's the low hanging fruit.
The release is marketing, both to customers and for hiring. I have zero doubt they made this because they're using LLMs themselves, not just faking it "fellow kids" style.
LLMs are valuable to all companies of any significant size, just for knowledge management. Then, for a bank specifically, there are a ton of text classification and summarization tasks when it comes to expense management, bill pay, and all the other services they offer. There's also internal stuff. Fraud and KYC would be helped a lot.
Let's say that you invest in 100 promising technologies. If one of them becomes a black swan, even if all the other 99 failed, you would still be winning big.
After looking at their 'AI-enabled' products, it looks more like pure gimmickry.
But at least this isn't a product they would expect to make the majority of their money from since Brex makes its money elsewhere and can afford to waste it on gimmicks and experiments like this.
Unlike the thousands of new so-called AI startups which are just a thin wrapper around OpenAI's API and have no moat.
When Big Data was becoming the hot new thing, I saw people arguing that companies would inevitably need librarians and Masters in Library Science holders etc to wrangle all the information (I guess?)
Perhaps, but this space suffers from the Armageddon astronaut/firefighter problem. It is easier to teach a computer science major good English than it is to teach an English major computer science.
I don't think prompt engineering shares a whole lot with humanities higher education, but the argument that English is easier seems very non-obvious to me.
I'd expect the relative skill floor for English to be significantly higher because everyone in the anglosphere gets a minimum of 12 years of intensive English education. Moreover, the starting point for additional higher education is usually native fluency/mastery and even then jobs requiring actual English credentials will usually require a graduate degree on top.
Contrast that with programming, where most developers will have learned their first language in college and those 1-3 years of introductory education were enough that they're usually considered hireable by the end of undergrad. Some (few) people can even get to that point with only a 3-6 month intensive bootcamp despite no prior experience.
I always thought this problem was better framed as the software engineer/farmer problem - it's easier to teach a software engineer about agriculture than the other way around.
Software engineering can be self-studied very cheaply, lots of free resources, mostly you just need time and motivation. Failure is usually cheap.
Farming on the other hand requires more local, implicit and hands-on knowledge, capital requirements are high, feedback cycles are slower and failure is expensive.
I would argue the same is true of farming as well; you're just comparing differences in final outcomes. You can start learning basic farming techniques with a few pots, some seeds, water, and a sunny spot, in the same way you can get a Raspberry Pi and a keyboard, sourcing information for both from YouTube fairly effectively.
There might be an argument for cost at the highest scale in each field which few in either profession really make it to but I'd bet even then its pretty on par.
I've been a huge fan of comparing the complexity or cost of professions. We as a species specialize because doing these tasks in the most efficient way requires managing a large number of details that are not obvious at first glance.
Really? You think someone who spends all day building systems with software will have an easy time studying and gaining a CDL, learning about soil quality, drainage, fertilization, crop strategies, financing agreements (lease to own agreements, commodities futures, even forex), working with hired labor and managing contractors, engine repair and small scale fabrication? I think the farmer would undoubtedly get bored of being cooped up inside all day at a computer, but they wouldn't have a hard time understanding how to create a system from small moving pieces that they can direct.
Substantiate your claim! A farmer is a much better generalist than a "software engineer". FFS, the engineer doesn't even use science and the farmer does.
I don't think prompting LLMs requires particularly "good" English in the first place. You can say a half-baked sentence with typos and it'll still make sense of it.
Plus, when you go meta and ask LLMs to generate prompts for themselves, your own language proficiency becomes even less important.
I do think English/language proficiency will help with Generative image AIs like Midjourney. Like if someone could describe a scene in extreme detail that's more likely to produce a result closer to what you want.
f*cking Meta. it took me a second to parse your sentence. at first I thought you meant using Llama or something. instead of prompts generating prompts.
At the speed this stuff is moving, prompt engineering may just be a fad, and the next wave of models may change things again.
If they persist as a thing, then most of the hard work will just be abstracted away using a standard library of prompts available in a point and click fashion for 99% of use cases.
That's all I think when people espouse the "become a prompt engineer instead!" lines, as if the end goal isn't to remove that exact friction. Otherwise we'd be learning S-LLM-QL instead of "talking" to the bots.
Perhaps this may have been the case early on, but if you observe the trends especially with LLMs moving to zero shot, it's becoming progressively easier to express nuanced instructions with relatively simple prompts.
One thing I haven't heard much discussion about is the fact that ChatGPT is constantly being updated.
This means that if you build a prompt for classification and become confident that you've whacked all of the moles so that it is pretty solid with all of the edge cases, it can later start breaking again.
Some solutions I can think of are 1) choosing a fixed model version to test against, though those get deprecated over time, or 2) perhaps fine-tuning might help.
That's a valid concern, I think just like with any other software you need to write tests for the AI model to constantly check if your prompts are working as intended. Basic unit tests would work well in this case.
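For example, something like this (the prompt, cases, and pinned snapshot name are just placeholders, pre-1.0 openai SDK assumed; note these tests hit the live API, so they're slow and cost money):

```python
import openai  # pre-1.0 openai SDK assumed
import pytest

PINNED_MODEL = "gpt-3.5-turbo-0301"  # pin a dated snapshot rather than the moving alias

CLASSIFY_PROMPT = (
    "Classify the following support ticket as one of: billing, bug, other.\n"
    "Answer with the category only.\n\nTicket: {ticket}"
)

def classify(ticket: str) -> str:
    resp = openai.ChatCompletion.create(
        model=PINNED_MODEL,
        temperature=0,  # reduces, but doesn't eliminate, run-to-run variance
        messages=[{"role": "user", "content": CLASSIFY_PROMPT.format(ticket=ticket)}],
    )
    return resp["choices"][0]["message"]["content"].strip().lower()

@pytest.mark.parametrize("ticket,expected", [
    ("I was charged twice this month", "billing"),
    ("The export button crashes the app", "bug"),
])
def test_classification(ticket, expected):
    assert expected in classify(ticket)
```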
The differences here compared to unit tests are that the breakage is outside of your control and the updating process is tedious. Also, testing requires making real API calls rather than using stubs, so it needs additional infrastructure.
That's all true, but sometimes things break because some package is bumped. I'd still like to know if my app is basically broken if an LLM has changed somehow.
You are right that fine tuning would probably help to minimize the risks, but it probably never can be zero. New tests will also be needed when customers find new edge cases that break our assumptions.
Testing LLM prompts is a new paradigm that we'll have to learn to deal with.
I've been playing Gandalf in the last few days, it does a great job at giving an intuition for some of the subtleties of prompt engineering: https://gandalf.lakera.ai
I also found the nondeterministic behavior of Gandalf robbed it of being "fun," to say nothing of the 429s (which they claim to have fixed but I was so burned by the experience I haven't bothered going back through the lower levels to find out)
Here are a few more great resources from my notes (including one from Lilian Weng who leads Applied Research at OpenAI):
- https://lilianweng.github.io/posts/2023-03-15-prompt-enginee...
- https://www.promptingguide.ai (check the "Techniques" section for several research-vetted approaches)
- https://learnprompting.org/docs/intro