Had to look up the technique, the Eguchi method, and it uses color rather than complicated musical notation to associate with each key. Interesting how those who have synesthesia naturally have this same color-key association.
I have a step in my fashion work where I pause and consider what the colors I’m selecting will look and feel like to a normal person. Converting back and forth between [optimistic-seaglass-springtime] and “teal” isn’t very accurate, but I’m certainly accustomed to it. I’m going to try this technique soon now that I know about it; as I’m already color-sensitive to pitch, I suspect the value will be in training that sensitivity rather than memorizing the keys’ hues.
Most of what we see on Twitter or YouTube is Blind Prompting. However, it is possible to apply an engineering mindset to prompting, and that is what we should call prompt engineering. Check out the article for a much more detailed framing.
Dair AI also has some nice info and resources (with academic papers) about prompt engineering.
Prompt testing, especially for Q/A pairs where there are multiple right answers, has been bugging me a lot.
The article is reasonable, but it also shows a big gap in tooling, as the techniques there feel closer to linting & typing than to testing once you do more interesting prompts. They don't check the interesting parts.
We are helping our users with Q/A tasks involving code generation, where the answers may be JSON, executable code, or markdown discussions involving the same. We are tuning for a bunch of tools following that pattern so our users don't have to.
It's easy to make a labeled training set for grading our homework (catching regressions, ...) in the case of classifiers, and that's basically what the blog post showed.
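Concretely, something like this is what I have in mind for the classifier case (just a sketch, assuming the OpenAI Python client; the sentiment prompt, the labels.csv file with text/expected_label columns, and the 0.90 threshold are made-up placeholders):

    # Regression-test a classification prompt against a small labeled set.
    # Assumes openai>=1.0 and OPENAI_API_KEY in the environment.
    import csv
    from openai import OpenAI

    client = OpenAI()

    PROMPT = ("Classify the sentiment of the following text as positive, "
              "negative, or neutral.\n\nText: {text}\nLabel:")

    def classify(text: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # keep runs as repeatable as the API allows
            messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        )
        return resp.choices[0].message.content.strip().lower()

    def run_regression(path: str = "labels.csv") -> float:
        hits, total = 0, 0
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += 1
                if classify(row["text"]) == row["expected_label"].lower():
                    hits += 1
        return hits / total if total else 0.0

    if __name__ == "__main__":
        accuracy = run_regression()
        # Fail the run if accuracy drops below the last known-good score.
        assert accuracy >= 0.90, f"prompt regression: accuracy {accuracy:.2%}"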
What about for the above Q/A tasks? We can ask GPT-4 whether a generated A was a good answer for a Q, but that's asking it to grade itself. Likewise, in the code case, we can write unit tests for the answers. (Trick: we use the former to more quickly do the latter.) But I feel like there have to be better ways.
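For the grade-itself part, the shape is roughly this (again a sketch; the judge prompt and the YES/NO parsing are my own conventions, not anything standard):

    # Ask GPT-4 to grade a generated answer for a question ("LLM as judge").
    # Assumes openai>=1.0 and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are grading a Q/A system.
    Question: {question}
    Candidate answer: {answer}
    Is the candidate answer correct and complete? Reply with exactly YES or NO,
    then one sentence of justification."""

    def judge(question: str, answer: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

The trick I mentioned is just using judge() as a fast triage over generated code answers, then promoting the YES cases into a human-reviewed unit-test suite so the model isn't the final grader.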
Another issue: OpenAI keeps updating its models based on usage, so we have to be sure our tests are real holdout sets that never get back to them...
I don't think LLMs are going to be able to solve that. There are a number of things that are assumed to be true but may not necessarily be true. This can potentially lead to multiple possible answers (outputs) given the same inputs.
For example, determinism in code: it's required for computation and it's a system property, but generalizing a test for it is really hard. It's a property, and by knowing whether it holds you can make inferences about whether a system maintains it. But most of this is abstracted away at lower levels, and since the context can't ever be fully shared with an LLM for evaluation, nor can the LLM automatically switch contexts when evaluation fails, this most likely will never be solvable by computers when a single input can produce two different outputs, at least from what I know about automata theory and computability.
It's generally considered a class of problems that can't be solved by Turing machines.
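To make the single-input, multiple-outputs point concrete (a toy example of my own, not from the article):

    import random
    import time

    def flaky(x: int) -> int:
        # Same input, different outputs across calls: the result depends on
        # hidden state (the RNG and the wall clock) that isn't part of x.
        return x + random.randint(0, 1) + int(time.time()) % 2

Deciding in general whether arbitrary code has this property is the kind of semantic question computability theory rules out, and the hidden context (RNG state, the clock) is exactly the part that can't be fully shared with an LLM evaluator.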
This flexibility is a big draw. I can experiment with in-memory, launch with cloud, and move to my own infra if I’m lucky enough to need that kind of scale.
Interesting read. I’m curious if anyone is familiar with the software they use to connect the genetic sequencing data. I’m amazed that they can find migration patterns from this data.
A JavaScript app that automated the resolution of half of Twitter’s support tickets. The logic got refactored after a few years, but it’s still used at Twitter. It probably saved Twitter about $10 million a year over the last ten years.