Building Boba AI: Lessons learnt in building an LLM-powered application

mvdtnz · on June 29, 2023

I am so despondent at the lack of creativity in most of the (many, many) LLM powered projects that are popping up. I have seen hardly a single thing that goes beyond "it's a chat bot, but with a special prompt". Like, is this the best we can expect from this supposedly ground-breaking technology?

mark_l_watson · on June 29, 2023

I agree. I wrote a book on LangChain and LlamaIndex [1] and was initially very enthusiastic about possible applications. However, most of what I now do is just writing simple scripts to interact with my data. I feel like my “lines of code per month” metrics are at an all time low. I wanted a local chat interface that worked with all the books I have written, and various local PDFs that I have collected on the semantic web and other technologies. I ended up with two Python scripts that use a new library embedchain (which uses LangChain) that total 30 lines of code. So easy to do this stuff, that I am not so sure about the idea of using products from the new flood of LLM startups. All of this does require either using the OpenAI APIs, or the Hugging Face APIs, or renting something like a Lambda Labs GPU server and running a 33B model (if you use FastChat, you get an OpenAI compatible API).

I would urge companies and people to build their own stuff because there is so much value in learning the tech. For off the shelf LLM tools, it is hard to beat OpenAI’s web app, Microsoft Bing+ChatGPT and Office 365, and Google’s beta Bard integrations with Google Docs, etc.

[1] https://leanpub.com/langchain

jerrygenser · on June 30, 2023

Part of this post is very similar to a famous reply from Dropbox launch on hb when someone said roughly: "why do we need Dropbox, here's steps to roll your own".

The average non tech person doesn't know what an http API is.

popcorncowboy · on June 30, 2023

This trope gets trotted out every time someone casts a skeptical eye on buzzy tech. In this case GP is saying that OpenAI, Microsoft and Google's apps are already "Dropbox-y" enough and that (paraphrasing fairly hard) the flood of "AI" startups are wafer thin combinations of "Bootstrap UI + GPT API calls + a vector store".

Maybe there's a killer unicorn app hiding in one of those wafer thin wrappers, but OP's point is precisely that it's disappointing how wafer-ish all these "exciting applications" actually are, relative to the massive sense of expectation that existed only a few months ago.

ekabod · on June 30, 2023

>[1] https://leanpub.com/langchain

Too short to call it a book.

mark_l_watson · on June 30, 2023

For what it's worth, I am working on a new chapter, which will be the 5th update to the book. This topic is a moving target!

tannedNerd · on June 30, 2023

I mean if DnD can release a book that’s 43 pages long (with many pictures), this more than qualifies

carlossouza · on June 30, 2023

Remember that the most downloaded app in Apple Store’s first year was this infantile thing called iBeer… it actually hit the mark of $20k of sales/day

This was an extreme bizarre example, but most mobile apps from the first wave were also pretty dull

Give it some time. I think something similar will happen: most LLM apps from this first wave will be forgotten soon, and when people shift the focus from the technology back to the users and their real pain points, the good stuff will surface

ebiester · on June 29, 2023

Because these are the people who are trying to "front run" with an MVP. They think if they can build something and get feedback, they'll be ahead with their fuller vision.

I see a lot of companies that are experimenting with it, and there are a few press releases here and there, but there is a next generation that will be more ambitious. You just can't build more ambitious in the timeframe yet.

rvz · on June 29, 2023

Exactly.

All these projects are desperately slapping buzzwords such as ‘LLMs’, ‘AI’, ‘ChatGPT’, etc to pretend that their product is somehow revolutionary despite sitting on someone else’s AI model in the cloud via an API that they do not own.

They are slowing realizing that there is little to no moat with these LLMs let alone any serious use cases other than summarizing / rewording text.

Everything else requires the triple checking of another human to review the bullshit that the so-called black-box AI outputs.

bugglebeetle · on June 29, 2023

Most of the stuff it’s actually good at (like NLP tasks) are both super boring and require a secondary layer of processing to catch hallucinations. Not as cool of a sales pitch to everyone on the “it’s alive!” hype train.

fhd2 · on June 30, 2023

I've been working on NLP stuff using LLMs for a while, and it's not that the problems aren't somewhat interesting, but solving them is ridiculously tedious.

Most of the time on that project I've just played around with different prompts to make it do what I'm hoping for, with almost no mental process for understanding problems and solutions, mostly just randomly coming up with experiments and reading the results very carefully, checking how consistent they are.

I moved towards finding and isolating the parts LLMs are good (and reliable!) at and using deterministic approaches for everything I possibly can. That part is not too tedious, but all this black box trial and error (with all the waiting and errors)...

Luckily the client doesn't expect LLMs to be what the hype says and mainly just wants some reasonably useful features so they can say they use AI - wouldn't enjoy dealing with someone thinking it's super easy and I just need to write a few scripts because they tried this in ChatGPT once.

msp26 · on June 29, 2023

Any advice for people building the second layer stuff?

bugglebeetle · on June 30, 2023

Just basic sanity checking. You can use an LLM call for this or something procedural. Generally, it’s just of the order of like “does the output conform to this structure”, “is the returned text actually in the input” etc

pertymcpert · on June 29, 2023

Same. You just know most of the paid apps are going to be abandoned in a few months.

brucethemoose2 · on June 30, 2023

Thanks to this comment:

https://news.ycombinator.com/item?id=36529885

I realized how primitive even pure prompting is. Stable Diffusion is kind of primitive too, but the input/prompting methods it has are lightyears ahead.

Der_Einzige · on June 30, 2023

All techniques for prompting in Stable Diffusion work with regular LLMs. I have complained bitterly about this lack of tooling in NLP and wrote a github gist with sample implementations to prove this.

NLP folks don't know shit about prompt engineering, which is ironic.

https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...

I still can't emphasize certain tokens in ChatGPT, or mathamatically average them. Not sure why NLP folks don't bother implementing these things, even in the oogabooga frontend (which is supposed to be the automatic1111 of LLMs)

JonathanFly · on June 30, 2023

I just realized we already chatted briefly about banned tokens (I'm the Grover tongue twister guy) but I somehow completely missed this gist at that time. Total facepalm moment, would have been helpful reference.

Der_Einzige · on June 30, 2023

I'd love to continue this conversation more, as I was going to write a long reply to your previous comment. Please reach out to the email in my "about" box. Would love to chat!!!

JonathanFly · on June 30, 2023

AHhhhhhhhhhhhhhhhhhhhhhhhhhhhh. That's me screaming. I constantly wonder why this stuff does not exist in LLMs. But my technical depth and competence is quite low. Way lower than the people implementing the models and samplers. So I just assume: there must be a good reason, right. Right?

But recently I threw a just a bit of similar-ish stuff as you describe there into a TTS model, barely knowing anything, and yeah it's totally works and is fun and cool. The stuff that doesn't work fails in interesting and strange ways, so it almost STILL works. (Well, it gives people really bizarre speech impediments, at least...)

I was just working on prompt editing actually. Which is weird to imagine in a TTS model. It makes sense for the future tokens of course, for words the model has not said yet. I think it even makes sense for the past right? You can rewrite the past context, and it still changes future output audio model. In bark it's two different things: one is the text prompt, and one is the generated audio tokens/context, which is not the same. (The text and the past audio is concatted in the Bark prompt, so this idea makes sense in Bark but not in other models. You could change either text OR 'what was generated with the text' independently.)

As long as you don't rewrite the time touching the last token, at 0 seconds - if it's like a segment 2 to 4 seconds in the past, it should influence future output but not cause a discontinuity in the audio. I think?

BTW an easy and fun thing - just let generation parameters be dependent variables. Of anything.

A trivial example: why is temperature just a number, why not a function? Like the temp varies according to how far long in the prompt you are. For music, just that is already a fun tool. Now as a music segment starts or ends the style transitions. Or: spike the temperature at regular intervals - like use a sine wave for temp, input is current token position. You can probably imagine that works great in music model.

Even in a TTS model this you can get weird and diverse speech patterns.

The thing is: I really very a low level of competence. Total monkey hitting keys and googling, and even I can make it work, easily. Sampling is just a loop, okay, what if I copy logits from sample A and subtract them from sample B. What if take the last generation, save the tokens, ban then in the next. Really just do anything and you end up in interesting places in the model you didn't know existed and are often cool. (Recently, TTS output overlapping speech, for example.)

Like I recently generated french accents from any voice in the Bark TTS model, with no fine-tuning, no training, actually not even really any AI. Just by counting token frequencies in the french voices, and having the sampler loop go, "Okay let's bump these those logits up a bit, and the others down" and it just somehow works. No Loras, no fine-tuning, not stats, it's like middle school level math, but sounded great.

(I'm in a bit of a stream of consciousness ramble mode from lack of sleep, but I'll keep going on this message anyway so I don't forget to come back to your post when I'm back at normal capacity. And just hope I don't cringe too hard reading this when better rested.)

Oh I'd love to hear your thoughts on negative prompts in LLMs.

1) What does 'working correctly' look like?

For an audio LLM, I'm thinking something like: a negative prompt "I'm screaming and I hate you!!!" makes the model more inclined to generate quieter, friendly speech, in your positive prompt. Something like that?

2) How to make it work.

This is probably very model dependent and fiddly. My first thought was generate two samples in sequence. The first sample is the negative prompt. Save all the logits and tokens. Use them as a negative influence in the second prompt. At least in Bark you can't just like flat subtract them or what you actually get is more like 'the opposite of speech' than 'the opposite of your prompt' but when I did french accents I basically just fiddled with a bunch of constant values and weights and eventually it worked. So I'm hoping the same applies. I can imagine a more complicated versions where you do some more math to figure out what's unique about a text prompt, versus 'a generic sentence from that language' and only push on those logits. I suppose that might be necessary.

herval11 · on June 29, 2023

>Like, is this the best we can expect from this supposedly ground-breaking technology?

There's a insidious reason for this. It's because LLMs are one of the few technologies we don't fully understand and we can't fully control.

It's a stark contrast with traditional engineering.

potatoman22 · on June 30, 2023

There's cool stuff going on in medicine https://www.nature.com/articles/s41586-023-06160-y

magicseth · on June 30, 2023

I'm genuinely curious how you would classify my product, Wonder an AI powered browser for kids: hellowonder.ai

Does it cross the threshold?

tuchsen · on June 30, 2023

Yeah I'll raise my hand here to say I did a file manager that automates file management tasks by using AI to write Bash scripts (Aerome.net). It's still super primitive though, if I'm being honest. I think the problem is that it's way harder to write a cross platform file manager, or browser wrapper in your case, then it is to write a chat interface on top of ChatGPT. I suspect in a year or two many good use cases will emerge, as people write more complicated software to take advantage of LLM's capabilities.

I'm going to check out you're browser thing later tonight, it looks good!

golergka · on June 30, 2023

The fact that this tech gave us so many low hanging fruit is the testament of how powerful it is.

jasfi · on June 30, 2023

I'm working on something to help you code with LLMs with projects of any size. If you go to https://inventai.xyz, sign-in with your email, you'll be notified on release.

I started off with something to create AI art, but that didn't really take off. I'm also disappointed with Dall-E 2 lagging behind the others in terms of image quality. So now I'm focusing on code generation.

akiselev · on June 29, 2023

> Along the way, we’ve learned some useful lessons on how to build these kinds of applications, which we’ve formulated in terms of patterns.

    * Use a text template to enrich a prompt with context and structure
    * Tell the LLM to respond in a structured data format
    * Stream the response to the UI so users can monitor progress
    * Capture and add relevant context information to subsequent action
    * Allow direct conversation with the LLM within a context.
    * Tell LLM to generate intermediate results while answering
    * Provide affordances for the user to have a back-and-forth interaction with the co-pilot
    * Combine LLM with other information sources to access data beyond the LLM's training set

manojlds · on June 29, 2023

The short courses from dl.ai are better at driving these points - https://www.deeplearning.ai/short-courses/

frankgrecojr · on June 29, 2023

> Stream the response to the UI so users can monitor progress

This is a game changer to the UX

senko · on June 29, 2023

It's a crutch to minimize the user annoyance at having to wait up to a minute for the response. It sure beats the spinner but it's still a crutch.

thelittleone · on June 30, 2023

Really? I feel like in most uses the stream is close to, if not slightly faster than my reading. I actually prefer that over an instant full-page response. It helps me keep my place in the text and feels like reduced cognitive load.

golergka · on June 30, 2023

Also, the vibrations when LLM “types” in OpenAI iOS app are so satisfying for some reason. It may be a stupid subjective thing, but from my experience in game dev, that's exactly the kind of detail that creates the overall user experience feeling.

behnamoh · on June 29, 2023

Actually, it's annoying because as you start reading the first lines, the content keeps scrolling (often with jagged movements). I always have to scroll up immediately after the stream begins to disable this behavior.

tobr · on June 29, 2023

That’s totally fixable, though. ReadyRunner handles it simply by scrolling all the way from the start, leaving space for the message to grow.

trafnar · on June 29, 2023

Hey, that's my app! https://www.readyrunner.ai

darkteflon · on June 30, 2023

That looks really nice. I do wish you would consider one-time pricing for those bringing their own API key on the “Dev” plan, though :) I’d pay ~$20-30 for a nice desktop app like this but won’t enter into another $48/year sub.

jamifsud · on June 29, 2023

Anyone know of any good “tolerant” JSON parsers? I’d love to be able to stream a JSON response down to the client and have it be able to parse the JSON as it goes and handle the formatting errors that we sometimes see.

icyfox · on June 29, 2023

There's no bulletproof solution to this. JSON5 (https://www.npmjs.com/package/json5) gets you slightly more leniency, as does plugging the currently streamed content into another smaller LLM. I also wrote a deterministic parser more tailored towards these partially-complete LM outputs. Not perfect certainly but handles the 99% of cases well: https://github.com/piercefreeman/gpt-json. In particular the "streaming" functionality here might be of interest to you.

darkteflon · on June 30, 2023

This looks really cool, thanks for open-sourcing this. I’ve been similarly parsing and validating output from OpenAI’s new functions using a schema defined on a custom Pydantic class, but I can see that your code has a lot of niceties coming from proper battle-testing, including elegant error handling, transformations and good docs.

I’d like to incorporate this in a production workflow for generating schema-compliant test data for use in few-shot promoting - would you mind saying a few words about your medium term plans for the library? The LangChain API is changing all the time at the moment so we’re trying to figure out where it’s safe to stand. No expectations, of course, just curious.

icyfox · on June 30, 2023

Sure - I'm using it in a few different internal tools and know others are using it in production. The API should be relatively stable at this point since I intentionally kept the scope pretty limited. The main changes over time will be improved robustness and error correction as issues report different JSON schema breaks that we can fix automatically. Let me know if you see more cases that can be addressed here, would love to collaborate on it.

darkteflon · on June 30, 2023

Thanks! Absolutely, will do. I’ll have a play with it today and reach out with a PR any time it makes sense to do so.

I noticed some occasional funkiness from GPT-4 around sending back properly formatted dates yesterday but haven’t yet dug into it properly. Might be a good candidate for a transformation.

anentropic · on June 30, 2023

> I’d love to be able to stream a JSON response down to the client and have it be able to parse the JSON as it goes

why though?

jamifsud · on June 30, 2023

In a non chat setting where the LLM is performing some reasoning or data extraction it allows you to get JSON directly from the model and stream it to the UI (updating the associated UI fields as new keys come in) while caching the response server side in the exact same JSON format. It’s really simplified our stream + cache setup!

n4te · on June 30, 2023

My JsonReader in libgdx and JsonBeans can parse a more relaxed version of JSON. It uses Ragel.

huydotnet · on June 29, 2023

Still not a reasonable way if you're expecting a structured data in the response, like JSON or something that you're required to parse before showing to the user.

jelled · on June 30, 2023

On one of my apps I ask for a plain text response first and then make a second call to parse the original response to json.

m3kw9 · on June 29, 2023

LLM latency is a huge no go for most apps except for chat apps. I’ve try to build apps based on OpenAI and that itself creates a bad experience no matter how much elevator music/mirrors/spinners you place. Then you need proper error correction when dealing with structured responses/occasional hallucinations

PUSH_AX · on June 30, 2023

Latency is acceptable when the value of the result outweighs the time you believe to be excess.

If AI can build me a useful marketing plan in 15 mins with multiple agents doing work this seems fine to me. It’s going to take much longer to get a human involved.

m3kw9 · on June 30, 2023

You are assuming it gets you the result first time, it rarely does and you always need many back and forths. It’s still ok in your case if the results can kick start something, but you’d be nuts to assume you can use that all the way

PUSH_AX · on June 30, 2023

Ultimately the same concept applies in my opinion, ROI for both effort and time to be clear. Even today it's in a good place for a lot of things, and will only get better.

daviding · on June 29, 2023

This is an interesting article, and a bit of a mish mash of UI conventions, application use ideas for GPT and actual patterns for LLMs. I really do miss Martin Fowler's actual take on these things, but using his name as some sort of gestalt brain for Thoughtworks works too.

It still feels like a bit of a Wild West for patterns in this area as yet, with a lot of people trying lots of things and it might be too soon for defining terms. A useful resource is still things like the OpenAI Cookbook, that is a decent collection of a lot of the things in this article but with a more implementation bent.[1]

The area that seems to get a lot of idea duplication currently is in providing either a 'session' or a longer term context for GPT, be it with embeddings or rolling prompts for these apps. The use of vector search and embedded chunks is something that seems to be missing so far from vendors like OpenAI, and you can't help but wonder that they'll move it behind their API eventually with a 'session id' in the end. I think that was mentioned as on their roadmap for this year too. The lack of GPT-4 fine tuning options just seems to push people more to look at the Pinecone, Weaviates etc stores and chaining up their own sequences to achieve some sort of memory.

I've implemented features with GPT-4 and functions and so far it's feeling useful for 'data model' like use (where you're bringing json into the prompt about a domain noun, e.g. 'Tasks') but is pretty hairy when it comes to pure functions - the tuning they've done to get it to pick which function and which parameters to use is still hard going to get right, which means there doesn't feel like a lot of trust that it is going to be usable. It's like there needs to be a set of patterns or categories for 'business apps' that are heavily siloed into just a subset of available functions it can work with, making it more task-specific rather than as a general chat agent we see a lot of. The difference in approach between LangChain's Chain of Thought pattern and just using OpenAI functions is sort of up in the air as well. Like I said, it still all feels like we're in wild west times, at least as an app developer.

[1] https://github.com/openai/openai-cookbook

ignoramous · on June 29, 2023

> A useful resource is still things like the OpenAI Cookbook, that is a decent collection of a lot of the things in this article

By far, the best resource I've found is the Prompt Engineering Guide: https://www.promptingguide.ai/

> you can't help but wonder that they'll move it behind their API eventually with a 'session id' in the end

For in-context learning, I think it is fair to expect 100k to 500k context windows sooner. OpenAI is already at 32k.

daviding · on June 29, 2023

> By far, the best resource I've found is the Prompt Engineering Guide: https://www.promptingguide.ai/

Agreed, that is a good resource for sure. For tooling I like https://promptmetheus.com/ but any pun name gets bonus points from me.

> For in-context learning, I think it is fair to expect 100k to 500k context windows sooner. OpenAI is already at 32k.

It has been interesting to see that window increase so quickly. For LLM context the biggest thing is the pay-per-token constraint if you don't run your own, so have to wonder if that is what will be around in the future given how this is trending? Just in terms of idempotent calls, throwing everything in context up every time seems like it makes it likely that OpenAI will encroach on the stores side as well and do sessions?

mark_l_watson · on June 29, 2023

It is interesting to see the context window size increasing. I think the time complexity on window size is quadratic - ouch!

Der_Einzige · on June 30, 2023

Can we stop calling it "in-context learning" and call it what it is, zero-shot/one-shot/few-shot prompting instead?

Learning implies that the underlying weights of the LLM changed. They didn't.

ignoramous · on June 30, 2023

It may be wrongly used, but for better or for worse, few-shot prompting is synonymous with in-context "learning" (inference?): https://www.promptingguide.ai/techniques/fewshot / https://archive.is/D4cIW

sgt101 · on June 29, 2023

I find the whole idea of adding text into text to drive a outcome pretty worrying if I have to rely on the output.

If the probability of the model spitting out something bad is 0.01% will my testing find it? Probably not.. but my users certainly will.

phillipcarter · on June 29, 2023

Well, it's a tool for ideation, not a strategy emitter. You don't rely on the output, you rely on the people who finalize and commit to a strategy.

sgt101 · on June 29, 2023

Yeah - for an application like this I get it. But no one is getting rich or shifting the dial on scientific progress with this sort of thing.

mark_l_watson · on June 29, 2023

Ideation is a prime application. At least once a week, I work out an idea for something with a long “conversation” chain staring with “I want to design a system that…” and asking for more detail, suggesting resources, etc. that said, being able to do this is not life changing for me. It just saves some time.

mercurialsolo · on June 30, 2023

Retries, answer verification systems, formatting and parsing. Reasoning and validation of reasoning with natural language is a key challenge.

Closed feedback loops to improve the quality of reasoning for LLM's requires human-in-the-loop tools.

Prompting methods are often a hit and miss as the same prompts can lead to variable quality outputs.

sandGorgon · on June 30, 2023

One of the things that strongly resonate with me is the "text templating" part. I faced the same thing when i was building my AI application. Most SDK/frameworks for writing generative AI model this as libraries (or even React UI widgets).

I think that is wrong - it is a config management problem. Think Prompts X Chains X LLMs. Your prompts wont work across everything and everything will break on model change. Coding this into ur classes is what everyone does.

Instead we pull out the prompts X chains as jsonnet code. Call it trauma & learnings from the K8s/Borg world. We have formats that have evolved as a result of millions of lines of code wrangling clusters/terraform/etc - so we decided to build a SDK over it.

that is what we did here - https://github.com/arakoodev/EdgeChains/releases/tag/0.2.0

EdgeChains is basically Generative AI prompt engineering modeled as config management. Funnily people organically build this out over 6 months of engineering AI applications. That's what Martin did with templating this as text !!

selalipop · on June 29, 2023

I worked on something very much in this vein (notionsmith.ai) and feel like I should do a write up after reading this!

I think a lot of people are learning these lessons in isolation, I do wish there was a centralized place where people working on UX-focused LLM based apps were exchanging lessons

tinco · on June 29, 2023

I think a lot of us are working heads down in isolation because we don't have a shareworthy project yet. In a week or two I think my system will be fancy enough to write a blog post about and maybe make open source.

HN has been a pretty good source of exchanging knowledge so far, every couple days or so there's a write up like this that has some new tidbits or confirmations of ideas. If everyone keeps doing that we're doing great in my opinion. Looking forward to seeing your write up on here!

ignoramous · on June 29, 2023

Things on the LLM front for utility apps are fairly nascent and by OpenAI's own admission, the current limitations are fleeting, as in, as a developer, you will soon not need the workarounds used today.

Multi-modal models are going to change things even further.