> Especially in a chatbot use case with cumulative prompting, which is the best fit for such a large context vs. the default, cheaper 8k window.
Depends on what is going on with the images and how they translate into tokens. I really have no idea, but it could be that 32k tokens (a lot of text) translate to only a few images for few-shot prompting.
The paper doesn't seem to mention image tokenization, but it should be possible to infer something about the token rate by actually using the API and looking at how one is charged.
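If/when image inputs are exposed through the API, one could probe this empirically by comparing the billed prompt tokens for the same message with and without an attached image. A rough sketch of that idea, assuming image inputs land in the chat completions API and usage is reported the same way it is for text; the model name, the image-message format, and the URL are all assumptions:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def prompt_tokens(content) -> int:
    """Send a prompt and return how many input tokens we were billed for."""
    resp = client.chat.completions.create(
        model="gpt-4o",   # assumption: some vision-capable chat model
        messages=[{"role": "user", "content": content}],
        max_tokens=1,     # keep the completion (and its cost) minimal
    )
    return resp.usage.prompt_tokens

# Same text with and without an image; the difference is an empirical
# estimate of how many tokens one image costs.
text_only = prompt_tokens("Describe this image in one word.")
with_image = prompt_tokens([
    {"type": "text", "text": "Describe this image in one word."},
    {"type": "image_url", "image_url": {"url": "https://example.com/test.png"}},
])
print(f"approx. tokens per image: {with_image - text_only}")
```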
Currently, CLIP's largest variant uses patch-14 on 336x336 images, which translates to 577 ViT tokens [(336/14)^2 + 1]. It might end up being fairly token-efficient depending on how it's implemented (the paper doesn't elaborate).
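A back-of-the-envelope sketch of that arithmetic, assuming (speculatively) that each image costs one LLM context token per ViT token; the prompt-overhead figure is made up purely for illustration:

```python
# ViT token count for CLIP ViT-L/14 at 336x336.
# The "+1" is the [CLS] token prepended to the patch sequence.
image_size = 336
patch_size = 14
tokens_per_image = (image_size // patch_size) ** 2 + 1  # 24^2 + 1 = 577

# If each image really consumed that many context tokens, a 32k window
# would still leave room for dozens of images.
context = 32_768
prompt_overhead = 2_000  # assumed budget for instructions and text
max_images = (context - prompt_overhead) // tokens_per_image
print(tokens_per_image, max_images)  # 577, 53
```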