For a task like that, I'd recommend LLaVA instead. It's still inaccurate, but it's a great deal more accurate than the other options I've tried. It also works with llama.cpp.
LLaVA is a multimodal language model that you ask questions about an image. If you don't provide a question, the default prompt is "Describe this picture in detail", but if you have a concrete question you're likely to get better results. You can also ask for a specific output format, which often works.
(Make sure to use --temp 0.1; the default is far too high.)
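To give a rough idea, an invocation looks something like the following. Treat it as a sketch rather than copy-paste: the binary has been renamed between llama.cpp versions (llava-cli vs. llama-llava-cli), and the model and mmproj file names here are just examples of whatever GGUF files you downloaded.

    # hypothetical file names; you need both the LLaVA GGUF and its matching mmproj file
    ./llava-cli \
      -m models/llava-v1.5-7b.Q4_K_M.gguf \
      --mmproj models/mmproj-model-f16.gguf \
      --image holiday-photo.jpg \
      --temp 0.1 \
      -p "List every object visible in this picture, one per line."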
It runs very slowly on CPU, but will eventually give you an answer. If you have more than about four or five pictures to caption, you probably want to offload as many layers as possible to the GPU. On NVIDIA cards this requires building llama.cpp with CUDA support; on an M1/M2 the Metal backend is built by default, but offloading still has to be turned on at runtime (-ngl 9999).
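Roughly, and with the caveat that the CUDA build flag has changed names across llama.cpp versions (LLAMA_CUBLAS earlier, LLAMA_CUDA / GGML_CUDA later, so check the README of your checkout):

    # NVIDIA: rebuild with CUDA enabled first
    make LLAMA_CUBLAS=1

    # then offload all layers to the GPU with -ngl
    ./llava-cli -m models/llava-v1.5-7b.Q4_K_M.gguf --mmproj models/mmproj-model-f16.gguf \
      --image holiday-photo.jpg --temp 0.1 -ngl 9999

    # Apple Silicon: Metal is compiled in by default, so only the -ngl flag is needed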