To add some numbers, on MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get
25 t/s prompt processing
63 t/s token generation
Overall processing time per image is ~15 secs, no matter what size the image is. The small 4B model already produces very decent output, describing different images quite well.
Note: if you are not using -hf, you must include the --mmproj switch, or the web interface reports that multimodal is not supported by the model.
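For the explicit-files case, a launch would need something like the sketch below. The flags (-m, --mmproj, --port) are real llama-server options, but the file names here are assumptions, not taken from the thread:

```shell
# Explicit local files: without -hf you must point --mmproj at the
# vision projector yourself, or the web UI reports that the model
# does not support multimodal input. (File names are assumed.)
llama-server \
  -m gemma-3-4b-it-Q4_K_M.gguf \
  --mmproj mmproj-gemma-3-4b-it-f16.gguf \
  --port 8080
```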
I have used the official ggml-org/gemma-3-4b-it-GGUF quants, I expect the unsloth quants from danielhanchen to be a bit faster.
> This image shows a diverse group of people in various poses, including a man wearing a hat, a woman in a wheelchair, a child with a large head, a man in a suit, and a woman in a hat.
I get the same as well; no matter which image I upload, I get this message:
"This is a humorous meme that uses the phrase "one does not get it" in a mocking way. It's a joke about people getting frustrated when they don’t understand the context of a joke or meme."
I’m having a hard time imagining how failure to see an image would result in such a misleadingly specific wrong output instead of e.g. “nothing” or “it’s nonsense with no significant visual interpretation”. That sounds awful to work with.
It is a 4-bit quant gemma-3-4b-it-Q4_K_M.gguf. I just use "describe" as prompt or "short description" if I want less verbose output.
Since you are a photographer, I ran a picture from your website through gemma 4b, which produced the following:
"A stylish woman stands in the shade of a rustic wooden structure, overlooking a landscape of rolling hills and distant mountains. She is wearing a flowing, patterned maxi dress with a knotted waist and strappy sandals. The overall aesthetic is warm, summery, and evokes a sense of relaxed elegance."
This description is pretty spot on.
The picture I used is from the series L'Officiel.02 (L-officel_lanz_08_1369.jpg) from zamadatix' website.
That said, I'm not as impressed by the description. The structure has some wood, but it's certainly not just wooden; there are distant mountains, but not much in the way of rolling hills to speak of. The dress is flowing, but the waist is not knotted; the more striking note might have been the sleeves.
For 4 GB of model I'm not going to ding it too badly, though. The question about which quant was mainly about the tokens/second angle (a 4-bit quant requires roughly a quarter of the memory bandwidth of the full f16 model) rather than the quality angle. As a note: a larger multimodal model gets all of these points right (e.g. "wooden and stone rustic structure"); they aren't just things I noted myself.
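As a back-of-the-envelope check on the bandwidth point (my own rough numbers, assuming ~4B weights and ~4.5 bits/weight for Q4_K_M, neither of which comes from the thread):

```shell
# f16:    4e9 weights * 2 bytes        ≈ 8 GB read per generated token
# Q4_K_M: 4e9 weights * ~4.5 bits / 8  ≈ 2.25 GB read per generated token
awk 'BEGIN { printf "f16: %.2f GB, Q4_K_M: %.2f GB\n", 4e9*2/1e9, 4e9*4.5/8/1e9 }'
```

Since token generation is largely memory-bandwidth-bound, that ~3.5x smaller read per token is where most of the q4 speedup comes from.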
n.b. the image processing is done by a separate model, which basically has to load the image and generate ~1000 tokens
(source: vision was available in llama.cpp before, but Very Hard to use; I've been maintaining an implementation)
(n.b. it's great work, extremely welcome, and new in the sense that the vision code badly needed a rebase and refactoring after a year or two of each model adding in more stuff)
Steps to reproduce:
Then open http://127.0.0.1:8080/ for the web interface.
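The build-and-launch step isn't spelled out above, but it presumably boils down to something like this sketch. The repo and model names come from the thread; everything else (paths, cmake invocation) is my assumption:

```shell
# Hedged sketch: build llama.cpp and serve the gemma-3-4b quants.
# The -hf flag downloads the model (and its mmproj) from Hugging Face.
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B build
cmake --build build --config Release -j
./build/bin/llama-server -hf ggml-org/gemma-3-4b-it-GGUF --port 8080
```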