To add some numbers, on MBP M1 64GB with ggml-org/gemma-3-4b-it-GGUF I get
25 t/s prompt processing
63 t/s token generation
Overall processing time per image is ~15 secs, no matter what size the image is. The small 4B model already produces very decent output, describing different images quite well.
Note: if you are not using -hf, you must include the --mmproj switch, or the web interface reports that multimodal is not supported by the model.
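For the explicit-files case, a launch would need something like the sketch below. The flags (-m, --mmproj, --port) are real llama-server options, but the file names here are assumptions, not taken from the thread:

```shell
# Explicit local files: without -hf you must point --mmproj at the
# vision projector yourself, or the web UI reports that the model
# does not support multimodal input. (File names are assumed.)
llama-server \
  -m gemma-3-4b-it-Q4_K_M.gguf \
  --mmproj mmproj-gemma-3-4b-it-f16.gguf \
  --port 8080
```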
I have used the official ggml-org/gemma-3-4b-it-GGUF quants, I expect the unsloth quants from danielhanchen to be a bit faster.
> This image shows a diverse group of people in various poses, including a man wearing a hat, a woman in a wheelchair, a child with a large head, a man in a suit, and a woman in a hat.
I get the same as well; no matter which image I upload, I get this message:
"This is a humorous meme that uses the phrase "one does not get it" in a mocking way. It's a joke about people getting frustrated when they don’t understand the context of a joke or meme."
I’m having a hard time imagining how failure to see an image would result in such a misleadingly specific wrong output instead of e.g. “nothing” or “it’s nonsense with no significant visual interpretation”. That sounds awful to work with.
It is a 4-bit quant gemma-3-4b-it-Q4_K_M.gguf. I just use "describe" as prompt or "short description" if I want less verbose output.
Since you are a photographer, I ran a picture from your website through gemma 4b, which produced the following:
"A stylish woman stands in the shade of a rustic wooden structure, overlooking a landscape of rolling hills and distant mountains. She is wearing a flowing, patterned maxi dress with a knotted waist and strappy sandals. The overall aesthetic is warm, summery, and evokes a sense of relaxed elegance."
This description is pretty spot on.
The picture I used is from the series L'Officiel.02 (L-officel_lanz_08_1369.jpg) from zamadatix' website.
That said, I'm not as impressed by the description. The structure has some wood, but it's certainly not just wooden; there are distant mountains, but not much in the way of rolling hills to speak of. The dress is flowing, but the waist is not knotted; the more striking note might have been the sleeves.
For 4 GB of model I'm not going to ding it too badly, though. The question about which quant was mainly about the tokens/second angle (a 4-bit quant requires roughly a quarter of the memory bandwidth of the full f16 model) rather than the quality angle. As a note: a larger multimodal model gets all of these points right (e.g. "wooden and stone rustic structure"); they aren't just things I noted myself.
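As a back-of-the-envelope check on the bandwidth point (my own rough numbers, assuming ~4B weights and ~4.5 bits/weight for Q4_K_M, neither of which comes from the thread):

```shell
# f16:    4e9 weights * 2 bytes        ≈ 8 GB read per generated token
# Q4_K_M: 4e9 weights * ~4.5 bits / 8  ≈ 2.25 GB read per generated token
awk 'BEGIN { printf "f16: %.2f GB, Q4_K_M: %.2f GB\n", 4e9*2/1e9, 4e9*4.5/8/1e9 }'
```

Since token generation is largely memory-bandwidth-bound, that ~3.5x smaller read per token is where most of the q4 speedup comes from.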
n.b. the image processing is done by a separate model, which basically has to load the image and generate ~1000 tokens
(source: vision was available in llama.cpp before, but Very Hard to use; I've been maintaining an implementation)
(n.b. it's great work, extremely welcome, and new in the sense that the vision code badly needed a rebase and refactoring after a year or two of each model adding in more stuff)
Steps to reproduce:
Then open http://127.0.0.1:8080/ for the web interface.
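The build-and-launch step isn't spelled out above, but it presumably boils down to something like this sketch. The repo and model names come from the thread; everything else (paths, cmake invocation) is my assumption:

```shell
# Hedged sketch: build llama.cpp and serve the gemma-3-4b quants.
# The -hf flag downloads the model (and its mmproj) from Hugging Face.
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B build
cmake --build build --config Release -j
./build/bin/llama-server -hf ggml-org/gemma-3-4b-it-GGUF --port 8080
```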