iirc "Interrogate CLIP" is a bit of a misnomer - what it actually does is generate a basic caption with BLIP ("a woman holding a pencil"), then iterate over a set of categories, use CLIP to check whether any items in each category are depicted in the image, and append any hits to the caption.
This means the resulting caption is of the form "[BLIP caption], [category1 item], [category2 item], ...". It's very rudimentary.
To clarify: CLIP can tell you if a text label matches an image. It can't generate a caption by itself.
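Here's a rough sketch of that flow, in case it helps. This is not A1111's actual code: the `interrogate` function, the category names, and the 3-number vectors are all made up for illustration. In the real thing the vectors would be CLIP image/text embeddings and the caption would come from BLIP. I'm also assuming a "hit" just means the best-scoring label in each category.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (how CLIP-style matching is scored)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def interrogate(blip_caption, image_vec, categories, top_n=1):
    """Hypothetical helper: append the best-matching labels per category to the caption."""
    parts = [blip_caption]
    for labels in categories.values():
        # Rank this category's labels by similarity to the image embedding.
        ranked = sorted(labels, key=lambda name: cosine(image_vec, labels[name]), reverse=True)
        parts.extend(ranked[:top_n])
    return ", ".join(parts)

# Toy stand-ins: real embeddings would come from CLIP, the caption from BLIP.
image_vec = [0.9, 0.1, 0.3]
categories = {
    "mediums": {"oil painting": [0.1, 0.9, 0.0], "pencil sketch": [0.8, 0.2, 0.3]},
    "styles": {"photorealism": [0.2, 0.1, 0.9], "line art": [0.9, 0.0, 0.2]},
}

caption = interrogate("a woman holding a pencil", image_vec, categories)
print(caption)  # a woman holding a pencil, pencil sketch, line art
```

Note that CLIP only ever scores labels it's handed; the only free-form text in the result is the BLIP part at the front.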
There are more advanced captioning methods, but I'm not sure if they're exposed in A1111 (I haven't used it in some months).
Thank you for this! I've always been confused about BLIP vs CLIP. That makes a lot of sense and explains the weird noun duplication I sometimes see, like "A woman woman".