Hacker News

iirc "Interrogate CLIP" is a bit of a misnomer. What it actually does is generate a basic caption with BLIP ("a woman holding a pencil"), then iterate over categories, check with CLIP whether any items in those categories are depicted in the image, and concatenate any hits onto the caption.

This means the resulting caption is of the form "[BLIP caption], [category1 item], [category2 item], ...". It's very rudimentary.
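Roughly, the assembly step could look like the sketch below. This is not A1111's actual code: `interrogate`, `caption_fn`, `score_fn`, the `threshold` value, and the category lists are all hypothetical stand-ins I made up for illustration, with fake functions in place of the real BLIP and CLIP models.

```python
from typing import Callable, Dict, List


def interrogate(
    image: object,
    caption_fn: Callable[[object], str],        # stand-in for a BLIP captioner
    score_fn: Callable[[object, str], float],   # stand-in for a CLIP image/text score
    categories: Dict[str, List[str]],
    threshold: float = 0.5,                     # assumed cutoff, not a real default
) -> str:
    """Build '[caption], [category1 item], [category2 item], ...'."""
    parts = [caption_fn(image)]
    for items in categories.values():
        if not items:
            continue
        # Pick the best-matching item in this category.
        best = max(items, key=lambda item: score_fn(image, item))
        # Only append it if the match is strong enough to count as a hit.
        if score_fn(image, best) >= threshold:
            parts.append(best)
    return ", ".join(parts)


# Toy stand-ins so the sketch runs without any model weights.
def fake_caption(image: object) -> str:
    return "a woman holding a pencil"


FAKE_SCORES = {
    "oil painting": 0.1,
    "pencil sketch": 0.8,
    "dramatic lighting": 0.3,
    "soft lighting": 0.7,
}


def fake_score(image: object, text: str) -> float:
    return FAKE_SCORES.get(text, 0.0)


CATEGORIES = {
    "medium": ["oil painting", "pencil sketch"],
    "lighting": ["dramatic lighting", "soft lighting"],
}

result = interrogate(None, fake_caption, fake_score, CATEGORIES)
print(result)  # a woman holding a pencil, pencil sketch, soft lighting
```

Note that the category hits are just tacked on after the caption; nothing checks them against what the caption already says.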

To clarify: CLIP can tell you if a text label matches an image. It can't generate a caption by itself.
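A minimal sketch of what "matching" means here: CLIP embeds the image and each candidate text into the same vector space and compares them by cosine similarity, so all you get back is a score per label you supplied. The vectors below are made-up toy values, not real CLIP embeddings, and `rank_labels` is a hypothetical helper name.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_labels(
    image_embedding: Sequence[float],
    label_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every candidate label against the image, best match first."""
    scored = [
        (label, cosine_similarity(image_embedding, emb))
        for label, emb in label_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-d vectors standing in for the outputs of CLIP's image and text encoders.
image_vec = [0.9, 0.1, 0.0]
label_vecs = {
    "a photo of a pencil": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

ranking = rank_labels(image_vec, label_vecs)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

The candidate labels have to come from somewhere else (a fixed list, or a captioner like BLIP); the scoring step only ranks them.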

There are more advanced captioning methods, but I'm not sure if they're exposed in A1111 (I haven't used it in some months).



Thank you for this! I've always been confused about BLIP vs CLIP. That makes a lot of sense, and it explains the weird noun duplication I sometimes see, things like "A woman woman".



