
> LLMs can consume images,

Not very well, in my experience. Last time I checked, ChatGPT/DALL-E couldn't understand its own output well enough to know that what it had drawn was incorrect, nor could it correct mistakes that were pointed out to it.

For example, when I asked it to draw an image of a bike with rim brakes, it could not, nor could it "see" what was wrong with the brakes it had drawn. For all intents and purposes it was just remixing the images it had been trained on, without much understanding.



Generating images and consuming images are very different challenges, and most models handle them with entirely different systems (ChatGPT constructs prompts for DALL-E, for example: https://simonwillison.net/2023/Oct/26/add-a-walrus/).
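To make the split concrete, here is a minimal sketch using the OpenAI Python SDK: the image is produced by one model behind the images endpoint, and a separate vision-capable chat model is asked to look at the result. The model names and prompt are illustrative, not taken from the thread.

  from openai import OpenAI

  client = OpenAI()

  # Generation: a dedicated image model behind its own endpoint.
  gen = client.images.generate(model="dall-e-3",
                               prompt="a bicycle with rim brakes", n=1)
  image_url = gen.data[0].url

  # Consumption: a vision-capable chat model inspecting that image.
  resp = client.chat.completions.create(
      model="gpt-4o",
      messages=[{
          "role": "user",
          "content": [
              {"type": "text",
               "text": "Do the brakes in this image look like working rim brakes?"},
              {"type": "image_url", "image_url": {"url": image_url}},
          ],
      }],
  )
  print(resp.choices[0].message.content)

Nothing forces the text model's critique back into the image model; any "self-correction" loop has to be wired up by the application around these two separate calls.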

Evaluating vision LLMs on their ability to improve their own image generation doesn't make sense to me. That's why I enjoy torturing new models with my pelican-on-a-bicycle SVG benchmark instead!
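The benchmark idea is simply to ask the text model itself to "draw" by writing SVG markup, so no separate image-generation system is in the loop. A rough sketch, again assuming the OpenAI Python SDK, with the model name as a stand-in for whatever new model is being tested:

  from openai import OpenAI

  client = OpenAI()

  resp = client.chat.completions.create(
      model="gpt-4o",  # placeholder; the point is to run this against each new model
      messages=[{"role": "user",
                 "content": "Generate an SVG of a pelican riding a bicycle"}],
  )

  # The reply may wrap the SVG in markdown fences that need stripping before viewing.
  with open("pelican.svg", "w") as f:
      f.write(resp.choices[0].message.content)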
