I'm running a large scale object detection/classification and ocr pipeline at the moment, figuring out the properties of all doorbells, mailboxes and house number signs in an european country (don't ask lmao).
This article resonates a lot, we have OCR and "semantic" pipeline steps using a VLM, and while it works very well most of the time, there are absurdly weird edge cases. Structuring the outputs via tool calls helps a little in reducing these, but still, it's clear that there is little reasoning and a lot of memorizing going on.
This article resonates a lot, we have OCR and "semantic" pipeline steps using a VLM, and while it works very well most of the time, there are absurdly weird edge cases. Structuring the outputs via tool calls helps a little in reducing these, but still, it's clear that there is little reasoning and a lot of memorizing going on.