There's no run to any OCR, first step or not. And you have no idea what you're t...

llm_nerd · on Feb 22, 2024

You understand that OCR is the process of extracting text from images, right? You know, the thing Gemini does, and which they reference repeatedly in their paper. I have absolutely no idea why you keep making some bizarre distinction about it being a "separate process".

Okay, it's been fun talking to you, but feel free to have the last word. Good luck.

og_kalu · on Feb 22, 2024

The transformer (Gemini) predicts text with image and text in the context window. That's it.

OCR, object detection, etc. all come from the transformer predicting text. Read the Flamingo paper.