Thank you. Even with box layout one cannot even know that there is a coherent w...

throwaway4496 · 2025-08-06T11:44:37 1754480677

What kind of semantics can you infer from text produced by OCRing a bitmap that you can't infer from text extracted directly from the PDF? Is it the lack of OCR mistakes? The hallucinations? Or something else?

dotancohen · 2025-08-06T20:41:30 1754512890

In the cases that I've seen, the PDF software does not generate text strings. It generates individual letters. It is up to any application to try to figure out how those individual letters relate to one another.
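To illustrate what "figuring out how the letters relate" can involve, here is a minimal sketch, assuming each glyph arrives with a position and an advance width. The `Glyph` type, the `glyphs_to_lines` function, and the thresholds `line_tol` and `gap_ratio` are hypothetical names chosen for this example; this is not the algorithm pdftotext or any particular library uses.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline (PDF coordinates: y grows upward)
    width: float  # advance width


def glyphs_to_lines(glyphs, line_tol=2.0, gap_ratio=0.3):
    """Rebuild text lines from independently positioned glyphs.

    Heuristic only: glyphs whose baselines differ by at most `line_tol`
    are treated as one line, and a space is inserted wherever the
    horizontal gap exceeds `gap_ratio` times the previous glyph's width.
    """
    lines = []  # list of (reference_baseline, [glyphs])
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    result = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)  # left-to-right reading order
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                text += " "
            text += cur.char
        result.append(text)
    return result
```

Even this toy version has to guess at two thresholds, and it ignores columns, rotated text, right-to-left scripts, and kerning that looks like a word gap, which is why different extractors can disagree on the same file.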

throwaway4496 · 2025-08-06T23:47:21 1754524041

Did you even read my comment? The "application" is called pdftotext, and instead of putting the individual letters on a bitmap, it puts them in a string literal.

dotancohen · 2025-08-07T11:18:52 1754565532

I do not understand why you insist on being polemic to win an internet argument, when I'm giving you all the tools to win the internet argument by virtue of being correct.

I did read your comment, because my intention here is to learn. I already described how tools such as pdftotext do not produce strings when each letter is positioned independently. I even gave an example a few replies up.