https://code.google.com/p/tesseract-ocr/ is pretty good

yoda_sl · on June 25, 2015

Tesseract is quite good and can produce really good results if you train it. I just wish there were a more user-friendly way to generate the training file; for the project where I used it, I ended up paying a freelancer with strong knowledge of Tesseract training. I'm really happy with the results in my iOS app (a universal app).

WalterGR · on June 25, 2015

Tesseract does no layout analysis.

So if the source image contains text columns or pull quotes or similar, the output text will just be each row of text, from the far left to the far right.
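
To make that failure mode concrete, here is a minimal Python sketch. It does not call Tesseract; the word positions, `naive_row_order`, and `column_order` are all made up for illustration. It only shows how reading strictly row by row interleaves two columns, compared with reading one column at a time.

```python
# Hypothetical word boxes for a two-column page: (x, y, text).
words = [
    (0, 0, "Left-1"), (100, 0, "Right-1"),
    (0, 10, "Left-2"), (100, 10, "Right-2"),
]

def naive_row_order(words):
    """Read top to bottom, then left to right, ignoring columns."""
    return [text for x, y, text in sorted(words, key=lambda w: (w[1], w[0]))]

def column_order(words, column_split_x):
    """Read the whole left column first, then the whole right column."""
    def by_row(ws):
        return [text for _, _, text in sorted(ws, key=lambda w: w[1])]
    left = [w for w in words if w[0] < column_split_x]
    right = [w for w in words if w[0] >= column_split_x]
    return by_row(left) + by_row(right)

print(naive_row_order(words))   # columns interleaved
print(column_order(words, 50))  # columns kept together
```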

jahewson · on June 25, 2015

Take a look at https://github.com/tmbdev/ocropy for layout analysis (at one point this project was called OCRopus).

NeutronBoy · on June 25, 2015

Could you add another layer on top (e.g., image processing) to detect the boundaries of blocks of content, and send each block to Tesseract in sequence?

kefka · on June 25, 2015

Yeah, you could do that. And I wouldn't think it would be that hard either.

I'd make a cascade that detects all letters and numbers from major font sets. That shouldn't be too terribly difficult.

Next, use the cascade to scan the document and convert it into a list of all detected characters (we don't actually care what the characters are, only where they are).

Once you have this, fit bounding boxes around the clusters of detected characters. You'll have to decide how far apart characters can be before they end up in separate boxes.

Now what you should end up with are a few boxes indicating the regions of data on the document. Now, crop each of these regions of interest and feed them into Tesseract.
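
The box-merging step could be sketched roughly like this in Python. It assumes the character detector has already produced boxes as `(x0, y0, x1, y1)` tuples; the function names and the `max_gap` threshold are hypothetical, and this simple repeated-merge approach is just one way to do it, not necessarily what was meant above.

```python
def boxes_close(a, b, max_gap):
    """True if boxes a and b are within max_gap pixels on both axes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    gap_x = max(bx0 - ax1, ax0 - bx1, 0)  # 0 when they overlap horizontally
    gap_y = max(by0 - ay1, ay0 - by1, 0)  # 0 when they overlap vertically
    return gap_x <= max_gap and gap_y <= max_gap

def union(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_boxes(boxes, max_gap):
    """Repeatedly merge nearby boxes until no two regions are close."""
    regions = list(boxes)
    merged = True
    while merged:
        merged = False
        result = []
        for box in regions:
            for i, region in enumerate(result):
                if boxes_close(box, region, max_gap):
                    result[i] = union(box, region)
                    merged = True
                    break
            else:
                result.append(box)
        regions = result
    return regions

# Hypothetical character boxes: three on the left, two far to the right.
chars = [(0, 0, 10, 10), (12, 0, 22, 10), (24, 0, 34, 10),
         (200, 0, 210, 10), (212, 0, 222, 10)]
regions = merge_boxes(chars, max_gap=5)
print(regions)
```

Each resulting region could then be cropped out of the page image and passed to Tesseract on its own, for example with Pillow's `Image.crop` and `pytesseract.image_to_string`, assuming those libraries are what you are using.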