
Tesseract is ok, but I gather that a lot of the good work in the last few years on it has remained closed source within Google.

If you want to do text extraction, look at things like Stroke Width Transform to extract regions of text before passing them to Tesseract.



If you are using OpenCV and Tesseract, have a look at Scene Text Detection in OpenCV 3. It's in the text module of opencv_contrib [0].

There are samples here [1] and here [2] to get you started. The paper is here: [3]

---

[0] http://docs.opencv.org/3.0-beta/modules/text/doc/erfilter.ht...

[1] https://github.com/Itseez/opencv_contrib/blob/master/modules...

[2] https://github.com/Itseez/opencv_contrib/blob/master/modules...

[3] http://cmp.felk.cvut.cz/~neumalu1/neumann-cvpr2012.pdf


Does Google provide any "OCR as a service" API that takes advantage of their closed-source advances? I know something OCR-ish happens when you upload an image-PDF to Google Docs, but I don't remember if there's any way to get the resulting text out.



Is it possible to extract text from pre-formatted documents? Let's say I have a document issued by the government, and I am only interested in the fields that have been filled in. Could I process such a document fully automatically using Tesseract? Maybe some image pre-processing would be needed?


If you'd be interested in using something like this as a paid API/service and have decent volume, feel free to contact me. I'm currently in pre-beta for API rollout of such a thing.

In summary, you need templates that map the field positions -> meaningful keys so that you can get back useful data as JSON/CSV/XML. I have some tools that are still being polished that automate much of the template creation and do a lot of the pre-processing for you.
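To make the template idea concrete, here is a minimal sketch in Python. It is not the service's actual code: the field names, the pixel coordinates, and the shape of the OCR output (a list of word/bounding-box pairs) are all assumptions made up for illustration. Each recognized word is assigned to the template field whose box contains the word's center, and the result is dumped as JSON.

```python
import json

# Hypothetical template: field name -> (left, top, right, bottom) in pixels.
# Real coordinates would be measured on a blank copy of the form.
TEMPLATE = {
    "surname": (100, 200, 400, 240),
    "date_of_birth": (100, 260, 400, 300),
}


def center(box):
    """Return the (x, y) center of a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def contains(region, point):
    """True if the point lies inside the region (edges included)."""
    left, top, right, bottom = region
    x, y = point
    return left <= x <= right and top <= y <= bottom


def extract_fields(template, words):
    """Map OCR words onto template fields.

    `words` is a list of (text, box) pairs. This shape is an assumption;
    adapt it to whatever your OCR engine actually reports.
    """
    fields = {name: [] for name in template}
    # Sort by top edge, then left edge, as a rough reading order.
    for text, box in sorted(words, key=lambda w: (w[1][1], w[1][0])):
        for name, region in template.items():
            if contains(region, center(box)):
                fields[name].append(text)
                break  # a word belongs to at most one field
    return {name: " ".join(parts) for name, parts in fields.items()}


# Made-up OCR output for illustration only.
WORDS = [
    ("Form", (20, 20, 80, 50)),  # pre-printed text outside every field
    ("Doe", (110, 205, 160, 235)),
    ("1980-01-31", (110, 265, 230, 295)),
]

result = extract_fields(TEMPLATE, WORDS)
print(json.dumps(result, indent=2))
```

Words that fall outside every field box (pre-printed labels, headers) are simply dropped, which is the point of having a template. The sketch assumes the scan is already deskewed and aligned to the template; that alignment is where most of the pre-processing effort goes.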

email is my username (at) gmail


"Maybe some image pre-processing would be needed?"

No, a whole lot of pre-processing would be needed. It all depends on the exact layout: if your tolerances are tight, you need much more logic than if you have, let's say, 2 cm of white space around the one sentence you're after.
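A rough way to see why the white space matters is to pad a field's crop box by the tolerance and check whether it runs into a neighbouring field. The sketch below is only an illustration: the 300 DPI scan resolution, the A4 page size in pixels, and both field boxes are assumed values, not taken from any real form.

```python
def cm_to_px(cm, dpi):
    """Convert centimetres to pixels at the given scan resolution."""
    return round(cm / 2.54 * dpi)


def pad_box(box, margin_px, page_size):
    """Grow a (left, top, right, bottom) box by margin_px, clamped to the page."""
    left, top, right, bottom = box
    width, height = page_size
    return (
        max(0, left - margin_px),
        max(0, top - margin_px),
        min(width, right + margin_px),
        min(height, bottom + margin_px),
    )


def overlaps(a, b):
    """True if two (left, top, right, bottom) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


DPI = 300                    # assumed scan resolution
PAGE = (2480, 3508)          # A4 at 300 DPI, in pixels
FIELD = (500, 1000, 1500, 1060)       # hypothetical field of interest
NEIGHBOUR = (500, 1100, 1500, 1160)   # hypothetical field 40 px below it

# With 2 cm of slack the crop swallows the neighbouring field...
loose = pad_box(FIELD, cm_to_px(2.0, DPI), PAGE)
print(loose, overlaps(loose, NEIGHBOUR))

# ...with 1 mm of slack it does not, but the scan must then be aligned
# to within about a millimetre for the crop to still hit the field.
tight = pad_box(FIELD, cm_to_px(0.1, DPI), PAGE)
print(tight, overlaps(tight, NEIGHBOUR))
```

If the padded box collides with a neighbour, a plain crop is no longer enough and you need the extra logic: deskewing, registration against the template, or removal of the form's own lines before OCR.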


(Disclaimer: I work for creatale GmbH)

We open-sourced the library that we use for exactly that purpose: https://github.com/creatale/node-fv



