Does Google provide any "OCR as a service" API that takes advantage of their closed-source advances? I know something OCR-ish happens when you upload an image-PDF to Google Docs, but I don't remember if there's any way to get the resulting text out.
Is it possible to extract text from pre-formatted documents? Let's say I have a form issued by the government, and I am only interested in the fields that have been filled in. Could I process such a document fully automatically using Tesseract? Maybe some image pre-processing would be needed?
If you'd be interested in using something like this as a paid API/service and have decent volume, feel free to contact me. I'm currently in pre-beta for API rollout of such a thing.
In summary, you need templates that map field positions to meaningful keys so that you can get back useful data as JSON/CSV/XML. I have some tools, still being polished, that automate much of the template creation and do a lot of the pre-processing for you.
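To make the template idea concrete, here is a minimal sketch, not the tooling mentioned above. It assumes a page is a plain grid of pixel values and that the OCR step is passed in as a function; `Field`, `crop`, and `extract_fields` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumption: a page is a list of pixel rows. A real pipeline would use
# an image library instead of nested lists.
Page = List[List[int]]
Box = Tuple[int, int, int, int]  # left, top, right, bottom in pixels


@dataclass(frozen=True)
class Field:
    key: str  # meaningful name used in the output
    box: Box  # where the filled-in value sits on the form


def crop(page: Page, box: Box) -> Page:
    """Cut the rectangle described by box out of the page."""
    left, top, right, bottom = box
    return [row[left:right] for row in page[top:bottom]]


def extract_fields(page: Page, template: List[Field],
                   ocr: Callable[[Page], str]) -> Dict[str, str]:
    """Run OCR on each templated region and key the text by field name."""
    return {f.key: ocr(crop(page, f.box)).strip() for f in template}
```

The `ocr` argument could wrap Tesseract in practice. The resulting dict serializes directly with `json.dumps`, and writing it out as CSV or XML is a small extra step.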
"Maybe some image pre-processing would be needed?"
No, a whole lot of pre-processing would be needed. It all depends on the exact layout: if your tolerances are tight, you need much more logic than if you have, let's say, 2 cm of white space around the one sentence you're after.
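One way to read the tolerance point: generous white space lets you pad the crop region to absorb scan misalignment, while a tight layout leaves no room for that. The sketch below only does the arithmetic; `cm_to_px` and `pad_box` are hypothetical helpers, and the 300 DPI scan resolution is an assumption.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom in pixels


def cm_to_px(cm: float, dpi: int) -> int:
    """Convert a physical distance to pixels at a given scan resolution."""
    return round(cm / 2.54 * dpi)


def pad_box(box: Box, margin_px: int, page_size: Tuple[int, int]) -> Box:
    """Grow a field box by margin_px on every side, clamped to the page."""
    left, top, right, bottom = box
    width, height = page_size
    return (max(0, left - margin_px), max(0, top - margin_px),
            min(width, right + margin_px), min(height, bottom + margin_px))


# Assumed 300 DPI scan: 2 cm of white space is roughly 236 pixels of slack.
margin = cm_to_px(2, 300)
```

With that much slack the padded box can be large. When neighbouring fields are only a few pixels apart, the same padding would pull in their text, which is where the extra logic comes in.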
If you want to do text extraction, look at techniques such as the Stroke Width Transform to locate regions of text before passing them to Tesseract.