Hacker News new | past | comments | ask | show | jobs | submit login

Parsing data out of arbitrary PDFs is a cursed mission. PDF can contain images, so you might as well target JPEG directly.

OCR can take you pretty far depending on expectations, but it's never quite far enough in my experience.




That's been our experience as well. Just scrapping any of the metadata associated with the PDF and treating it like an image. Since you never know when a document has a screenshot of an excel table inside.

The .NORM files (https://xkcd.com/2116)




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: