bob1029 82 days ago | on: Show HN: HTML visualization of a PDF file's intern...

Parsing data out of arbitrary PDFs is a cursed mission. A PDF can contain images, so you might as well target JPEG directly.

OCR can take you pretty far depending on expectations, but it's never quite far enough in my experience.

themanmaran 82 days ago

That's been our experience as well. We just scrap all of the metadata associated with the PDF and treat it like an image, since you never know when a document has a screenshot of an Excel table inside.

The .NORM files (https://xkcd.com/2116)
