The PDF parser is a rule-based parser that uses text coordinates (bounding boxes), graphics, and font data. It works off the text layer and also offers an OCR option that automatically applies OCR when your PDFs contain scanned pages. The OCR feature is based on a modified version of Apache Tika, which uses Tesseract underneath.
The PDF parser offers the following features:
* Sections and subsections along with their levels.
* Paragraphs, formed by combining lines.
* Links between sections and paragraphs.
* Tables along with the section the tables are found in.
* Lists and nested lists.
* Joining of content spread across pages.
* Removal of repeating headers and footers.
* Watermark removal.
* OCR with bounding boxes.