Structured and Unstructured PDF files
PDF files can be structured (they contain information on the page structure) or unstructured (they contain no structural information, and the content can be in any order). Which kind a file is gets decided when the PDF is created; an unstructured PDF file cannot be converted into a structured one afterwards.
How to find out if your file is a structured PDF file
There is an article on our blog with information on how to find out if your PDF file contains structured content.
Extract structured content from structured PDF files
- ExtractStructuredContent – View Javadoc for the API to extract any structured content that is present. Documents without structure return no data.
Extract unstructured content from unstructured PDF files
- ExtractTextInRectangle – View Javadoc for API to extract text from any rectangular area of the PDF page.
Extract words on a page and document outline from any PDF file
- ExtractTextAsWordlist – View Javadoc for the API to generate a list of the words on the PDF page together with their page coordinates.
- ExtractOutline – View Javadoc for API to extract the PDF outline tree from a PDF file (if present) as an XML structure.
The extraction methods all extract PDF text in a given rectangle. The coordinates of this rectangle must be given as x1, y1 (top left-hand corner) and x2, y2 (bottom right-hand corner). The page origin is at the bottom left, which is the opposite of Java, where the origin is at the top left. Because y increases up the page, y1 is larger than y2.
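Because the two origins differ, a rectangle measured Java-style (from the top left, with y growing downwards) needs its y values flipped before it is passed on as x1, y1, x2, y2. The sketch below shows one way to do that arithmetic. The class PdfRectangleCoords and its method fromTopLeftOrigin are hypothetical names used only for illustration and are not part of any library API. It also assumes an unrotated page whose height is known and expressed in the same units as the rectangle.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of any library API. */
final class PdfRectangleCoords {

    /**
     * Converts a rectangle given in Java-style coordinates (origin top left,
     * y grows downwards) into the corners {x1, y1, x2, y2} described above:
     * (x1, y1) is the top left-hand corner and (x2, y2) the bottom right-hand
     * corner, both measured from a bottom-left page origin.
     *
     * Assumption: the page is unrotated and pageHeight uses the same units
     * as the rectangle.
     */
    static int[] fromTopLeftOrigin(int x, int y, int width, int height, int pageHeight) {
        int x1 = x;                          // left edge is unchanged
        int y1 = pageHeight - y;             // top edge, flipped to bottom-left origin
        int x2 = x + width;                  // right edge is unchanged
        int y2 = pageHeight - (y + height);  // bottom edge, flipped
        return new int[] {x1, y1, x2, y2};
    }

    public static void main(String[] args) {
        // Example page height of 842 units (assumed value for the demo).
        int[] corners = fromTopLeftOrigin(50, 100, 200, 30, 842);
        System.out.println(Arrays.toString(corners)); // prints [50, 742, 250, 712]
    }
}
```

In the example, the rectangle starts 100 units below the top of an 842-unit page, so its top edge sits at y1 = 742 and its bottom edge at y2 = 712, which matches the rule that y1 is larger than y2.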