Extract Structured Content from Structured PDF Files

Structured and Unstructured PDF files

PDF files can be structured (they contain information on the page structure) or unstructured (they contain no structural information, and the content can be in any order). This is decided when the PDF is created; an unstructured PDF file cannot be converted into a structured PDF file.

PDF files can contain metadata tags that preserve the structure of the textual content (this is an option when the PDF file is created). JPedal provides several methods to extract text content from a PDF file; if these tags are present, the methods described here can extract any structured text in the PDF. If the tags are not present, the output file will contain a brief message explaining that no content was available.

Extract Structured Text from PDF from Command Line or another language

java -cp ./jars/jpedal.jar org/jpedal/examples/text/ExtractStructuredText "inputFileOrDir" "outputDir"

Example to access API methods

ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf");
if (extract.openPDFFile()) {
    // Holds any structured text found in the PDF file
    Document anyStructuredText = extract.getStructuredTextContent();
}
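
Once the structured content is available as a Document, it is often useful to view it as XML text. The following is a minimal sketch, assuming that the returned Document is a standard org.w3c.dom.Document. The class name DomToString and the method toXml are hypothetical helpers and are not part of JPedal; only standard JDK classes are used.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper (not part of JPedal): serialises a DOM Document to a String.
final class DomToString {

    private DomToString() {
    }

    static String toXml(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> header so only the content is returned
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Under that assumption, DomToString.toXml(anyStructuredText) could be called inside the if block above to log or inspect the extracted structure.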


Extract Structured Text from PDF in Java

ExtractStructuredText.writeAllStructuredTextOutlinesToDir("inputFileOrDirectory", "outputDir");

This example uses the JPedal ExtractStructuredText class. ExtractStructuredText outputs one XML file for each PDF file, detailing the structured content that the PDF file contains.
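
The XML output can be read back with the standard Java XML parser. The sketch below lists the element names in an XML document as an indented outline. The layout of the JPedal output is not described here, so the tag names in the usage note are purely illustrative assumptions, and StructureOutline is a hypothetical helper class, not part of JPedal.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of JPedal): lists element names as an indented outline.
final class StructureOutline {

    private StructureOutline() {
    }

    // Parses XML text; for a file on disk, DocumentBuilder.parse(File) can be used instead
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns one line per element, indented two spaces per nesting level
    static List<String> outline(Document doc) {
        List<String> lines = new ArrayList<>();
        walk(doc.getDocumentElement(), 0, lines);
        return lines;
    }

    private static void walk(Node node, int depth, List<String> lines) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // skip text, comments and other non-element nodes
        }
        lines.add("  ".repeat(depth) + node.getNodeName());
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, lines);
        }
    }
}
```

For example, with the made-up input <Document><H1>Title</H1><P>Body</P></Document>, outline returns the three lines "Document", "  H1" and "  P". The sketch needs Java 11 or later because it uses String.repeat.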

Coordinates Used

The extraction methods all extract PDF text in a given rectangle. The coordinates of this rectangle must be given as x1, y1 (top left-hand corner) and x2, y2 (bottom right-hand corner). The page origin is the bottom left-hand corner, so y values increase up the page; this is the opposite of Java, where the origin is the top left-hand corner.
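
Because the two origins differ, a rectangle taken from Java screen coordinates needs its y values flipped before it can be used as x1, y1, x2, y2. The following is a minimal sketch of that arithmetic, assuming an unrotated page whose height is known in the same units and whose visible area starts at 0,0. CoordinateSketch and toPdfRectangle are hypothetical names and are not part of JPedal.

```java
// Hypothetical helper (not part of JPedal): converts a Java-style rectangle
// (origin top left, y increasing downwards) to PDF page coordinates
// (origin bottom left, y increasing upwards).
final class CoordinateSketch {

    private CoordinateSketch() {
    }

    // Returns {x1, y1, x2, y2}: x1, y1 is the top left corner and
    // x2, y2 is the bottom right corner, both in PDF coordinates.
    static int[] toPdfRectangle(int x, int y, int width, int height, int pageHeight) {
        int x1 = x;
        int y1 = pageHeight - y;             // top edge measured from the page bottom
        int x2 = x + width;
        int y2 = pageHeight - (y + height);  // bottom edge measured from the page bottom
        return new int[] {x1, y1, x2, y2};
    }
}
```

For a page 792 units high, a Java rectangle at x=10, y=20 with width 100 and height 50 becomes x1=10, y1=772, x2=110, y2=722. Note that y1 is larger than y2 because the top of the rectangle is further from the page origin than the bottom.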