Extract Structured content from structured PDF files

Structured and Unstructured PDF files

PDF files can be created as structured files, which contain information on the page structure, or as unstructured files, which contain no structural information and whose content can be in any order. This is decided when the PDF is created, and it is not possible to convert an unstructured PDF file into a structured one.

PDF files can contain metadata tags that preserve the structure of the textual content (this is an option when the PDF file is created). JPedal provides several methods to extract text content from a PDF file; when these tags are present, it can extract any structured text in the PDF. If the tags are not present, the output file will contain a brief message explaining that no content was available.

An advantage of structured PDF files is that the content can be extracted and transformed into other formats. JPedal currently supports outputting this content as either XML or HTML.

Extract Structured Text from PDF from Command Line or another language

To extract the content as XML

java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractStructuredText "inputFileOrDir" "outputDir" "xml"

To extract the content as HTML

java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractStructuredText "inputFileOrDir" "outputDir" "html"

We recommend using modules from Java 11 onwards. If you are using an older version, you will have to use the classpath instead.

Example to access API methods

 //Choose the output format (XML here, HTML is the alternative)
 ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
 properties.setFileOutputMode(OutputModes.XML);
 //properties.setFileOutputMode(OutputModes.HTML);
 ExtractStructuredText extract = new ExtractStructuredText("C:/pdfs/mypdf.pdf", properties);
 //extract.setPassword("password"); //only needed for password-protected files
 if (extract.openPDFFile()) {
     //Returns whatever structured text the file contains
     Document anyStructuredText = extract.getStructuredTextContent();
 }

 //Release the file once extraction is finished
 extract.closePDFfile();
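
Assuming the `Document` returned by `getStructuredTextContent()` is a standard `org.w3c.dom.Document` (the text above does not state the package), it could be turned into a string with the JDK's own transformer classes. The sketch below uses a small hand-built document as a stand-in for the extractor's output. The class name `StructuredTextDump` and the element names `Document` and `P` are hypothetical and are not taken from JPedal.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class StructuredTextDump {

    //Serialises a DOM Document to an XML string using only JDK classes
    static String toXmlString(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    //Builds a tiny stand-in document; the element names are hypothetical
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("Document");
        Element paragraph = doc.createElement("P");
        paragraph.setTextContent("Hello structured PDF");
        root.appendChild(paragraph);
        doc.appendChild(root);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        //In real use, pass the Document obtained from the extractor instead
        System.out.println(toXmlString(sampleDocument()));
    }
}
```

If the assumption about the return type holds, `toXmlString(anyStructuredText)` could be called inside the `if` block of the example above.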

Extract Structured Text from tagged PDF in Java

//Extracting Structured Text as XML is the default
ExtractStructuredText.writeAllStructuredTextOutlinesToDir("inputFileOrDirectory", "outputDir");

//Extracting Structured Text with more control of the options
final String password = null; //null is used when no password required
final ErrorTracker tracker = null; //ErrorTracker implementations can be used to monitor extraction
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);
//properties.setFileOutputMode(OutputModes.HTML);

ExtractStructuredText.writeAllStructuredTextOutlinesToDir("inputFileOrDirectory", password, "outputDir", tracker, properties);

This example uses the JPedal ExtractStructuredText class. For each input file, ExtractStructuredText outputs an XML file (or an HTML file, if that output mode is selected) detailing the structured content the file contains.
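
Once an XML file has been written, it can be read back with the JDK's DOM parser. The following is a minimal sketch of walking such a file and listing the text held by each element. It parses an inline sample string so that it runs on its own. The class name `StructuredXmlWalker` and the tag names `Document`, `H1` and `P` are hypothetical, and the actual tags depend on how the PDF was tagged.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class StructuredXmlWalker {

    //Adds "tagName: text" for every element that directly holds non-blank text
    static List<String> collectText(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.add(node.getNodeName() + ": " + text);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                collectText(child, out); //descend into nested structure
            }
        }
        return out;
    }

    //Parses XML held in a string; a real file could be parsed with parse(File) instead
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        //Hypothetical sample standing in for an extracted file
        String sample = "<Document><H1>Title</H1><P>First paragraph</P></Document>";
        Document doc = parse(sample);
        for (String line : collectText(doc.getDocumentElement(), new ArrayList<>())) {
            System.out.println(line);
        }
    }
}
```

To process a real output file, `DocumentBuilder.parse(java.io.File)` could replace the string-based `parse` helper.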

Coordinates Used

The extraction methods all extract PDF text in a given rectangle. The coordinates of this rectangle must be given as x1, y1 (top left-hand corner) and x2, y2 (bottom right-hand corner). The page origin is the bottom left-hand corner, unlike Java, where the origin is the top left.
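
Because the two origins differ, a rectangle selected in Java screen coordinates needs its y values flipped before use. The sketch below shows the arithmetic, assuming an unrotated page whose coordinates start at 0,0 and whose height is known in the same units. The class `PdfRectangle` and the method `toPdfCoords` are hypothetical helpers and are not part of JPedal.

```java
public class PdfRectangle {

    //Converts a Java-style rectangle (origin top left, y increasing downwards)
    //into x1, y1 (top left) and x2, y2 (bottom right) with the origin at the bottom left
    static int[] toPdfCoords(int x, int y, int width, int height, int pageHeight) {
        int x1 = x;
        int y1 = pageHeight - y;            //top edge measured up from the page bottom
        int x2 = x + width;
        int y2 = pageHeight - (y + height); //bottom edge measured up from the page bottom
        return new int[] {x1, y1, x2, y2};
    }

    public static void main(String[] args) {
        //A 200x50 box placed 50 across and 100 down on a page 842 units high
        int[] r = toPdfCoords(50, 100, 200, 50, 842);
        System.out.println(r[0] + ", " + r[1] + ", " + r[2] + ", " + r[3]); //prints 50, 742, 250, 692
    }
}
```

Note that in the converted values y1 is larger than y2, since the top of the rectangle is further from the bottom-left origin than its bottom edge.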