
Extract Structured Content From Structured PDF Files

Structured and Unstructured PDF Files

A PDF file can be structured (it contains information on the page structure) or unstructured (it contains no structural information, and the content can be in any order). This is decided when the PDF is created, so it is not possible to convert unstructured PDF files into structured PDF files.

PDF files can contain metadata tags that preserve the structure of the textual content (this is an option when the PDF file is created). If these tags are present, JPedal provides several methods to extract the structured text from the PDF file. If they are not present, the output file will contain a brief message explaining that no content was available.

An advantage of structured PDF files is that the content can be extracted and transformed into other formats. JPedal currently supports outputting this content as HTML, JSON, or XML.

PDF files may also include images in the structured content; these are called figures. If the figures have alt text or actual text, it will be present in the output as well.

Extract Structured Text From PDF From the Command Line or Another Language

To extract the content as HTML

java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractStructuredText "inputFileOrFolder" "outputFolder" "html"

To extract the content as JSON

java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractStructuredText "inputFileOrFolder" "outputFolder" "json"

To extract the content as XML

java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractStructuredText "inputFileOrFolder" "outputFolder" "xml"

To extract the content, including figures

java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractStructuredText "inputFileOrFolder" "outputFolder" "xml" "figuresFolder" "jpeg"

We recommend modules, but you can still use the classpath if you want to.
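When the example is launched from another program, the same arguments can be assembled as a list and handed to a process launcher. The following is a minimal sketch in Java, assuming the JPedal jar sits in the working directory (as `--module-path .` implies). The class name `StructuredTextCommand` and the helper `buildCommand` are hypothetical and not part of JPedal.

```java
import java.util.ArrayList;
import java.util.List;

public class StructuredTextCommand {

    // Builds the argument list for the command shown above.
    // figuresFolder and imageFormat may be null when figures are not needed.
    static List<String> buildCommand(String input, String outputFolder, String format,
                                     String figuresFolder, String imageFormat) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("--module-path");
        cmd.add(".");
        cmd.add("--add-modules");
        cmd.add("com.idrsolutions.jpedal");
        cmd.add("org/jpedal/examples/text/ExtractStructuredText");
        cmd.add(input);
        cmd.add(outputFolder);
        cmd.add(format);
        if (figuresFolder != null && imageFormat != null) {
            cmd.add(figuresFolder);
            cmd.add(imageFormat);
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("inputFileOrFolder", "outputFolder", "json", null, null);
        // Print the command; to run it instead, pass the list to
        // new ProcessBuilder(cmd).inheritIO().start()
        System.out.println(String.join(" ", cmd));
    }
}
```

Passing the arguments as a list avoids shell quoting problems when file or folder names contain spaces.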

Example to Access API Methods

ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);
//properties.setFileOutputMode(OutputModes.JSON);
//properties.setFileOutputMode(OutputModes.HTML);
ExtractStructuredText extract = new ExtractStructuredText("inputFile.pdf", properties);
//extract.setPassword("password");
if (extract.openPDFFile()) {
    // A document containing all the structured content
    Document anyStructuredText = extract.getStructuredTextContent();
     
    // An array of documents containing the structured content for each page 
    Document[] anyStructuredTextPerPage = extract.getStructuredTextContentPerPage();
     
    // These methods also write out the figures (images)
    Document anymoreStructuredText = extract.getStructuredTextContentAndFigures("figuresFolder", "imageFormat");
    Document[] anymoreStructuredTextPerPage = extract.getStructuredTextContentAndFiguresPerPage("figuresFolder", "imageFormat");
}

extract.closePDFfile();
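The returned documents can be inspected or serialized like any other DOM tree. The sketch below assumes that the `Document` type in the example above is `org.w3c.dom.Document`, which the example does not state explicitly. The class `DocumentToString`, its helpers, and the element names in the stand-in document are hypothetical; only standard JDK XML APIs are used.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DocumentToString {

    // Serializes a DOM Document to an XML string without the XML declaration.
    static String toXmlString(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Stand-in for the document returned by extract.getStructuredTextContent().
    // The element names here are made up for illustration.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("Document");
            Element paragraph = doc.createElement("P");
            paragraph.setTextContent("Hello");
            root.appendChild(paragraph);
            doc.appendChild(root);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXmlString(sampleDocument()));
    }
}
```

In real use, `sampleDocument()` would be replaced by the value returned from the extraction call, provided the type assumption holds.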

Extract Structured Text From Tagged PDF in Java

//XML is the default for extracting structured text
ExtractStructuredText.writeAllStructuredTextOutlinesToDir("inputFileOrFolder", "outputFolder");

//Extracting structured text with more control over the options
final String password = null; //null is used when no password required
final ErrorTracker tracker = null; //ErrorTracker implementations can be used to monitor extraction
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);
//properties.setFileOutputMode(OutputModes.JSON);
//properties.setFileOutputMode(OutputModes.HTML);
        
ExtractStructuredText.
        writeAllStructuredTextOutlinesToDir("inputFileOrFolder", password, "outputFolder", tracker, properties);

ExtractStructuredText.
        writeAllStructuredTextOutlinesAndFiguresToDir("inputFileOrFolder", password, "outputFolder", tracker, properties, "figuresFolder", "imageFormat");

This example uses the JPedal ExtractStructuredText class. By default, ExtractStructuredText outputs an XML file for each input file, detailing the structured content that file contains.

Coordinates Used

The extraction methods all extract PDF text in a given rectangle. The coordinates of this rectangle must be given as x1, y1 (top left-hand corner) and x2, y2 (bottom right-hand corner).

Java uses a top-left coordinate system, whereas PDF uses a bottom-left system, so you may need to vertically flip the output with an AffineTransform. To learn more about page coordinates in PDF files, check out this blog article.
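As a rough illustration of the vertical flip, the sketch below builds an `AffineTransform` that maps a point from PDF space (origin bottom-left, y increasing upwards) to Java 2D space (origin top-left, y increasing downwards). It assumes an unrotated page whose crop box starts at 0,0. The class `FlipPdfCoordinates`, the helper `pdfToJava`, and the page height of 792 points are illustrative choices and not part of JPedal.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

public class FlipPdfCoordinates {

    // Returns a transform that maps y to (pageHeight - y) and leaves x unchanged.
    static AffineTransform pdfToJava(double pageHeight) {
        AffineTransform flip = new AffineTransform();
        flip.translate(0, pageHeight); // shift by the page height
        flip.scale(1, -1);             // invert the y axis
        return flip;
    }

    public static void main(String[] args) {
        double pageHeight = 792; // example value: US Letter height in points
        AffineTransform flip = pdfToJava(pageHeight);

        // A point 72 points below the top edge in PDF space
        Point2D pdfPoint = new Point2D.Double(72, 720);
        Point2D javaPoint = flip.transform(pdfPoint, null);
        System.out.println(javaPoint.getX() + ", " + javaPoint.getY());
    }
}
```

Applying the same transform twice returns the original point, so it can also be used to convert Java 2D coordinates back to PDF space under the same assumptions.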

