
Extract Structured Content as JSON

JPedal supports extracting content as JSON from structured PDF files.

Structured and Unstructured PDF Files

A PDF file may be described as tagged, as containing structured text, or as containing marked content. These terms are often used interchangeably; in practice they all refer to PDF files that contain information about the page structure and its elements. In this article we refer to these as ‘structured’ PDF files.

Structured PDF files contain metadata tags (similar to HTML) that preserve the structure of the textual content. A PDF file is created either structured or unstructured, so it is generally not feasible to convert an unstructured PDF file into a structured one.

An advantage of structured PDF files is that the content can be extracted and transformed into other formats. JPedal currently supports outputting this content as EPUB, HTML, JSON, Markdown, XML, or YAML.

JPedal can extract all structured text present in a PDF file. If there is none present, the output file will contain a brief message explaining that no content was available.

PDF files may also include images in the structured content; these are called figures. If a figure has alt text or ActualText, that text will be present in the output as well. If you are using the Java API methods, you may provide a scaling value for the images.

Extract Structured Content as JSON from a PDF in Java

// Options
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.JSON); // output format        

// Extract structured text as JSON
ExtractStructuredText.writeAllStructuredTextOutlinesToDir(
        "inputFileOrFolder", // a single PDF or a folder containing PDFs
        "password",          // the password or null if not required
        "outputFolder",      // the output folder for the JSON
        null,                // error callback
        properties           // our settings object
);

// Extract structured text and images as JSON
ExtractStructuredText.writeAllStructuredTextOutlinesAndFiguresToDir(
        "inputFileOrFolder", // a single PDF or a folder containing PDFs
        "password",          // the password or null if not required
        "outputFolder",      // the output folder for the JSON
        null,                // error callback
        properties,          // our settings object
        "figuresFolder",     // the output folder for the images
        "imageFormat",       // the format for the images
        1.0f                 // the scaling for the images
);
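
The calls above take the output and figures folders as plain strings. The following is a minimal sketch, assuming you want to make sure those folders exist before handing them to JPedal (this article does not state whether JPedal creates missing folders itself). The class name `OutputFolders` and the helper `ensureFolder` are hypothetical and use only the standard library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class OutputFolders {

    // Hypothetical helper: creates the folder (and any missing parents) if it
    // does not exist yet, then returns its absolute path as a String.
    static String ensureFolder(Path folder) {
        try {
            return Files.createDirectories(folder).toAbsolutePath().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder locations; substitute your own paths.
        Path base = Path.of(System.getProperty("java.io.tmpdir"), "jpedal-demo");
        String outputFolder = ensureFolder(base.resolve("json"));
        String figuresFolder = ensureFolder(base.resolve("figures"));

        // These strings can then be passed as the "outputFolder" and
        // "figuresFolder" arguments in the example above.
        System.out.println(outputFolder);
        System.out.println(figuresFolder);
    }
}
```

Calling `ensureFolder` on a folder that already exists is harmless, so the helper can be run before every extraction.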

Extract Structured Content as JSON from a PDF from the Command Line or Another Language

java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractStructuredText "inputFileOrFolder" "outputFolder" "json"

We recommend modules, but you can still use the classpath if you want to.
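
Any language that can start a process can run the command above. The following is a minimal sketch of how the argument list could be assembled before launching it, shown here in Java with `ProcessBuilder`. The class name `JPedalJsonCommand` and the method `build` are hypothetical; the arguments themselves are copied from the command above.

```java
import java.util.List;

class JPedalJsonCommand {

    // Builds the argument list for the module-path command shown above.
    // Each argument is a separate list element, so no shell quoting is needed.
    static List<String> build(String inputFileOrFolder, String outputFolder) {
        return List.of(
                "java",
                "--module-path", ".",
                "--add-modules", "com.idrsolutions.jpedal",
                "org/jpedal/examples/text/ExtractStructuredText",
                inputFileOrFolder,
                outputFolder,
                "json");
    }

    public static void main(String[] args) {
        List<String> command = build("inputFileOrFolder", "outputFolder");
        System.out.println(String.join(" ", command));

        // To actually launch it (assumes the JPedal jar is in the working
        // directory, as the "--module-path ." argument implies):
        // int exitCode = new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Keeping the arguments in a list rather than a single string avoids problems with spaces in file or folder names.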

Javadocs

This example uses the JPedal ExtractStructuredText class.


Why JPedal?

  • Actively developed commercial library with full support and no third party dependencies.
  • Process PDF files up to 3x faster than alternative Java PDF libraries.
  • Simple licensing options and source code access for OEM users.
