Extract Structured Content from structured PDF files
Structured and Unstructured PDF files
PDF files can be structured (they contain information on the page structure) or unstructured (they contain no structural information and the content can be in any order). The choice is made when the PDF is created, and it is not possible to convert an unstructured PDF file into a structured PDF file.
PDF files can contain metadata tags that preserve the structure of the textual content in a PDF (this is an option when the PDF file is created). JPedal provides several methods to extract text content from a PDF file. The one described here extracts any structured text present in a PDF. If no structure tags are present, the output file will contain a brief message explaining that no content was available.
An advantage of structured PDF files is that the content can be extracted and transformed into other formats. JPedal currently supports outputting this content as either XML or HTML.
Extract Structured Text from PDF from Command Line or another language
To extract the content as XML
java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractStructuredText "inputFileOrFolder" "outputFolder" "xml"
To extract the content as HTML
java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractStructuredText "inputFileOrFolder" "outputFolder" "html"
We recommend modules, but you can still use the classpath if you want to.
Example to access API methods
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);
//properties.setFileOutputMode(OutputModes.HTML);
ExtractStructuredText extract = new ExtractStructuredText("inputFile.pdf", properties);
//extract.setPassword("password");
if (extract.openPDFFile()) {
Document anyStructuredText = extract.getStructuredTextContent();
}
extract.closePDFfile();
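Once the structured content is held in memory, it can be inspected with the standard Java XML APIs. The following is a minimal sketch, assuming the Document returned by getStructuredTextContent() is an org.w3c.dom.Document. The class StructuredTextDump, its methods, and the element names in the sample XML are hypothetical stand-ins used only so the example runs without a PDF; the real tag names depend on how the PDF was tagged.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class StructuredTextDump {

    // Serialises a DOM Document to an XML string using the JDK's built-in transformer.
    static String toXmlString(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialise document", e);
        }
    }

    // Returns the text of every element in document order, with the tags dropped.
    static String plainText(Document doc) {
        return doc.getDocumentElement().getTextContent();
    }

    // Hypothetical stand-in for extract.getStructuredTextContent();
    // the element names here are assumptions, not JPedal's actual output.
    static Document sampleDocument() {
        String xml = "<Document><H1>Title</H1><P>First paragraph.</P></Document>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        System.out.println(toXmlString(doc));
        System.out.println(plainText(doc));
    }
}
```

In real use, the call to sampleDocument() would be replaced by the Document obtained inside the if (extract.openPDFFile()) block above.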
Extract Structured Text from tagged PDF in Java
//Extracting Structured Text as XML is the default
ExtractStructuredText.writeAllStructuredTextOutlinesToDir("inputFileOrFolder", "outputFolder");
//Extracting Structured Text with more control of the options
final String password = null; //null is used when no password required
final ErrorTracker tracker = null; //ErrorTracker implementations can be used to monitor extraction
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.XML);
//properties.setFileOutputMode(OutputModes.HTML);
ExtractStructuredText.writeAllStructuredTextOutlinesToDir("inputFileOrFolder", password, "outputFolder", tracker, properties);
This example uses the JPedal ExtractStructuredText class. For each input file, ExtractStructuredText writes an output file (XML by default) detailing the structured content that the file contains.
Coordinates Used
The extraction methods all extract PDF text from a given rectangle. The coordinates of this rectangle must be given as x1, y1 (top left-hand corner) and x2, y2 (bottom right-hand corner). The page origin is the bottom left, which is the opposite of Java, where the origin is the top left.
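Because the two origins differ, a rectangle measured in Java's top-left coordinate system needs its y values flipped before it matches the x1, y1, x2, y2 format described above. The following is a minimal sketch of that arithmetic. The class PdfRectangle and the method fromJavaRect are hypothetical helpers, not part of JPedal, and the sketch assumes the values are already in PDF units, the page is not rotated, and the page's lower-left corner is at 0,0.

```java
class PdfRectangle {

    // Converts a rectangle given in Java coordinates (origin top left, y increasing downwards)
    // into {x1, y1, x2, y2} with x1,y1 as the top left corner and x2,y2 as the bottom right
    // corner, measured from a bottom-left page origin.
    static float[] fromJavaRect(float x, float y, float width, float height, float pageHeight) {
        float x1 = x;                           // left edge is unchanged
        float y1 = pageHeight - y;              // top edge, measured up from the page bottom
        float x2 = x + width;                   // right edge
        float y2 = pageHeight - (y + height);   // bottom edge, measured up from the page bottom
        return new float[] {x1, y1, x2, y2};
    }

    public static void main(String[] args) {
        // A 200 x 50 region placed 72 units from the left and 100 units from the top
        // of a page that is 792 units high.
        float[] r = fromJavaRect(72f, 100f, 200f, 50f, 792f);
        System.out.println(r[0] + ", " + r[1] + ", " + r[2] + ", " + r[3]);
    }
}
```

For the sample values, the top edge becomes 792 - 100 = 692 and the bottom edge becomes 792 - 150 = 642, so y1 is larger than y2 in PDF coordinates.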