Extract Structured Content as Markdown
JPedal supports extracting content as MARKDOWN from structured PDF files.
Structured and Unstructured PDF Files
A PDF file may be tagged, contain structured text, or contain marked content. These terms are often used interchangeably; in practice they all refer to PDF files that contain information about the page structure and its elements. In this article we will refer to these as ‘structured’ PDF files.
Structured PDF files contain metadata tags (similar to HTML) to preserve the structure of textual content in a PDF. A PDF is created either structured or unstructured, so it is generally not feasible to convert an unstructured PDF file into a structured one.
An advantage of structured PDF files is that the content can be extracted and transformed into other formats. JPedal currently supports outputting this content as EPUB, HTML, JSON, Markdown, XML, or YAML.
JPedal can extract all structured text present in a PDF file. If there is none present, the output file will contain a brief message explaining that no content was available.
PDF files may also include images in the structured content; these are called figures. If the figures have alt text or actualtext, these will be present in the output as well. If you are using the Java API methods, you may provide a scaling value for the images.
Extract Structured Content as MARKDOWN from a PDF in Java
// Options
ExtractStructuredTextProperties properties = new ExtractStructuredTextProperties();
properties.setFileOutputMode(OutputModes.MARKDOWN); // output format
// Extract structured text as MARKDOWN
ExtractStructuredText.writeAllStructuredTextOutlinesToDir(
"inputFileOrFolder", // a single PDF or a folder containing PDFs
"password", // the password or null if not required
"outputFolder", // the output folder for the MARKDOWN
null, // error callback
properties // our settings object
);
// Extract structured text and images as MARKDOWN
ExtractStructuredText.writeAllStructuredTextOutlinesAndFiguresToDir(
"inputFileOrFolder", // a single PDF or a folder containing PDFs
"password", // the password or null if not required
"outputFolder", // the output folder for the MARKDOWN
null, // error callback
properties, // our settings object
"figuresFolder", // the output folder for the images
"imageFormat", // the format for the images
1.0f // the scaling for the images
);
Extract Structured Content as MARKDOWN from a PDF from the Command Line or Another Language
java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractStructuredText "inputFileOrFolder" "outputFolder" "markdown"
We recommend modules, but you can still use the classpath if you want to.
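If you need to launch the command from another program, one option is to start it as a child process. The following is only a minimal sketch of that idea in Java: the class name `JPedalCommandSketch` and its helper methods are hypothetical and not part of JPedal, and it assumes the JPedal jar is available in the directory passed as the module path. The arguments simply mirror the command shown above.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper, not part of JPedal: assembles and launches the
// command-line invocation shown in this article.
class JPedalCommandSketch {

    // Builds the argument list in the same order as the documented command.
    static List<String> buildCommand(String modulePath, String input,
                                     String outputFolder, String format) {
        return List.of(
                "java",
                "--module-path", modulePath,
                "--add-modules", "com.idrsolutions.jpedal",
                "org/jpedal/examples/text/ExtractStructuredText",
                input,          // a single PDF or a folder containing PDFs
                outputFolder,   // the output folder for the Markdown
                format);        // "markdown" in this article
    }

    // Starts the command as a child process and returns its exit code.
    // Assumes `java` is on the PATH and the JPedal jar is in modulePath.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward the child's output to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        List<String> command =
                buildCommand(".", "inputFileOrFolder", "outputFolder", "markdown");
        // Print the command only; call run(command) to actually execute it.
        System.out.println(String.join(" ", command));
    }
}
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when the input or output path contains spaces.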
Javadocs
This example uses the JPedal ExtractStructuredText class.