Extract unstructured text within a rectangle from PDF files
Structured and Unstructured PDF files
PDF files can be structured (containing information on the page structure) or unstructured (containing no structural information, so the content can be in any order). This is decided when the PDF is created, and an unstructured PDF file cannot be converted into a structured one.
JPedal provides several methods to extract text content from a PDF file. In this case, we extract all text within a specified rectangle.
Extract Text from PDF from Command Line or another language
java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractTextInRectangle "inputFileOrFolder" "outputFolder"
We recommend modules, but you can still use the classpath if you want to.
Example to access API methods with XML structure for text
ExtractTextInRectangle extract = new ExtractTextInRectangle("inputFile.pdf");
extract.setOutputFormat(OUTPUT_FORMAT.XML);
// extract.setEstimateParagraphs(true); // Estimate paragraphs in the document
// extract.setPassword("password");     // Only needed for password-protected files
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    for (int page = 1; page <= pageCount; page++) {
        // Extract all text on the page as XML
        String text = extract.getTextOnPage(page);
        // Alternatively, extract only the text inside a rectangle.
        // Coordinates are x1, y1 (top left-hand corner), x2, y2 (bottom right)
        // String text = extract.getTextOnPage(page, x1, y1, x2, y2);
    }
}
extract.closePDFfile();
Example to access API methods with plain text
ExtractTextInRectangle extract = new ExtractTextInRectangle("inputFile.pdf");
extract.setOutputFormat(OUTPUT_FORMAT.TXT);
// extract.setEstimateParagraphs(true); // Estimate paragraphs in the document
// extract.setPassword("password");     // Only needed for password-protected files
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    for (int page = 1; page <= pageCount; page++) {
        // Extract all text on the page as plain text
        String text = extract.getTextOnPage(page);
        // Alternatively, extract only the text inside a rectangle.
        // Coordinates are x1, y1 (top left-hand corner), x2, y2 (bottom right)
        // String text = extract.getTextOnPage(page, x1, y1, x2, y2);
    }
}
extract.closePDFfile();
Extract Text from PDF in Java
//Extracting plain text
ExtractTextInRectangle.writeAllTextToDir("inputFileOrFolder", "password_or_null", "outputFolder", -1, OUTPUT_FORMAT.TXT, false);
//Extracting text as XML
ExtractTextInRectangle.writeAllTextToDir("inputFileOrFolder", "password_or_null", "outputFolder", -1, OUTPUT_FORMAT.XML, false);
This example uses the JPedal ExtractTextInRectangle class. ExtractTextInRectangle outputs a txt file per page, each containing all the text extracted from the page.
Extraction Output
When extracting text, note that the text on a PDF page may not be structured. Unstructured documents contain no details about how the content is laid out or added to the page. We have come across cases where every letter ‘a’ on the page is added first, then every letter ‘b’, and so on.
To handle this, ExtractTextInRectangle attempts to order the content as it appears in the rectangle when it is extracted.
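To illustrate why reordering is needed, here is a minimal sketch that rebuilds reading order from text fragments stored in an arbitrary order. It is not JPedal's algorithm: the Fragment class, its fields, and the rule "sort by baseline from top to bottom, then by x from left to right" are assumptions made for this example, and it only covers horizontal left-to-right text with exactly matching baselines.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only; not part of the JPedal API. */
final class ReadingOrderSketch {

    /** Hypothetical text fragment: its text plus its position in PDF space (origin bottom left). */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments top to bottom (larger y first), then left to right (smaller x first). */
    static String inReadingOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> -f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder out = new StringBuilder();
        for (Fragment f : sorted) {
            out.append(f.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The word "abba" stored as both letters 'a' first, then both letters 'b'
        List<Fragment> stored = new ArrayList<>();
        stored.add(new Fragment("a", 10f, 700f));
        stored.add(new Fragment("a", 40f, 700f));
        stored.add(new Fragment("b", 20f, 700f));
        stored.add(new Fragment("b", 30f, 700f));
        System.out.println(inReadingOrder(stored)); // prints: abba
    }
}
```

A real extractor also has to cope with slightly differing baselines, spacing between words, and other orientations, which this sketch deliberately leaves out.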
Text Orientation
If a page (or selected region) contains text with multiple orientations (HORIZONTAL_LEFT_TO_RIGHT, HORIZONTAL_RIGHT_TO_LEFT, VERTICAL_TOP_TO_BOTTOM, VERTICAL_BOTTOM_TO_TOP), only the text in the most common orientation will be extracted and the other text is ignored.
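The effect of this rule can be sketched as follows. This is a simplified illustration rather than JPedal's implementation: the Orientation enum merely mirrors the four names above, the Fragment class is hypothetical, and counting by fragment (with ties resolved by enum order) is an assumption, since the actual measure of "most common" is not specified here.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; not part of the JPedal API. */
final class OrientationFilterSketch {

    /** Local enum mirroring the four orientation names used in the text. */
    enum Orientation {
        HORIZONTAL_LEFT_TO_RIGHT,
        HORIZONTAL_RIGHT_TO_LEFT,
        VERTICAL_TOP_TO_BOTTOM,
        VERTICAL_BOTTOM_TO_TOP
    }

    /** Hypothetical text fragment tagged with its orientation. */
    static final class Fragment {
        final String text;
        final Orientation orientation;

        Fragment(String text, Orientation orientation) {
            this.text = text;
            this.orientation = orientation;
        }
    }

    /** Keeps only the fragments whose orientation occurs most often (assumed: counted per fragment). */
    static List<String> keepMostCommon(List<Fragment> fragments) {
        Map<Orientation, Integer> counts = new EnumMap<>(Orientation.class);
        for (Fragment f : fragments) {
            counts.merge(f.orientation, 1, Integer::sum);
        }
        Orientation dominant = null;
        int best = 0;
        for (Map.Entry<Orientation, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) { // strict '>' means ties go to the earlier enum constant
                best = e.getValue();
                dominant = e.getKey();
            }
        }
        List<String> kept = new ArrayList<>();
        for (Fragment f : fragments) {
            if (f.orientation == dominant) {
                kept.add(f.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("Title", Orientation.HORIZONTAL_LEFT_TO_RIGHT));
        page.add(new Fragment("Margin note", Orientation.VERTICAL_BOTTOM_TO_TOP));
        page.add(new Fragment("Body", Orientation.HORIZONTAL_LEFT_TO_RIGHT));
        System.out.println(keepMostCommon(page)); // prints: [Title, Body]
    }
}
```

In practice this means that rotated text, such as a vertical margin note, may be missing from the output. Selecting a rectangle that contains only the rotated text is one way to make it the most common orientation in that region.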
Coordinates Used
The extraction methods all extract PDF text in a given rectangle. The rectangle is specified as x1, y1 (top left-hand corner) and x2, y2 (bottom right-hand corner). The page origin is at the bottom left, so y values increase up the page. This is the opposite of Java's graphics coordinate system, where the origin is at the top left.
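If a rectangle comes from Java-style coordinates (origin top left, y increasing downwards), it has to be flipped vertically before use. The helper below is a minimal sketch of that arithmetic. The PdfRectangle class and fromTopLeftOrigin method are hypothetical names, not part of JPedal, and the sketch assumes an unrotated page whose visible area starts at 0,0, with both coordinate systems in the same units.

```java
/** Illustrative helper; not part of the JPedal API. */
final class PdfRectangle {
    final int x1; // left edge
    final int y1; // top edge, measured up from the bottom of the page
    final int x2; // right edge
    final int y2; // bottom edge, measured up from the bottom of the page

    PdfRectangle(int x1, int y1, int x2, int y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    /**
     * Converts a rectangle given as x, y, width, height with a top-left origin
     * into x1, y1 (top left) and x2, y2 (bottom right) with a bottom-left origin.
     */
    static PdfRectangle fromTopLeftOrigin(int x, int y, int width, int height, int pageHeight) {
        int top = pageHeight - y;               // distance of the top edge from the page bottom
        int bottom = pageHeight - (y + height); // distance of the bottom edge from the page bottom
        return new PdfRectangle(x, top, x + width, bottom);
    }

    public static void main(String[] args) {
        // A 200 x 150 region placed 50 from the left and 100 from the top of a 792-high page
        PdfRectangle r = fromTopLeftOrigin(50, 100, 200, 150, 792);
        System.out.println("x1=" + r.x1 + ", y1=" + r.y1 + ", x2=" + r.x2 + ", y2=" + r.y2);
        // prints: x1=50, y1=692, x2=250, y2=542
    }
}
```

Note that y1 is larger than y2 after conversion, because the top of the rectangle is further from the bottom-left origin than its bottom edge. The resulting values are in the x1, y1, x2, y2 order expected by the getTextOnPage(page, x1, y1, x2, y2) call shown in the examples above.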