Extract Unstructured text within a rectangle from PDF files

Structured and Unstructured PDF files

A PDF file can be structured (it contains information about the page structure) or unstructured (it contains no structural information, and the content can appear in any order). This is determined when the PDF is created; an unstructured PDF file cannot be converted into a structured one.

JPedal provides several methods to extract text content from a PDF file. In this case, we extract all the text within a specified rectangle on the page.

Extract Text from PDF from the Command Line or another language

java -cp ./jars/jpedal.jar org/jpedal/examples/text/ExtractTextInRectangle "inputFileOrDir" "outputDir"

Example to access API methods with HTML structure for text

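 // Single-argument constructor: text is returned with HTML structure
 ExtractTextInRectangle extract = new ExtractTextInRectangle("C:/pdfs/mypdf.pdf");
 // Uncomment to supply a password for a password-protected file
 //extract.setPassword("password");
 if (extract.openPDFFile()) {
     int pageCount = extract.getPageCount();
     // Pages are numbered from 1
     for (int page = 1; page <= pageCount; page++) {

         // All the text on the page
         String text = extract.getTextOnPage(page);

         // Alternative: only the text inside a rectangle
         // Coordinates are x1,y1 (top left-hand corner), x2,y2 (bottom right)
         //String text = extract.getTextOnPage(page, x1, y1, x2, y2);
     }
 }

 // Close the file when finished
 extract.closePDFfile();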

Example to access API methods with plain text

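 // Passing true as the second argument returns plain text
 ExtractTextInRectangle extract = new ExtractTextInRectangle("C:/pdfs/mypdf.pdf", true);
 // Uncomment to supply a password for a password-protected file
 //extract.setPassword("password");
 if (extract.openPDFFile()) {
     int pageCount = extract.getPageCount();
     // Pages are numbered from 1
     for (int page = 1; page <= pageCount; page++) {

         // All the text on the page
         String text = extract.getTextOnPage(page);

         // Alternative: only the text inside a rectangle
         // Coordinates are x1,y1 (top left-hand corner), x2,y2 (bottom right)
         //String text = extract.getTextOnPage(page, x1, y1, x2, y2);
     }
 }

 // Close the file when finished
 extract.closePDFfile();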

Extract Text from PDF in Java

ExtractTextInRectangle.writeAllTextToDir("inputFileOrDirectory", "outputDir", -1);

This example uses the JPedal ExtractTextInRectangle class. ExtractTextInRectangle writes one txt file per page, each containing all the text extracted from that page.

Extraction Output

When extracting text, it is important to note that the text on a PDF page may not be structured. Unstructured documents contain no details about how the content is laid out or in what order it was added to the page. We have come across cases where every letter ‘a’ on the page is added first, then every letter ‘b’, and so on.

To handle this, ExtractTextInRectangle attempts to put the content into the order in which it appears within the rectangle when it is extracted.
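To illustrate why reordering is needed, here is a minimal sketch of sorting positioned characters into reading order. It is not JPedal's actual algorithm: the `ReadingOrder` and `Glyph` names are hypothetical, and it assumes horizontal left-to-right text in which every character on a line shares exactly the same baseline y value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of the JPedal API.
final class ReadingOrder {

    /** One character with the start of its baseline in PDF coordinates (origin bottom left). */
    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Sorts glyphs top to bottom, then left to right, and joins lines with '\n'. */
    static String toText(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // Larger y is nearer the top of the page, so sort y descending, then x ascending.
        sorted.sort(Comparator.comparingDouble((Glyph g) -> -g.y)
                .thenComparingDouble(g -> g.x));

        StringBuilder text = new StringBuilder();
        Float currentLine = null;
        for (Glyph g : sorted) {
            // Assumption: a change in baseline y means a new line of text.
            if (currentLine != null && g.y != currentLine) {
                text.append('\n');
            }
            currentLine = g.y;
            text.append(g.ch);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Content added as all the 'a' characters first, then all the 'b' characters.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph('a', 0f, 700f), new Glyph('a', 20f, 700f), new Glyph('a', 10f, 680f),
                new Glyph('b', 10f, 700f), new Glyph('b', 30f, 700f), new Glyph('b', 0f, 680f));
        System.out.println(toText(glyphs)); // prints "abab" and then "ba" on the next line
    }
}
```

Real pages need more care than this (baselines that differ slightly within a line, word spacing, columns), which is why the extraction can only attempt to recover the visible order.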

Text Orientation

If a page (or selected region) contains text in more than one orientation (HORIZONTAL_LEFT_TO_RIGHT, HORIZONTAL_RIGHT_TO_LEFT, VERTICAL_TOP_TO_BOTTOM, VERTICAL_BOTTOM_TO_TOP), only the text in the most common orientation is extracted; text in any other orientation is ignored.
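The effect of this rule can be sketched as follows. This is an illustration under stated assumptions, not JPedal's implementation: the `DominantOrientation` and `Fragment` types are hypothetical, "most common" is counted per text fragment (the library may count differently), and a tie is resolved in favour of whichever orientation comes first in the enum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the JPedal API.
final class DominantOrientation {

    enum Orientation {
        HORIZONTAL_LEFT_TO_RIGHT, HORIZONTAL_RIGHT_TO_LEFT,
        VERTICAL_TOP_TO_BOTTOM, VERTICAL_BOTTOM_TO_TOP
    }

    /** A run of text together with the direction it is written in. */
    static final class Fragment {
        final String text;
        final Orientation orientation;

        Fragment(String text, Orientation orientation) {
            this.text = text;
            this.orientation = orientation;
        }
    }

    /** Returns the orientation used by the most fragments, or null if there are none. */
    static Orientation mostCommon(List<Fragment> fragments) {
        Map<Orientation, Integer> counts = new EnumMap<>(Orientation.class);
        for (Fragment f : fragments) {
            counts.merge(f.orientation, 1, Integer::sum);
        }
        Orientation best = null;
        int bestCount = 0;
        // EnumMap iterates in enum order, so ties go to the earlier constant.
        for (Map.Entry<Orientation, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Keeps only the text written in the most common orientation. */
    static List<String> keepMostCommon(List<Fragment> fragments) {
        Orientation keep = mostCommon(fragments);
        List<String> kept = new ArrayList<>();
        for (Fragment f : fragments) {
            if (f.orientation == keep) {
                kept.add(f.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = Arrays.asList(
                new Fragment("Title", Orientation.HORIZONTAL_LEFT_TO_RIGHT),
                new Fragment("Margin note", Orientation.VERTICAL_BOTTOM_TO_TOP),
                new Fragment("Body", Orientation.HORIZONTAL_LEFT_TO_RIGHT));
        System.out.println(keepMostCommon(fragments)); // prints [Title, Body]
    }
}
```

In practice this means that rotated text, such as a vertical margin note, is left out of the result unless the selected rectangle is narrowed so that the rotated text becomes the most common orientation within it.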

Coordinates Used

The extraction methods all extract PDF text from within a given rectangle. The rectangle is specified as x1, y1 (top left-hand corner) and x2, y2 (bottom right-hand corner). The page origin is the bottom left, the opposite of Java, where the origin is the top left, so y1 is larger than y2.
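If a rectangle is known in Java-style coordinates (origin top left, y increasing downwards), it has to be flipped vertically before it is passed to the extraction methods. The helper below is a minimal sketch of that arithmetic. The `PdfRectangle` class is hypothetical and not part of JPedal, and it assumes an unrotated page whose height is known, whose origin sits at exactly 0,0, and whose coordinates are whole numbers in the same units as the rectangle.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the JPedal API.
final class PdfRectangle {

    /**
     * Converts a rectangle given with a top-left origin (x, y, width, height)
     * into {x1, y1, x2, y2} with a bottom-left origin, where x1,y1 is the
     * top left-hand corner and x2,y2 is the bottom right-hand corner.
     */
    static int[] fromTopLeftOrigin(int x, int y, int width, int height, int pageHeight) {
        int x1 = x;
        int y1 = pageHeight - y;             // top edge, measured up from the bottom of the page
        int x2 = x + width;
        int y2 = pageHeight - (y + height);  // bottom edge, measured up from the bottom of the page
        return new int[] {x1, y1, x2, y2};
    }

    public static void main(String[] args) {
        // A 200 x 100 box placed 50 units from the left and 72 units from the top
        // of a page that is 792 units high (the height of US Letter in points).
        int[] r = fromTopLeftOrigin(50, 72, 200, 100, 792);
        System.out.println(Arrays.toString(r)); // prints [50, 720, 250, 620]
    }
}
```

The four values would then be supplied as x1, y1, x2, y2 in the rectangle form of getTextOnPage shown earlier; whether that method expects int or another numeric type should be checked against the JPedal Javadoc.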