JPedal provides several methods to extract text content from a PDF file. This example extracts all the text within a specified rectangle.
Extract Text from PDF with the Command-Line or another language
java -cp ./jars/jpedal.jar org/jpedal/examples/text/ExtractTextInRectangle
Extract Text from PDF in Java
ExtractTextInRectangle.writeAllTextToDir("inputFileOrDirectory", "outputDir", -1);
This example uses the JPedal ExtractTextInRectangle class, which outputs one txt file per page, each containing all the text extracted from that page.
When extracting text, it is important to note that text on a PDF page may not be structured. Unstructured documents contain no details about how the content is laid out or in what order it was added to the page. We have come across cases where every letter 'a' on the page is added first, then every letter 'b', and so on.
To handle this, ExtractTextInRectangle attempts to order the extracted content as it appears within the rectangle.
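To illustrate what ordering content by position can involve, here is a minimal, self-contained sketch. It is not JPedal's actual implementation: the GlyphOrdering class, the Glyph type, the orderText method and the lineTolerance parameter are all hypothetical names introduced for this example. It assumes each glyph carries the x and y position of its origin in PDF user space, where y increases up the page.

```java
import java.util.ArrayList;
import java.util.List;

public class GlyphOrdering {

    /** Hypothetical glyph: one character and the position of its origin on the page. */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;

        Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the characters inside the rectangle, top line first, left to right. */
    static String orderText(List<Glyph> glyphs, double x1, double y1,
                            double x2, double y2, double lineTolerance) {
        // Keep only glyphs whose origin lies inside the rectangle.
        List<Glyph> inside = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (g.x >= x1 && g.x <= x2 && g.y >= y1 && g.y <= y2) {
                inside.add(g);
            }
        }
        // y increases up the page, so a larger y means an earlier line.
        inside.sort((a, b) -> Double.compare(b.y, a.y));

        // A drop in y larger than the tolerance starts a new line.
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : inside) {
            if (lines.isEmpty()
                    || lines.get(lines.size() - 1).get(0).y - g.y > lineTolerance) {
                lines.add(new ArrayList<>());
            }
            lines.get(lines.size() - 1).add(g);
        }

        // Within each line, read left to right.
        StringBuilder text = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort((a, b) -> Double.compare(a.x, b.x));
            if (text.length() > 0) {
                text.append('\n');
            }
            for (Glyph g : line) {
                text.append(g.ch);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Glyphs stored letter by letter (all 'a's, then all 'b's), not in reading order.
        List<Glyph> glyphs = List.of(
                new Glyph('a', 20, 700), new Glyph('a', 10, 680),
                new Glyph('b', 10, 700), new Glyph('b', 20, 680),
                new Glyph('c', 500, 700)); // lies outside the rectangle used below
        System.out.println(orderText(glyphs, 0, 600, 100, 800, 5));
    }
}
```

Running the sketch prints "ba" on the first line and "ab" on the second: the glyphs are read back by position instead of by the order in which they were stored, and the 'c' outside the rectangle is dropped. A real extractor also has to deal with rotated text, varying font sizes and word spacing, which this sketch ignores.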