JPedal provides several methods to extract text content from a PDF file. This example extracts individual words and their coordinates from a file.
Extract Words from PDF with the Command-Line or another language
java -cp ./jars/jpedal.jar org/jpedal/examples/text/ExtractTextAsWordlist
Extract Words from PDF in Java
ExtractTextAsWordlist.writeAllWordlistsToDir("inputFileOrDirectory", "outputDir", -1);
This example uses the JPedal ExtractTextAsWordlist class. ExtractTextAsWordlist outputs one txt file per page. Each line of a file is a comma-separated string containing the word followed by the x1, y1, x2, y2 values of its coordinates.
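To work with the output files, each line has to be split back into the word and its four numbers. The sketch below is one way to do that in plain Java. The class name WordlistLine and the sample line are made up for illustration and are not part of JPedal, and it assumes each line is exactly word,x1,y1,x2,y2 as described above. It splits from the right so that a comma inside the word itself does not shift the coordinate fields.

```java
/** Hypothetical helper (not part of JPedal): holds one parsed "word,x1,y1,x2,y2" line. */
final class WordlistLine {
    final String word;
    final double x1, y1, x2, y2;

    WordlistLine(String word, double x1, double y1, double x2, double y2) {
        this.word = word;
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    /** Parses one line, assuming the format word,x1,y1,x2,y2. */
    static WordlistLine parse(String line) {
        String[] coords = new String[4];
        int end = line.length();
        // Peel the four numeric fields off the end of the line, last one first,
        // so that any comma belonging to the word stays with the word.
        for (int i = 3; i >= 0; i--) {
            int comma = line.lastIndexOf(',', end - 1);
            if (comma < 0) {
                throw new IllegalArgumentException("Expected word,x1,y1,x2,y2 but got: " + line);
            }
            coords[i] = line.substring(comma + 1, end).trim();
            end = comma;
        }
        String word = line.substring(0, end);
        return new WordlistLine(
                word,
                Double.parseDouble(coords[0]),
                Double.parseDouble(coords[1]),
                Double.parseDouble(coords[2]),
                Double.parseDouble(coords[3]));
    }

    public static void main(String[] args) {
        // Illustrative input only; real values come from the generated txt files.
        WordlistLine w = parse("Hello,72.0,720.5,101.3,708.2");
        System.out.println(w.word + " spans x " + w.x1 + " to " + w.x2);
    }
}
```

If the real output quotes or escapes words differently, the parse method would need adjusting to match.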
The coordinates in the output are given by the four values x1, y1, x2, y2, which are the left, top, right and bottom values on the PDF page. On a PDF page, the origin is in the bottom-left corner of the page, so y values increase upwards and y1 (top) is larger than y2 (bottom).
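Many image and screen APIs place the origin at the top-left corner instead, so the word boxes usually need flipping before they can be drawn over a rendered page. The sketch below shows the arithmetic. The class name WordBox and the pageHeight parameter are assumptions for illustration, not JPedal API, and it assumes an unrotated page whose bottom-left corner is at (0, 0), with pageHeight in the same units as the coordinates.

```java
/** Hypothetical helper (not part of JPedal): converts a word box to a top-left origin. */
final class WordBox {

    /**
     * Takes x1 (left), y1 (top), x2 (right), y2 (bottom) measured from the
     * bottom-left corner of the page and returns {left, top, width, height}
     * measured from the top-left corner.
     */
    static double[] toTopLeftOrigin(double x1, double y1, double x2, double y2, double pageHeight) {
        double width = x2 - x1;
        double height = y1 - y2;       // y1 is the top edge, so it is the larger value
        double top = pageHeight - y1;  // distance from the top of the page to the top of the word
        return new double[] {x1, top, width, height};
    }

    public static void main(String[] args) {
        // Illustrative values: a word box on a page assumed to be 792 units tall.
        double[] box = toTopLeftOrigin(72, 720, 100, 708, 792);
        System.out.println("left=" + box[0] + " top=" + box[1]
                + " width=" + box[2] + " height=" + box[3]);
    }
}
```

The x values are unchanged by the flip. Only the vertical position depends on the page height.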