
Extract words on a page from any PDF file

JPedal provides several methods to extract text content from a PDF file. This page shows how to extract individual words and their coordinates from a file.

Extract Words from PDF from Command Line or another language

java --module-path . --add-modules com.idrsolutions.jpedal org/jpedal/examples/text/ExtractTextAsWordlist "inputFileOrFolder" "outputFolder"

We recommend modules, but you can still use the classpath if you want to.

Example to access API methods

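import java.util.List;
import org.jpedal.examples.text.ExtractTextAsWordlist;

ExtractTextAsWordlist extract = new ExtractTextAsWordlist("inputFile.pdf");
// Uncomment to supply a password for an encrypted file
//extract.setPassword("password");
if (extract.openPDFFile()) {
    int pageCount = extract.getPageCount();
    // Page numbers start at 1
    for (int page = 1; page <= pageCount; page++) {
        // Words and their coordinates found on this page
        List wordList = extract.getWordsOnPage(page);
    }
}

extract.closePDFfile();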

Extract Words from PDF in Java

ExtractTextAsWordlist.writeAllWordlistsToDir("inputFileOrFolder", "outputFolder", -1);

This example uses the JPedal ExtractTextAsWordlist class. ExtractTextAsWordlist outputs one txt file per page. Each line of a file is a comma-separated string containing the word followed by the x1, y1, x2, y2 values of its coordinates.
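To read the generated txt files back in, each line has to be split into the word and its four coordinates. The following is a minimal sketch rather than part of JPedal: the WordlistLine class and its parse method are hypothetical names, and it assumes each line has exactly the shape word,x1,y1,x2,y2 with plain decimal numbers. It splits from the right, so a word that itself contains a comma stays intact.

```java
/** Hypothetical helper (not a JPedal class): one parsed line of a wordlist txt file. */
final class WordlistLine {
    final String word;
    final double x1, y1, x2, y2; // left, top, right, bottom

    WordlistLine(String word, double x1, double y1, double x2, double y2) {
        this.word = word;
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    /**
     * Parses a line assumed to look like "word,x1,y1,x2,y2".
     * The last four comma-separated fields are taken as the coordinates;
     * everything before them is treated as the word.
     */
    static WordlistLine parse(String line) {
        String[] coords = new String[4];
        int end = line.length();
        for (int i = 3; i >= 0; i--) {
            int comma = line.lastIndexOf(',', end - 1);
            if (comma < 0) {
                throw new IllegalArgumentException("Expected word,x1,y1,x2,y2 but got: " + line);
            }
            coords[i] = line.substring(comma + 1, end).trim();
            end = comma;
        }
        String word = line.substring(0, end);
        return new WordlistLine(word,
                Double.parseDouble(coords[0]),
                Double.parseDouble(coords[1]),
                Double.parseDouble(coords[2]),
                Double.parseDouble(coords[3]));
    }

    public static void main(String[] args) {
        // Sample line made up for illustration, not real JPedal output
        WordlistLine w = parse("Hello,72.0,720.5,101.3,708.2");
        System.out.println(w.word + " left=" + w.x1 + " top=" + w.y1
                + " right=" + w.x2 + " bottom=" + w.y2);
    }
}
```

If the actual files use a different separator or number format, only the parse method would need adjusting.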

Coordinates Used

The coordinates in the return value are the four values x1, y1, x2, y2, which are the left, top, right and bottom edges of the word on the PDF page. On a PDF page, the origin is in the bottom-left corner of the page, so y values increase up the page.
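Because the origin is at the bottom left, the top value (y1) of a word is larger than its bottom value (y2), which is the opposite of most image and screen coordinate systems. The sketch below shows one way to work with such a rectangle. WordBox is a hypothetical helper, not a JPedal class, and it assumes an unrotated page whose height is given in the same units as the coordinates.

```java
/** Hypothetical helper (not a JPedal class): a word rectangle in PDF page coordinates. */
final class WordBox {
    final double x1, y1, x2, y2; // left, top, right, bottom (origin bottom-left)

    WordBox(double x1, double y1, double x2, double y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    double width() {
        return x2 - x1;
    }

    double height() {
        // Top is numerically larger than bottom when y grows upward
        return y1 - y2;
    }

    /**
     * Converts to a top-left origin where y grows downward.
     * Returns {left, top, right, bottom} in the flipped system.
     * pageHeight is an assumption supplied by the caller.
     */
    double[] toTopLeftOrigin(double pageHeight) {
        return new double[] { x1, pageHeight - y1, x2, pageHeight - y2 };
    }

    public static void main(String[] args) {
        // Made-up values for illustration: a word near the top of a 792-unit-high page
        WordBox box = new WordBox(72, 720, 100, 708);
        double[] flipped = box.toTopLeftOrigin(792);
        System.out.println("width=" + box.width() + " height=" + box.height());
        System.out.println("top in top-left system=" + flipped[1]);
    }
}
```

The flip only changes the y values. The x values are the same in both systems.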


Why JPedal?

  • Actively developed commercial library with full support and no third party dependencies.
  • Process PDF files up to 3x faster than alternative Java PDF libraries.
  • Simple licensing options and source code access for OEM users.
