
Extract text using Apache Tika

v2023.05

JPedal is compatible with Apache Tika’s Parser interface, meaning you can use it as a drop-in replacement in an existing Tika application.

It currently supports both structured and unstructured text extraction.

Example usage

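import java.io.IOException;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// PDFParser and UNSTRUCTURED_TEXT come from JPedal (this is not Tika's own PDFParser),
// so import them from the JPedal library as well.

try (final TikaInputStream tik = TikaInputStream.get(Paths.get("inputFile.pdf"))) {
    final PDFParser parser = new PDFParser(UNSTRUCTURED_TEXT);

    // A writeLimit of -1 disables the limit; the default stops after 100,000 characters
    final BodyContentHandler handler = new BodyContentHandler(-1);

    final Metadata metadata = new Metadata();
    // Uncomment to supply a password for an encrypted file
    // metadata.set(PDFParser.PASSWORD, "password");

    // parseContext is not required so can be null
    parser.parse(tik, handler, metadata, null);

    // Print the extracted text
    System.out.println(handler);
} catch (final IOException | SAXException | TikaException e) {
    e.printStackTrace();
}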

More information about PDFParser.
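
The example above only prints the extracted text. In an existing Tika application the text is often stored instead, so here is a minimal sketch of writing it to a UTF-8 file next to the PDF. The class `ExtractedTextWriter` and its methods are hypothetical helpers written for this illustration. They are not part of JPedal or Tika, and they use only the Java standard library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of JPedal or Apache Tika.
final class ExtractedTextWriter {

    private ExtractedTextWriter() {
    }

    // Maps e.g. "inputFile.pdf" to "inputFile.txt" in the same directory.
    static Path textPathFor(final Path pdf) {
        final String name = pdf.getFileName().toString();
        final int dot = name.lastIndexOf('.');
        final String base = dot > 0 ? name.substring(0, dot) : name;
        return pdf.resolveSibling(base + ".txt");
    }

    // Writes the extracted text as UTF-8 and returns the path that was written.
    static Path write(final Path pdf, final String text) throws IOException {
        final Path out = textPathFor(pdf);
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
        return out;
    }

    public static void main(final String[] args) throws IOException {
        // The string stands in for handler.toString() from the example above
        final Path dir = Files.createTempDirectory("extracted");
        final Path out = write(dir.resolve("inputFile.pdf"), "Sample extracted text");
        System.out.println("Wrote " + out);
    }
}
```

Under these assumptions, the `System.out.println(handler)` call in the example could be swapped for `ExtractedTextWriter.write(Paths.get("inputFile.pdf"), handler.toString())`, since `BodyContentHandler.toString()` returns the collected text.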


Why JPedal?

  • Actively developed commercial library with full support and no third-party dependencies.
  • Process PDF files up to 3x faster than alternative Java PDF libraries.
  • Simple licensing options and source code access for OEM users.
