Extract Text from PDF using Java

Introduction
Prerequisites
Code Example
Return result to a callback url
Configuration Options
Upload by URL
Using Authentication
Further details

Introduction

The following tutorial shows you how to extract text from PDFs using a hosted JPedal cloud API. You can also set up your own self-hosted JPedal microservice.

Although the service can be accessed with plain HTTP requests, this tutorial uses our open-source Java IDRCloudClient, which provides a simple Java wrapper around the REST API.

Prerequisites

Before you begin, make sure you have an up-to-date version of the JDK (version 8 or above) installed. You can find out more on Java’s website.

Code Example

Here is a basic code example to extract text from PDFs. Configuration options and advanced features can be found below.

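import java.util.HashMap;
import java.util.Map;

// IDRCloudClient and ClientException come from the IDRCloudClient library;
// import them from the package used by your version of the client.
public final class ExampleUsage {

    public static void main(final String[] args) {

        // Point the client at the JPedal endpoint of your service
        final IDRCloudClient client = new IDRCloudClient("https://my-self-hosted-service.com/" + IDRCloudClient.JPEDAL);

        // Upload a local PDF and request plain-text extraction
        final HashMap<String, String> params = new HashMap<>();
        params.put("input", IDRCloudClient.UPLOAD);
        params.put("file", "path/to/file.pdf");
        params.put("settings", "{\"mode\":\"extractText\",\"type\":\"plainText\"}");
        try {
            // Send the request and collect the results
            final Map<String, String> results = client.convert(params);

            System.out.println("   ---------   ");
            System.out.println(results.get("previewUrl"));

            // Save the output to a local directory
            IDRCloudClient.downloadResults(results, "path/to/outputDir", "example");
        } catch (final ClientException | InterruptedException e) {
            e.printStackTrace();
        }
    }
}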

Return result to a callback url

The JPedal Microservice accepts a callback URL, to which it sends the status of an extraction on completion. Using a callback URL removes the need to poll the service to determine when the extraction is complete.
The callback URL can be added to the params map as shown below.

final HashMap<String, String> params = new HashMap<>();
params.put("input", IDRCloudClient.UPLOAD);
params.put("file", "path/to/file.pdf");
params.put("callbackUrl", "http://listener.url");
params.put("settings","{\"mode\":\"extractText\",\"type\":\"plainText\"}"); 
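On the receiving side, something has to be listening at the callback URL. The following is a minimal sketch of such a listener using the JDK's built-in HTTP server. The class name CallbackListener, the /callback path and port 8080 are assumptions made for illustration, and because the request format is not described here, the sketch simply hands the raw request body to a handler rather than parsing it.

```java
import com.sun.net.httpserver.HttpServer;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical listener, not part of IDRCloudClient.
final class CallbackListener {

    private CallbackListener() {
    }

    // Starts an HTTP server on the given port (0 picks a free port) and passes
    // the body of every request received on /callback to the handler.
    static HttpServer start(final int port, final Consumer<String> onCallback) throws IOException {
        final HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/callback", exchange -> {
            try (InputStream in = exchange.getRequestBody()) {
                final ByteArrayOutputStream body = new ByteArrayOutputStream();
                final byte[] buffer = new byte[4096];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    body.write(buffer, 0, read);
                }
                onCallback.accept(new String(body.toByteArray(), StandardCharsets.UTF_8));
            }
            // Reply 200 with no response body so the sender knows the callback arrived
            exchange.sendResponseHeaders(200, -1);
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(final String[] args) throws IOException {
        final HttpServer server = start(8080, body -> System.out.println("Callback received: " + body));
        System.out.println("Listening on port " + server.getAddress().getPort());
    }
}
```

With this running, the value passed as callbackUrl would be the address at which the microservice can reach the listener, ending in /callback.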

Configuration Options

The JPedal API accepts a stringified JSON object containing key-value pair configuration options to customise your extraction. The settings should be added to the params map under the "settings" key. A full list of the configuration options to extract text from PDFs can be found in the JPedal Microservice API documentation.

params.put("settings", "{\"key\":\"value\",\"key\":\"value\"}");
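Writing the escaped JSON string by hand becomes error-prone as the number of options grows. One possible approach is to build it from a map, as in the sketch below. SettingsJson is a hypothetical helper, not part of IDRCloudClient, and it assumes every option value is a string; a JSON library would be a better choice if you need numbers, booleans or nested values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of IDRCloudClient): builds the stringified
// JSON expected by the "settings" parameter from a map of string options.
final class SettingsJson {

    private SettingsJson() {
    }

    // Escapes backslashes and double quotes so the value stays valid JSON.
    // Control characters are not handled in this sketch.
    static String escape(final String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Serialises the map as a flat JSON object, keeping insertion order.
    static String toJson(final Map<String, String> settings) {
        final StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (final Map.Entry<String, String> entry : settings.entrySet()) {
            if (!first) {
                json.append(',');
            }
            json.append('"').append(escape(entry.getKey())).append("\":\"")
                .append(escape(entry.getValue())).append('"');
            first = false;
        }
        return json.append('}').toString();
    }

    public static void main(final String[] args) {
        final Map<String, String> settings = new LinkedHashMap<>();
        settings.put("mode", "extractText");
        settings.put("type", "plainText");
        // Prints {"mode":"extractText","type":"plainText"}
        System.out.println(toJson(settings));
    }
}
```

The result can then be passed straight to the params map, for example params.put("settings", SettingsJson.toJson(settings)).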

Upload by URL

As well as uploading a local file, you can provide a URL from which the JPedal Microservice will download the PDF before performing the extraction. To do this, replace the input and file entries in the params map with the following.

params.put("input", IDRCloudClient.DOWNLOAD);
params.put("url", "http://exampleURL/exampleFile.pdf");
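If your application accepts both local paths and URLs, it needs some way to decide which input mode to use. The sketch below shows one way to make that decision with the standard java.net.URI class. UrlInputCheck and isHttpUrl are hypothetical names, and restricting URLs to http and https is an assumption rather than a documented requirement of the service.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper (not part of IDRCloudClient): decides whether a source
// string looks like an http(s) URL or should be treated as a local file path.
final class UrlInputCheck {

    private UrlInputCheck() {
    }

    static boolean isHttpUrl(final String candidate) {
        if (candidate == null) {
            return false;
        }
        try {
            final URI uri = new URI(candidate);
            final String scheme = uri.getScheme();
            // Require a host and an http or https scheme
            return uri.getHost() != null
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (final URISyntaxException e) {
            return false;
        }
    }

    public static void main(final String[] args) {
        final String source = args.length > 0 ? args[0] : "http://exampleURL/exampleFile.pdf";
        if (isHttpUrl(source)) {
            // Here you would set "input" to IDRCloudClient.DOWNLOAD and "url" to source
            System.out.println("Use download mode for " + source);
        } else {
            // Here you would set "input" to IDRCloudClient.UPLOAD and "file" to source
            System.out.println("Use upload mode for " + source);
        }
    }
}
```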

Using Authentication

If the JPedal Microservice requires authentication, you will need to provide a username and password. Add them as username and password entries to the params map that you pass to the convert method, as shown below.

params.put("username","yourUsername");
params.put("password","yourPassword");
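In practice you may prefer not to hard-code credentials in source files. The sketch below reads them from environment variables and adds them to the params map. CredentialParams is a hypothetical helper, and the variable names JPEDAL_USERNAME and JPEDAL_PASSWORD are assumptions chosen for this example; only the username and password keys come from the tutorial above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of IDRCloudClient): adds credentials to the
// params map and fails early if either value is missing.
final class CredentialParams {

    private CredentialParams() {
    }

    static void addCredentials(final Map<String, String> params, final String username, final String password) {
        if (username == null || username.isEmpty() || password == null || password.isEmpty()) {
            throw new IllegalStateException("Username and password must both be set");
        }
        params.put("username", username);
        params.put("password", password);
    }

    public static void main(final String[] args) {
        final Map<String, String> params = new HashMap<>();
        // JPEDAL_USERNAME and JPEDAL_PASSWORD are assumed variable names
        addCredentials(params, System.getenv("JPEDAL_USERNAME"), System.getenv("JPEDAL_PASSWORD"));
        System.out.println("Credentials added for user " + params.get("username"));
    }
}
```

If either variable is unset, the main method stops with an IllegalStateException before any request would be sent.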

Further details

IDRCloudClient on GitHub
JPedal Microservice API
JPedal Microservice Use
