
Extract Text from PDF using cURL

Table of contents

  1. Introduction
  2. Prerequisites
  3. Code Example
  4. Return result to a callback url
  5. Configuration Options
  6. Upload by URL
  7. Using Authentication
  8. Further details

Introduction

This tutorial shows you how to extract text from PDFs using a hosted JPedal cloud API. You can set up your own self-hosted JPedal microservice.
The examples below use https://my-self-hosted-service.com/jpedal as the URL; replace this with the URL of your self-hosted service.

The service is accessed with cURL through its REST API.

Prerequisites

Before you begin, make sure cURL is installed. The setup varies by operating system; more details can be found on the curl website.

Code Example

Here is a basic code example to extract text from PDFs.
Please note the file entry must be ‘@’ followed by the path (absolute or relative) to the file.
Configuration options and advanced features can be found below.

curl -X POST -F input="upload" -F file="@/path/to/file/myfile.pdf" -F settings="{\"mode\":\"convertToImages\",\"format\":\"png\"}" https://my-self-hosted-service.com/jpedal

Note: the format of settings is different depending on which platform you are on. See configuration options below.

The response will be in JSON format containing a uuid.

 {"uuid" : "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"}

You can use this uuid to poll the progress of your extraction and to retrieve the URL for the output once the extraction is complete.

curl https://my-self-hosted-service.com/jpedal?uuid=aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa

The response will be in JSON format and provide the following details.

 {
     "state" : "processed", 
     "downloadUrl" : "output/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa/myfile.zip",
     "previewUrl" : "output/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa/myfile/index.html"
 }

You can use the previewUrl to preview the output in your browser.

You may also download the converted output using the download URL. This can be done with the following cURL request.

# Follow redirects (-L) and save the file to the current directory under its remote name (-O), in this case "myfile.zip"
curl -LO https://my-self-hosted-service.com/jpedal/output/aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa/myfile.zip
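
The poll-then-download steps above could be scripted roughly as follows. This is a minimal sketch, not part of JPedal: the helper names json_field and wait_for_extraction, the poll interval and the retry limit are all assumptions made for illustration, and the sed-based JSON parsing only copes with flat responses like the ones shown in this tutorial.

```shell
#!/bin/sh
# Placeholder URL from this tutorial; replace it with your own service.
BASE_URL="https://my-self-hosted-service.com/jpedal"

# Print the value of a string field from a flat JSON object read on stdin.
# Crude sed-based parsing for illustration; a JSON tool such as jq is more robust.
json_field() {
    tr -d '\n' | sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Poll until the state is "processed", then print the full download URL.
# $1 = uuid, $2 = seconds between polls (default 2), $3 = maximum polls (default 30).
wait_for_extraction() {
    uuid=$1
    interval=${2:-2}
    attempts=${3:-30}
    while [ "$attempts" -gt 0 ]; do
        response=$(curl -s "$BASE_URL?uuid=$uuid") || return 1
        state=$(printf '%s' "$response" | json_field state)
        if [ "$state" = "processed" ]; then
            # downloadUrl is relative to the service URL, as in the example response
            path=$(printf '%s' "$response" | json_field downloadUrl)
            printf '%s/%s\n' "$BASE_URL" "$path"
            return 0
        fi
        attempts=$((attempts - 1))
        sleep "$interval"
    done
    echo "gave up waiting for $uuid" >&2
    return 1
}

# Usage (uuid taken from the upload response):
# url=$(wait_for_extraction "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa") && curl -LO "$url"
```

Capping the number of polls keeps the script from looping forever if the extraction never reaches the processed state.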

Return result to a callback url

The JPedal Microservice accepts a callback URL to which it sends the status of an extraction on completion. Using a callback URL removes the need to poll the service to determine when the extraction is complete.
The callback URL can be provided as shown below.

curl -X POST -F input="upload" -F callbackUrl="http://listener.url" -F file="@/path/to/file/myfile.pdf" -F settings="{\"mode\":\"convertToImages\",\"format\":\"png\"}" https://my-self-hosted-service.com/jpedal

Configuration Options

The JPedal API accepts a stringified JSON object containing key-value pair configuration options to customise your extraction. The settings should be added before the URL in the cURL command. A full list of the configuration options to extract text from PDFs can be found here.

Note that the syntax for escaping double quotes varies between environments, so check what works for your specific environment.

Note that for PowerShell your command should start with curl.exe --% to avoid parsing errors.
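
As an illustration of the quoting differences, in a POSIX shell (sh, bash, zsh) the JSON can be wrapped in single quotes instead of escaping every double quote. The sketch below shows that both spellings produce the same string; the function name submit_pdf is hypothetical and only there to show where the value is used. Other environments, such as the Windows command prompt, follow different quoting rules.

```shell
# Two ways to write the same settings value in a POSIX shell.
escaped="{\"mode\":\"convertToImages\",\"format\":\"png\"}"   # backslash-escaped, as used in this tutorial
single_quoted='{"mode":"convertToImages","format":"png"}'     # single quotes, no escaping needed

# Both variables hold exactly the same string.
[ "$escaped" = "$single_quoted" ] && echo "identical"

# Hypothetical wrapper (not part of JPedal): upload the PDF given as $1.
submit_pdf() {
    curl -X POST -F input="upload" -F file="@$1" -F settings="$single_quoted" \
        https://my-self-hosted-service.com/jpedal
}

# Usage:
# submit_pdf /path/to/file/myfile.pdf
```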

Upload by URL

As well as uploading a local file, you can provide a URL from which the JPedal Microservice will download the file before performing the extraction. To do this, replace the input and file values with the following.

-F input=download -F url="http://exampleURL/exampleFile.pdf"
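
Put together, a complete request might look like the sketch below. The function name convert_from_url is a hypothetical wrapper; the form fields, settings and service URL are the ones used elsewhere in this tutorial, with the settings written in single quotes for a POSIX shell.

```shell
# Sketch: ask the service to fetch a remote PDF instead of uploading a local file.
# convert_from_url is an illustrative name, not part of JPedal.
convert_from_url() {
    # $1 = URL of the PDF, which must be reachable from the service
    curl -X POST \
        -F input=download \
        -F url="$1" \
        -F settings='{"mode":"convertToImages","format":"png"}' \
        https://my-self-hosted-service.com/jpedal
}

# Usage:
# convert_from_url "http://exampleURL/exampleFile.pdf"
```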

Using Authentication

If you have deployed your own JPedal Microservice that requires a username and password to extract text from PDFs, you will need to provide them with each conversion. Do this by adding the --user flag with a username and password before the URL.

--user username:password
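
For example, an authenticated upload could be wrapped as in the sketch below. The function name convert_with_auth and the environment variables JPEDAL_USER and JPEDAL_PASS are assumptions chosen for illustration; reading the credentials from the environment simply keeps them out of the command itself. If you pass only a username to --user, curl prompts for the password instead.

```shell
# Sketch: upload a local PDF to a service that requires authentication.
# convert_with_auth, JPEDAL_USER and JPEDAL_PASS are hypothetical names.
convert_with_auth() {
    # $1 = path to the PDF
    curl -X POST \
        --user "${JPEDAL_USER}:${JPEDAL_PASS}" \
        -F input="upload" \
        -F file="@$1" \
        -F settings='{"mode":"convertToImages","format":"png"}' \
        https://my-self-hosted-service.com/jpedal
}

# Usage:
# JPEDAL_USER=username JPEDAL_PASS=password convert_with_auth /path/to/file/myfile.pdf
```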

Further details

Official cURL website
JPedal Microservice API
JPedal Microservice Use