Extract Text from PDF using PHP
Table of contents
- Introduction
- Prerequisites
- Code Example
- Return result to a callback url
- Configuration Options
- Upload by URL
- Using Authentication
- Further details
Introduction
The following tutorial shows you how to extract text from PDFs using a hosted JPedal cloud API, such as:
- the IDRsolutions trial and cloud subscription service
- your own self-hosted JPedal microservice
While all of the above services can be accessed with plain HTTP requests, this tutorial uses our open-source PHP IDRCloudClient, which provides a simple PHP wrapper around the REST API.
Prerequisites
Using Composer, install the idrsolutions-php-client package with the following command:
composer require idrsolutions/idrsolutions-php-client
Code Example
Here is a basic code example to extract text from PDFs. Configuration options and advanced features can be found below.
<?php
require_once __DIR__ . "/PATH/TO/vendor/autoload.php";
use IDRsolutions\IDRCloudClient;
$endpoint = "https://cloud.idrsolutions.com/cloud/" . IDRCloudClient::INPUT_JPEDAL;
$parameters = array(
//'token' => 'Token', // Required only when connecting to the IDRsolutions trial and cloud subscription service
'input' => IDRCloudClient::INPUT_UPLOAD,
'file' => __DIR__ . '/path/to/file.pdf', // __DIR__ has no trailing slash, so the leading "/" is needed
'settings' => '{"mode":"extractText","type":"plainText"}'
);
$results = IDRCloudClient::convert(array(
'endpoint' => $endpoint,
'parameters' => $parameters
));
IDRCloudClient::downloadOutput($results, __DIR__ . '/');
echo $results['downloadUrl'];
Return result to a callback url
The JPedal Microservice accepts a callback URL, to which it sends the status of an extraction on completion. Using a callback URL removes the need to poll the service to determine when the extraction is complete.
The callback URL can be added to the parameters array as shown below.
$parameters = array(
//'token' => 'Token', // Required only when connecting to the IDRsolutions trial and cloud subscription service
'input' => IDRCloudClient::INPUT_UPLOAD,
'callbackUrl' => 'http://listener.url',
'file' => __DIR__ . '/path/to/file.pdf',
'settings' => '{"mode":"extractText","type":"plainText"}'
);
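For illustration, a listener at the callback URL could look roughly like the sketch below. It assumes the service sends a JSON body with a state field and a downloadUrl field, similar to the results array used in the code example; the exact payload is not specified in this tutorial, so check the JPedal Microservice API before relying on any field name. handleCallback is a hypothetical helper, not part of IDRCloudClient.

```php
<?php
// Hypothetical helper: decodes a callback body and returns the download URL
// once the extraction has finished, or null otherwise.
// The 'state' value 'processed' and the field names are assumptions.
function handleCallback(string $body): ?string
{
    $data = json_decode($body, true);
    if (!is_array($data)) {
        return null; // Body was not a JSON object
    }
    if (($data['state'] ?? null) !== 'processed') {
        return null; // Still running, or failed
    }
    return $data['downloadUrl'] ?? null;
}

// Sample body, made up for this sketch.
$sample = '{"state":"processed","downloadUrl":"http://example.com/output.zip"}';
var_dump(handleCallback($sample));
```

In a real listener the body would be read from the incoming request, for example with file_get_contents('php://input').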
Configuration Options
The JPedal API accepts a stringified JSON object of key-value pairs that customise your extraction. Add it to the parameters array under the settings key. A full list of the configuration options to extract text from PDFs can be found here.
'settings' => '{"key":"value","key":"value"}'
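Hand-writing the JSON string becomes error-prone once values need escaping. One way to avoid that, sketched here, is to build the settings as a PHP array and serialise it with json_encode. The helper name buildSettings is hypothetical; the mode and type keys are taken from the code example above.

```php
<?php
// Hypothetical helper: turns an associative array of options into the
// stringified JSON object expected in the 'settings' parameter.
function buildSettings(array $options): string
{
    // Casting to object makes an empty array encode as {} rather than [].
    // JSON_THROW_ON_ERROR (PHP 7.3+) raises a JsonException on failure.
    return json_encode((object) $options, JSON_THROW_ON_ERROR);
}

$settings = buildSettings(array(
    'mode' => 'extractText',
    'type' => 'plainText',
));

echo $settings, PHP_EOL; // {"mode":"extractText","type":"plainText"}
```

The resulting string can then be passed as the 'settings' value in the parameters array.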
Upload by URL
As well as uploading a local file, you can provide a URL from which the JPedal Microservice will download the file before performing the extraction. To do this, replace the input and file values in the parameters array with the following.
'input' => IDRCloudClient::INPUT_DOWNLOAD,
'url' => 'http://exampleURL/exampleFile.pdf',
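If the same script has to handle both local files and remote URLs, the choice of input parameters can be wrapped in a small function. The sketch below is one possible approach: inputParameters is a hypothetical helper, and the literal strings 'upload' and 'download' are assumed to match the values of the client's upload and download input constants.

```php
<?php
// Hypothetical helper: picks the input-related parameters depending on
// whether $source is an http(s) URL or a local path.
// ASSUMPTION: 'upload' and 'download' are the values the service expects;
// in real code prefer the IDRCloudClient constants.
function inputParameters(string $source): array
{
    $scheme = strtolower((string) parse_url($source, PHP_URL_SCHEME));
    if (in_array($scheme, array('http', 'https'), true)) {
        return array('input' => 'download', 'url' => $source);
    }
    return array('input' => 'upload', 'file' => $source);
}

// Combine with the remaining parameters for the request.
$parameters = inputParameters('http://exampleURL/exampleFile.pdf') + array(
    'settings' => '{"mode":"extractText","type":"plainText"}',
);
print_r($parameters);
```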
Using Authentication
If you have deployed your own JPedal Microservice that requires a username and password to extract text from PDFs, you will need to provide them with each conversion. These are provided by adding two variables named username and password to the parameters array as shown below.
'username' => 'Username_If_Required',
'password' => 'Password_If_Required',
In that case you will also need to pass the authentication values to the downloadOutput method.
IDRCloudClient::downloadOutput($results, __DIR__ . '/', 'newFileName', 'username', 'password');
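To keep credentials out of source control, they can be read from the environment rather than written into the script. This is a minimal sketch of that idea: credentialsFromEnv and the variable names JPEDAL_USERNAME and JPEDAL_PASSWORD are made up for this example and are not defined by the client or the service.

```php
<?php
// Hypothetical helper: reads credentials from environment variables.
// Returns an empty array when either variable is missing, so nothing is
// added to the request for services that need no authentication.
function credentialsFromEnv(): array
{
    $username = getenv('JPEDAL_USERNAME');
    $password = getenv('JPEDAL_PASSWORD');
    if ($username === false || $password === false) {
        return array();
    }
    return array('username' => $username, 'password' => $password);
}

// Merge the credentials into the parameters used for the conversion.
$parameters = array(
    'settings' => '{"mode":"extractText","type":"plainText"}',
) + credentialsFromEnv();
```

The same two values can then be passed as the last two arguments of downloadOutput.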
Further details
- IDRCloudClient on GitHub
- IDRCloudClient on Packagist
- JPedal Microservice API
- JPedal Microservice Use