Extracting PDF Metadata and Metrics in Java
The JPedal library can be used to extract metadata and metrics from a PDF file. This functionality is provided by the PdfUtilities class.
You can use PdfUtilities in your own applications with the following example code; simply remove the lines you do not require.
final PdfUtilities utilities = new PdfUtilities("path/to/exampleFile.pdf");
utilities.setPassword("password"); //Only required if the file requires a password
try {
    if (utilities.openPDFFile()) {
        //Returns a String containing the PDF version for the document
        final String documentPDFVersion = utilities.getPDFVersion();
        //Returns true if the file contains any embedded fonts
        final boolean hasEmbeddedFonts = utilities.hasEmbeddedFonts();
        //Returns a map where the key is the page number and the value is a String detailing fonts for that page
        final Map<Integer, String> documentFontData = utilities.getAllFontDataForDocument();
        //Returns a String containing all metadata fields for the document
        final String documentPropertiesAsXML = utilities.getDocumentPropertyFieldsInXML();
        //Returns a map where the key is the property name and the value is the property's value
        final Map<String, String> documentPropertiesAsMap = utilities.getDocumentPropertyStringValuesAsMap();
        //Returns true if the file conforms to all tagged PDF conventions. It may still be possible to extract some tagged content even if this is false
        final boolean isFullyTagged = utilities.isMarkedContent();
        //Returns the permissions value for this PDF and shows the permissions as a string in the console
        final int permissions = utilities.getPdfFilePermissions();
        PdfUtilities.showPermissionsAsString(permissions);
        //Returns the total page count as an int
        final int totalPageCount = utilities.getPageCount();
        //Page numbers start at 1, so the loop must include totalPageCount itself
        for (int i = 1; i <= totalPageCount; i++) {
            //Get the page dimensions for the specified page in the given units and type
            final float[] pageDimensions = utilities.getPageDimensions(i, PdfUtilities.PageUnits.Pixels, PdfUtilities.PageSizeType.CropBox);
            //Returns the total number of PDF commands used to define the specified page's content
            final int commandCountForPage = utilities.getCommandCountForPageStream(i);
            //Returns the font data as a String for the specified page
            final String fontDataForPage = utilities.getFontDataForPage(i);
            //Returns the image data as a String for the specified page
            final String xImageDataForPage = utilities.getXImageDataForPage(i);
        }
    }
} catch (final PdfException e) {
    e.printStackTrace();
}
utilities.closePDFfile();
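If you need to act on individual permissions in code rather than print them, the int can be tested bit by bit. The following is only a sketch under an assumption: it treats the value from getPdfFilePermissions() as the raw permissions flag field (the /P entry) of the PDF standard security handler, which this page does not confirm. PermissionFlags and its method names are hypothetical and are not part of JPedal; PdfUtilities.showPermissionsAsString remains the library's own view of the value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of JPedal) that decodes a PDF /P permissions value. */
final class PermissionFlags {

    // Bit positions follow the PDF standard security handler, where bit 1 is the lowest-order bit.
    // ASSUMPTION: the int being decoded is the raw /P value.
    static final int PRINT = 1 << 2;                 // bit 3
    static final int MODIFY = 1 << 3;                // bit 4
    static final int COPY = 1 << 4;                  // bit 5
    static final int ANNOTATE = 1 << 5;              // bit 6
    static final int FILL_FORMS = 1 << 8;            // bit 9
    static final int EXTRACT_ACCESSIBILITY = 1 << 9; // bit 10
    static final int ASSEMBLE = 1 << 10;             // bit 11
    static final int PRINT_HIGH_QUALITY = 1 << 11;   // bit 12

    private PermissionFlags() {
    }

    /** Returns true if the bit selected by mask is set in permissions. */
    static boolean isAllowed(final int permissions, final int mask) {
        return (permissions & mask) != 0;
    }

    /** Returns the names of all operations whose bit is set, in bit order. */
    static List<String> describe(final int permissions) {
        final List<String> allowed = new ArrayList<>();
        if (isAllowed(permissions, PRINT)) allowed.add("print");
        if (isAllowed(permissions, MODIFY)) allowed.add("modify");
        if (isAllowed(permissions, COPY)) allowed.add("copy");
        if (isAllowed(permissions, ANNOTATE)) allowed.add("annotate");
        if (isAllowed(permissions, FILL_FORMS)) allowed.add("fill forms");
        if (isAllowed(permissions, EXTRACT_ACCESSIBILITY)) allowed.add("extract for accessibility");
        if (isAllowed(permissions, ASSEMBLE)) allowed.add("assemble");
        if (isAllowed(permissions, PRINT_HIGH_QUALITY)) allowed.add("print high quality");
        return allowed;
    }

    public static void main(final String[] args) {
        // Example value only; in practice this would come from getPdfFilePermissions()
        final int permissions = -3900;
        System.out.println(describe(permissions)); // prints [print]
    }
}
```

A call such as PermissionFlags.isAllowed(permissions, PermissionFlags.COPY) could then guard a text extraction step, provided the assumption about the value's meaning holds for your files.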
The PdfUtilities class can also be instantiated using a second constructor that accepts a PDF as a byte array instead of a file name. This constructor can be used as follows.
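//Read the PDF into pdfByteArray, for example with java.nio.file.Files (throws IOException)
final byte[] pdfByteArray = Files.readAllBytes(Paths.get("path/to/exampleFile.pdf"));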
final PdfUtilities utilities = new PdfUtilities(pdfByteArray);
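The byte array constructor is most useful when the PDF never exists as a file, for example when it arrives as an upload or is read from a database. As a minimal sketch using only the standard Java library, the class below drains an InputStream into a byte array and runs a cheap check that the data starts with the "%PDF-" header before it is handed to PdfUtilities. PdfBytes, readFully and looksLikePdf are hypothetical names chosen for illustration and are not part of JPedal.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of JPedal) for preparing a PDF byte array. */
final class PdfBytes {

    private PdfBytes() {
    }

    /** Reads every remaining byte from the stream; the caller is responsible for closing it. */
    static byte[] readFully(final InputStream in) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }

    /** Returns true if the data begins with the "%PDF-" header marker. */
    static boolean looksLikePdf(final byte[] data) {
        final byte[] header = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < header.length) {
            return false;
        }
        for (int i = 0; i < header.length; i++) {
            if (data[i] != header[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(final String[] args) throws IOException {
        // Stand-in for a real source such as an upload stream
        final byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            final byte[] pdfByteArray = readFully(in);
            if (looksLikePdf(pdfByteArray)) {
                // pdfByteArray could now be passed to new PdfUtilities(pdfByteArray)
                System.out.println("Read " + pdfByteArray.length + " bytes");
            }
        }
    }
}
```

The header check is deliberately strict. Some PDF readers tolerate a few stray bytes before the header, so treat a false result as a warning rather than proof that the data is not a PDF.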