BuildVu is designed to handle any valid PDF file and is unbiased towards any specific type of PDF file. That said, PDF files can vary in quality, and BuildVu can only work with the data it is given. Generally speaking, garbage in means garbage out!
If you have control of the creation of the PDF, there are some things you can do to ensure you are future-proofing the content and getting the best out of BuildVu:
- Avoid tools or settings that compress the PDF
- Ensure fonts are embedded
- Enable marked/tagged/structured content
- Create the files as PDF/A
Tools that compress PDF files are judged by how much they reduce the file size, and they often achieve this by removing important information, which can cause problems down the line. A compressed PDF can often ‘look’ fine while being a different story under the hood.
Some examples of problems resulting from compressed PDF files that we have seen include:
- broken extraction of text due to the removal of character mappings
- fractional white lines appearing in images due to images getting tiled
- fractured text output due to the removal of width data in the font
- loss of image quality due to images getting overly compressed
Compressing a PDF file rarely affects the size of the output that BuildVu generates, so we generally recommend avoiding those tools and settings where possible.
PDF files can be created to rely on fonts that are stored on the local file system instead of embedding them within the PDF file.
When this happens, BuildVu substitutes any non-embedded fonts with open source fallbacks. To ensure the appearance remains accurate, we recommend embedding all fonts if possible.
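To spot files that probably rely on non-embedded fonts before converting them, a rough pre-check can be sketched as below. This is a heuristic and not part of BuildVu: the helper name `font_embedding_summary` is made up for illustration. It scans the raw bytes for font descriptors and for the `/FontFile`, `/FontFile2` and `/FontFile3` keys that the PDF format uses to reference an embedded font program. It cannot see objects stored inside compressed object streams, and fonts without a descriptor are not counted, so treat the result as a hint and confirm it with a proper PDF inspection tool.

```python
import re

# A font descriptor references an embedded font program through one of
# the /FontFile, /FontFile2 or /FontFile3 keys.
FONT_DESCRIPTOR = re.compile(rb"/Type\s*/FontDescriptor\b")
FONT_FILE = re.compile(rb"/FontFile[23]?\b")


def font_embedding_summary(pdf_bytes: bytes) -> dict:
    """Count font descriptors and embedded font programs visible in the raw bytes.

    Hypothetical helper for illustration only. Anything stored inside a
    compressed object stream is invisible to this scan.
    """
    descriptors = len(FONT_DESCRIPTOR.findall(pdf_bytes))
    embedded = len(FONT_FILE.findall(pdf_bytes))
    return {
        "descriptors": descriptors,
        "embedded": embedded,
        # More descriptors than font programs suggests some fonts are not embedded.
        "possibly_missing": max(descriptors - embedded, 0),
    }
```

Passing in the bytes of a file (read in binary mode) returns the three counts. A non-zero `possibly_missing` value is a prompt to check the export settings of the tool that created the PDF.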
A standard PDF file does not contain any kind of structural information (such as paragraphs, headings, etc). Marked content is an optional feature for tagging the content in PDF files with additional structural information. Most PDF files that we see do not include it, but if you have control of the PDF creation then we do strongly recommend enabling it.
BuildVu does not currently make use of marked content, but in the future we do plan to investigate if we can make better use of it when it is available.
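To check whether a PDF creation tool actually wrote marked content, one option is to look for the two entries a tagged PDF carries in its document catalog: a `/MarkInfo` dictionary containing `/Marked true`, and a `/StructTreeRoot` entry. The sketch below does this with a plain byte scan. The function name `looks_tagged` is hypothetical, and the scan cannot see a catalog stored inside a compressed object stream, so a `False` result means "not visible", not "definitely untagged".

```python
import re

# Entries that a tagged PDF carries in its document catalog.
MARKED_TRUE = re.compile(rb"/Marked\s+true\b")
STRUCT_TREE = re.compile(rb"/StructTreeRoot\b")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Return True if both tagged-PDF markers are visible in the raw bytes.

    Hypothetical helper for illustration only; it does not check that the
    structure tree is complete or meaningful.
    """
    return bool(MARKED_TRUE.search(pdf_bytes) and STRUCT_TREE.search(pdf_bytes))
```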
PDF is a very powerful file format, and with great power comes great responsibility. Not all PDF creation tools are equal, and some do a better job of it than others. As with HTML parsers, PDF parsers are expected to handle documents that do not fully comply with the spec. We are often making tweaks to our parser to handle documents with questionable interpretations of the PDF specification.
Enter PDF/A: PDF/A is a more modern, stricter subset of the PDF specification, standardized as ISO 19005, that includes provisions to ensure the document preserves information relating to content extraction and document accessibility. This goes beyond the intentions of the original PDF specification, which was designed primarily as a print format.
If the tool you’re using has the option to enable PDF/A, we strongly recommend enabling it!
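After enabling the option, it can be worth confirming that the output really declares itself as PDF/A. A PDF/A file records this in its XMP metadata through the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below reads that declaration with a byte scan. The function name `claimed_pdfa_level` is hypothetical, and the code assumes the XMP packet is stored uncompressed and encoded as UTF-8. It only reports what the file claims; checking actual conformance needs a dedicated PDF/A validator.

```python
import re
from typing import Optional

# XMP can express each property either as an element
# (<pdfaid:part>1</pdfaid:part>) or as an attribute (pdfaid:part="1").
PDFA_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
PDFA_CONFORMANCE = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def claimed_pdfa_level(pdf_bytes: bytes) -> Optional[str]:
    """Return the PDF/A level the file declares (e.g. 'PDF/A-1b'), or None.

    Hypothetical helper for illustration only. It reads the declaration
    and does not validate the file against the standard.
    """
    part = PDFA_PART.search(pdf_bytes)
    if part is None:
        return None
    level = "PDF/A-" + part.group(1).decode("ascii")
    conformance = PDFA_CONFORMANCE.search(pdf_bytes)
    if conformance is not None:
        level += conformance.group(1).decode("ascii").lower()
    return level
```

A return value of `None` means no declaration was visible in the raw bytes, which is a good reason to re-check the export settings.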