Why is some text split into multiple elements?
BuildVu makes a best-effort attempt to group text into as few elements as possible whilst preserving visual accuracy. In realtext modes, BuildVu takes the viewpoint that it is better to sacrifice small amounts of accuracy in exchange for output that is simple, fast to load, and bloat-free.
There are certain times when a new text element is required, such as when the font size, color or family changes, or if the position changes in a way that can’t be handled by word spacing or letter spacing.
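The rule above can be sketched in a few lines of Python. This is only an illustration and not BuildVu's actual implementation: the `Run` type, the `MAX_GAP_RATIO` threshold, and both function names are hypothetical, and the idea that a gap of up to half the font size can be absorbed by spacing is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A hypothetical fragment of text as it might be read from a PDF."""
    text: str
    font_family: str
    font_size: float
    color: str
    x: float      # left edge of the run
    width: float  # advance width of the run
    y: float      # baseline position


# Assumed threshold: the largest gap (as a fraction of the font size) that
# this sketch treats as absorbable by word spacing or letter spacing.
MAX_GAP_RATIO = 0.5


def needs_new_element(prev: Run, cur: Run) -> bool:
    """Return True when cur cannot share a text element with prev."""
    prev_style = (prev.font_family, prev.font_size, prev.color)
    cur_style = (cur.font_family, cur.font_size, cur.color)
    if prev_style != cur_style:
        return True  # font size, color or family changed
    if prev.y != cur.y:
        return True  # different baseline
    gap = cur.x - (prev.x + prev.width)
    return abs(gap) > MAX_GAP_RATIO * cur.font_size  # jump too large for spacing


def group_runs(runs):
    """Group consecutive runs, starting a new group only when required."""
    groups = []
    for run in runs:
        if groups and not needs_new_element(groups[-1][-1], run):
            groups[-1].append(run)
        else:
            groups.append([run])
    return groups
```

With this sketch, two adjacent black runs on the same baseline end up in one group, while a change of color or a jump to a distant column starts a new one.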
There are also some PDF features that are not available in HTML which can impact grouping, such as:
- Text Kerning
- Alternate Character Appearances
1. Text Kerning
In PDF, custom spacing can be applied between individual characters. We see this used in a number of ways, for example:
- Small amounts of kerning for stylistic reasons (e.g. less gap between A & W in AWAY)
- Large amounts of kerning for positioning reasons (e.g. shift to next column in a table)
- Large negative amounts of kerning for positioning reasons (e.g. shift to previous column in a table)
- Used instead of space characters (PDF readers are expected to detect the gaps and insert space characters when extracting text)
- Used instead of word spacing or letter spacing
- Used to undo other types of spacing (e.g. apply word or letter spacing then remove it with kerning)
BuildVu-HTML will average small amounts of kerning across the whole line, whereas BuildVu-SVG can correctly apply kerning as it is a feature of SVG. Large amounts of kerning may require multiple text elements to accurately position the text.
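As a rough illustration of the averaging idea, the Python sketch below splits a line wherever a kerning adjustment is large and averages the remaining small adjustments into a single letter-spacing value per segment. It is not BuildVu's algorithm: the input format (a list of character and adjustment pairs, where a positive adjustment means extra space after the character), the `large_kern_px` threshold, and the function names are all assumptions made for the example.

```python
def make_segment(text, small_adjustments, offset_before):
    """Build one output segment with a single averaged letter-spacing value."""
    if small_adjustments:
        spacing = sum(small_adjustments) / len(small_adjustments)
    else:
        spacing = 0.0
    return {
        "text": text,
        "letter_spacing": round(spacing, 3),
        "offset_before": offset_before,  # large shift that forced the split
    }


def split_and_average(glyphs, large_kern_px=3.0):
    """glyphs: list of (char, adjust_px) pairs, adjust_px applied after char.

    Small adjustments are averaged across the segment; an adjustment larger
    than large_kern_px (an assumed threshold) ends the current segment.
    """
    segments = []
    text, small, offset = "", [], 0.0
    for char, adjust in glyphs:
        text += char
        if abs(adjust) > large_kern_px:
            segments.append(make_segment(text, small, offset))
            text, small, offset = "", [], adjust
        else:
            small.append(adjust)
    if text:
        segments.append(make_segment(text, small, offset))
    return segments
```

The averaging is deliberately lossy: the tighter gap between A and W is spread over the whole segment, which trades a small amount of accuracy for a single element.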
For more information about how text kerning is handled, please see “How is text kerning handled?”
2. Alternate Character Appearances
In HTML, a character can only have one appearance inside a font. In PDF, characters may have multiple appearances within a font.
For example, in HTML, the string “AAA” will always display the same appearance three times. However, in PDF, the string “AAA” could appear as “ABC” on the page.
BuildVu deals with this by creating a new variant of the font when it encounters a character that already exists in the font with a different appearance. Switching to the new variant requires a separate element.
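The variant approach described above might look something like the following Python sketch. It is a simplified illustration under stated assumptions rather than BuildVu's real code: the `FontVariants` class, the glyph identifiers, and the first-fit choice of variant are all hypothetical.

```python
class FontVariants:
    """Tracks variants of one font; each variant maps a character to one glyph."""

    def __init__(self, base_name):
        self.base_name = base_name
        self.variants = [{}]  # variant index -> {char: glyph_id}

    def variant_for(self, char, glyph_id):
        """Return the index of a variant where char has this appearance."""
        for index, mapping in enumerate(self.variants):
            # Usable if the char is unassigned here or already has this glyph.
            if mapping.get(char, glyph_id) == glyph_id:
                mapping[char] = glyph_id
                return index
        # Every existing variant shows char differently: create a new variant.
        self.variants.append({char: glyph_id})
        return len(self.variants) - 1


def split_by_variant(font, glyphs):
    """glyphs: list of (char, glyph_id). Returns (variant_index, text) elements."""
    elements = []
    for char, glyph_id in glyphs:
        index = font.variant_for(char, glyph_id)
        if elements and elements[-1][0] == index:
            elements[-1] = (index, elements[-1][1] + char)
        else:
            elements.append((index, char))  # switching variant needs a new element
    return elements
```

In this sketch, a string of three A characters that all use the same glyph stays in one element, whereas three A characters with three different glyphs produce three variants and therefore three elements.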
Dealing with Chaotic Text
Inside PDF files, it is very common for text to be defined in ways that you would not expect. This is partly because the PDF file format is designed to be a presentation format, and partly because the tools that create PDF files often do so in ‘creative’ ways.
In all cases, text in PDF is defined at most on a line-by-line basis (the concept of a paragraph or a table does not exist within the PDF format). In addition to kerning being used in creative ways, PDF files often choose to define text in small increments - sometimes positioning each character individually, sometimes by word, and sometimes in seemingly random groupings.
What is seen on the surface may look simple, but often what is going on under the hood is a lot more complex.
In most cases, BuildVu will make the content inside the PDF look simple (even when it is not).
If you do encounter text that is not grouping as you would expect (or hope), then we are happy to investigate what is going on and whether we can improve the way that BuildVu handles it. However, please do bear in mind that this is a complex problem and it is not always feasible for text elements to be combined.
In general, the quality of the output depends on the quality of the PDF.