The search.json file contains the textual content of the entire document in JSON format. It is an optional asset which can be enabled with a configuration option. It is enabled when using the IDRViewer Complete UI (BuildVu’s default setting).
The file is a JSON array of strings, one per page:
[ "Text on page 1", "Text on page 2", "...etc" ]
The purpose of the search.json file is to simplify document search. It provides an easy way to search the entire content of the document without needing to load and parse every HTML or SVG page.
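As a rough illustration of the kind of search this enables, the sketch below scans the page array for a query string. The function name search_pages and the inline sample data are assumptions made for this example and are not part of BuildVu or IDRViewer; a real application would load the array from the search.json file instead.

```python
import json

# Sample data standing in for the contents of a search.json file (assumed).
SAMPLE = '["Invoice for March", "Payment terms and invoice totals", "Appendix"]'


def search_pages(pages, query):
    """Return (page_number, offset) for every case-insensitive match.

    Page numbers are 1-based; offsets index into the lowercased page text.
    """
    needle = query.lower()
    results = []
    if not needle:
        return results
    for page_number, text in enumerate(pages, start=1):
        haystack = text.lower()
        offset = haystack.find(needle)
        while offset != -1:
            results.append((page_number, offset))
            offset = haystack.find(needle, offset + 1)
    return results


pages = json.loads(SAMPLE)
print(search_pages(pages, "invoice"))  # [(1, 0), (2, 18)]
```

Because the whole document arrives as a single array, a search like this needs one request and no HTML or SVG parsing.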
The content of the search.json file matches the HTML/SVG content exactly, with the following exceptions:
In the HTML and SVG content, ampersand (&), greater-than (>) and less-than (<) characters are replaced with the HTML entities &amp;, &gt; and &lt;, whereas in the search.json file they are not.
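If page markup text ever needs to be compared with the search.json text, the entities have to be decoded first. A minimal sketch using Python's standard html module follows; the helper name html_text_to_search_text is hypothetical.

```python
import html


def html_text_to_search_text(html_text):
    """Decode HTML entities so the text is comparable with search.json.

    Note: html.unescape decodes every HTML entity, not only the three
    entities mentioned above.
    """
    return html.unescape(html_text)


print(html_text_to_search_text("Fish &amp; Chips &lt; &gt;"))  # Fish & Chips < >
```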
In PDF files, a glyph may have a multi-character extraction value such as ffi. This is commonly seen with ligatures, where the ligature appearance is used for stylistic purposes and the extraction value is the decomposed form.
In such cases, the HTML/SVG content contains the ligature form (which is required to ensure the font displays the correct appearance), and the search.json file contains the decomposed form.
When this occurs, a data attribute is added to the HTML/SVG which defines how to map the DOM content to the search.json content. E.g.
<span data-mappings="[[13,'ffi']]">The ligature ﬃ decomposes to ffi.</span>
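The sketch below shows one way such a mapping might be applied to recover the search.json text from the DOM text. It assumes, based only on the example above, that each entry is a pair of [index of the character in the element's text, replacement string]; how indices are counted in general (for instance with characters outside the Basic Multilingual Plane) is not covered here. The function name apply_mappings is hypothetical.

```python
import ast


def apply_mappings(dom_text, mappings_attr):
    """Replace mapped characters in dom_text with their decomposed forms.

    Assumption: each mapping is [index, replacement], where index is the
    position of a single character in dom_text.
    """
    # The attribute value uses single quotes, so it is not strict JSON;
    # ast.literal_eval can read it as a Python list literal.
    mappings = ast.literal_eval(mappings_attr)
    chars = list(dom_text)
    for index, replacement in mappings:
        chars[index] = replacement
    return "".join(chars)


# "\ufb03" is the single-character ligature for "ffi" (U+FB03).
dom_text = "The ligature \ufb03 decomposes to ffi."
print(apply_mappings(dom_text, "[[13,'ffi']]"))
# The ligature ffi decomposes to ffi.
```

Replacing entries in a list of characters keeps every index tied to the original DOM text, so multiple mappings in one element do not shift one another.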
In PDF files, sometimes the appearance of text is correct but the extraction values are broken. This may happen if the PDF creation tool was misconfigured, a PDF distiller was used, or the author deliberately broke the encoding to prevent meaningful text extraction. The result is similar to Mojibake.
In such cases, BuildVu may need to remap the extraction values to prevent the broken values (which may be in the control character or combining character range) from breaking the appearance of the document.
When this occurs, BuildVu may remap the characters to the Private Use Area (PUA) in the font. When such characters are displayed in the HTML or SVG, the font ensures that they appear correctly. However, when they are displayed without the font (e.g. in a search result snippet), they will not appear correctly.
In most applications, characters in the Private Use Area range (U+E000–U+F8FF) are displayed as the .notdef glyph, although this is not guaranteed.
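An application that shows search.json text without the document font may want to detect these characters before display. The following sketch checks for the range quoted above and substitutes a placeholder; the helper names is_pua and mask_pua and the choice of placeholder are assumptions for illustration, not BuildVu behaviour.

```python
PUA_START = 0xE000
PUA_END = 0xF8FF


def is_pua(char):
    """Return True if char lies in the Private Use Area U+E000 to U+F8FF."""
    return PUA_START <= ord(char) <= PUA_END


def mask_pua(text, placeholder="\ufffd"):
    """Replace every Private Use Area character with a placeholder."""
    return "".join(placeholder if is_pua(c) else c for c in text)


snippet = "Total: \ue001\ue002 units"
print(mask_pua(snippet, "?"))  # Total: ?? units
```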