search.json

The search.json file contains the textual content of the entire document in JSON format. It is an optional asset that can be enabled with a configuration option, and it is enabled when using the IDRViewer Complete UI (BuildVu’s default setting).

Basic structure:

[
  "Text on page 1",
  "Text on page 2",
  "...etc"
]

Purpose

The purpose of the search.json file is to simplify document search. It provides an easy way to search the entire content of the document without needing to load and parse every HTML or SVG page.
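
A minimal sketch of such a search, assuming search.json has already been loaded as a list with one string per page (as in the basic structure above). The function name find_pages, the sample data and the case-insensitive substring match are illustrative choices, not part of BuildVu or IDRViewer.

```python
import json

# Sample data in the documented shape: one string per page.
# In practice this would be read from the search.json file of a converted document.
SAMPLE = '["Invoice for ACME Ltd", "Terms & conditions", "Contact ACME support"]'


def find_pages(pages: list[str], query: str) -> list[int]:
    """Return the 1-based page numbers whose text contains the query (case-insensitive)."""
    needle = query.casefold()
    return [number for number, text in enumerate(pages, start=1)
            if needle in text.casefold()]


pages = json.loads(SAMPLE)
print(find_pages(pages, "acme"))  # [1, 3]
```

Because the array index corresponds to the page, a match can be turned directly into a page number without opening any HTML or SVG page.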

Important Differences

The content of the search.json file matches the HTML/SVG content exactly, with the following exceptions:

1. HTML Entities

In the HTML and SVG content, ampersand (&), greater-than (>) and less-than (<) characters are replaced with the HTML entities &amp;, &gt; and &lt; respectively, whereas in the search.json file they appear as literal characters.
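
As a hedged illustration of this difference, the sketch below decodes the entities in a piece of raw HTML text so that it can be compared with the corresponding search.json text. It uses Python's standard html.unescape; the sample strings are made up for the example.

```python
from html import unescape

# Text as it appears in the raw HTML/SVG markup (entities escaped)...
html_text = "a &lt; b &amp;&amp; b &gt; c"
# ...and the same text as it would appear in search.json (literal characters).
search_text = "a < b && b > c"

# Decoding the entities makes the two forms comparable.
assert unescape(html_text) == search_text
```

Text read through a DOM API such as textContent is already decoded by the browser, so this step only matters when working with the raw markup.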

2. Ligatures

In PDF files, a glyph may have a multi-character extraction value such as ffi. This is commonly seen with ligatures, where the ligature appearance is used for stylistic purposes and the extraction value is the decomposed form.

In such cases, the HTML/SVG content contains the ligature form (which is required to ensure the font displays the correct appearance), and the search.json file contains the decomposed form.

When this occurs, a data attribute is added to the HTML/SVG which defines how to map the DOM content to the search.json content. For example: <span data-mappings="[[13,'ffi']]">The ligature ﬃ decomposes to ffi.</span>
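
The sketch below shows one way the mapping could be applied. It assumes that each entry in data-mappings is an [index, replacement] pair, where the index is a character offset into the element's DOM text and the single character at that offset is replaced by the decomposed string. That reading is inferred from the example above rather than from a formal specification, and the function name apply_mappings is hypothetical.

```python
import ast


def apply_mappings(dom_text: str, mappings_attr: str) -> str:
    """Replace single characters of dom_text according to [[index, replacement], ...]."""
    # The attribute value in the example uses single-quoted strings, so it is not
    # strict JSON; ast.literal_eval parses it as a Python literal instead.
    mappings = ast.literal_eval(mappings_attr)
    # Assumption: indexes refer to positions in the original DOM text, so the
    # replacements are applied from the highest index down to keep offsets valid.
    for index, replacement in sorted(mappings, reverse=True):
        dom_text = dom_text[:index] + replacement + dom_text[index + 1:]
    return dom_text


# "\ufb03" is the single-character ffi ligature, at offset 13 in this string.
dom_text = "The ligature \ufb03 decomposes to ffi."
print(apply_mappings(dom_text, "[[13,'ffi']]"))
```

Under these assumptions the result is the text as it would appear in search.json, which allows a match found in search.json to be related back to the DOM content.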

Other Notable Information

In PDF files, sometimes the appearance of text is correct but the extraction values are broken. This may happen if the PDF creation tool was misconfigured, a PDF distiller was used, or the author deliberately broke the encoding to prevent meaningful text extraction. The result is similar to Mojibake.

In such cases, BuildVu may need to remap the extraction values to prevent the broken values (which may be in the control character or combining character range) from breaking the appearance of the document.

When this occurs, BuildVu may remap the characters to the Private Use Area (PUA) in the font. When such characters are displayed in the HTML or SVG, the font ensures that they appear correctly. However, when such characters are displayed without the font (e.g. in a search result snippet), they will appear incorrectly.

Most applications display characters in the Private Use Area range (U+E000–U+F8FF) as the .notdef glyph, although this is not guaranteed.
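
An application that shows search result snippets without the document font may want to detect these characters first. The following is a minimal sketch of one possible approach, not something BuildVu or IDRViewer does: the function names and the choice of placeholder are assumptions made for illustration.

```python
def has_private_use(text: str) -> bool:
    """Return True if text contains a character in U+E000 to U+F8FF."""
    return any("\ue000" <= ch <= "\uf8ff" for ch in text)


def mask_private_use(text: str, placeholder: str = "\ufffd") -> str:
    """Replace each Private Use Area character with a placeholder."""
    return "".join(placeholder if "\ue000" <= ch <= "\uf8ff" else ch
                   for ch in text)


# Hypothetical snippet containing two remapped characters.
snippet = "Total: \ue001\ue002 units"
print(has_private_use(snippet))        # True
print(mask_private_use(snippet, "?"))  # Total: ?? units
```

Whether to mask, hide or drop such snippets is a design decision for the application; the check only identifies text that cannot be rendered meaningfully without the font.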

