
Enhancing search with documents

Context

Content surfaced on sites does not always exist in your CMS. Product manuals, sales brochures, and technical documentation are good examples of assets that users may need to find but that are not directly rendered as webpages.

By integrating your DAM, such as Sitecore Content Hub, with your search provider, such as Sitecore Search, we can ensure these types of assets are indexed and discoverable without relying on conventional crawling methods.

The following scenario focuses on Sitecore products, but similar approaches can be used with other products. Other approaches to integrating data into your website are detailed in the Custom Editing UX for 3rd Party Integrations recipe.

A reference connector can be found on GitHub: Content Hub to Sitecore Search Connector. The provided code is intended as a guideline and must be tailored to suit your specific implementation requirements. Ensure thorough end-to-end testing is conducted to validate its functionality and performance in your environment.

Execution

Sitecore Search typically discovers and indexes content by crawling websites. However, in a multichannel world, content is often stored in various systems beyond a website, making traditional crawling less effective. Instead of having Sitecore Search pull content from a website, we can push it directly from the source system, which gives more accurate, timely, and structured indexing.

Content Hub is often the central source of assets within an organisation. However, when integrating it with Sitecore Search, different content types need to be handled in different ways. These can usually be categorised into four groups:

  • Media: Includes assets such as images and videos.
  • Documents: Covers product manuals, brochures, and publications—typically PDFs, though the approach applies to other formats as well.
  • Content: Encompasses written material such as blogs, news articles, and white papers.
  • Structured Data: Includes technical content and specifications, usually product data.

Media assets are generally not included in search results, so Sitecore Search does not need to be aware of them. Documents, on the other hand, often contain important information that users need to find, so they must be made discoverable.

Before documents can be indexed, you need to determine which assets should be pushed to Sitecore Search. The criteria could be based on the asset type, membership of a collection, a stateflow state, a custom shouldPublish property, or something else.
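As a rough illustration, the criteria can be captured in a single predicate that the rest of the integration relies on. This is a minimal sketch: the Asset class, its field names, and the example values ("Manual", "Public documents", "Approved") are all hypothetical, and a real implementation might use only one of these checks rather than combining them.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Simplified, hypothetical view of a Content Hub asset."""
    asset_type: str
    collections: set = field(default_factory=set)
    state: str = ""
    should_publish: bool = False


# Example criteria; every value here is an assumption for illustration.
PUBLISHABLE_TYPES = {"Manual", "Brochure"}
PUBLISHABLE_COLLECTION = "Public documents"
PUBLISHABLE_STATE = "Approved"


def meets_publishing_criteria(asset: Asset) -> bool:
    """Return True when the asset should be pushed to the search index."""
    return (
        asset.asset_type in PUBLISHABLE_TYPES
        and PUBLISHABLE_COLLECTION in asset.collections
        and asset.state == PUBLISHABLE_STATE
        and asset.should_publish
    )
```

Keeping the rule in one place makes it easier to keep the Content Hub trigger conditions and the connector logic aligned.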

Once the criteria are defined, a connector is required to transfer relevant documents to Sitecore Search. When publishing a document, the connector must ensure that a public link exists. If necessary, it generates one before sending the asset’s metadata and link to Sitecore Search for indexing. When unpublishing a document, the connector removes it from Sitecore Search. It is recommended to write the connector as a serverless function, with Azure Functions being our preference.
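The publish and unpublish flow described above could be sketched as follows. The ContentHub and SearchIndex interfaces, their method names, and the "url" field are hypothetical stand-ins for whatever clients your connector uses; they are not the Content Hub or Sitecore Search SDKs.

```python
from typing import Optional, Protocol


class ContentHub(Protocol):
    """Hypothetical wrapper around the Content Hub calls the connector needs."""
    def get_public_link(self, asset_id: str) -> Optional[str]: ...
    def create_public_link(self, asset_id: str) -> str: ...
    def get_metadata(self, asset_id: str) -> dict: ...


class SearchIndex(Protocol):
    """Hypothetical wrapper around the search ingestion calls."""
    def upsert(self, document_id: str, fields: dict) -> None: ...
    def delete(self, document_id: str) -> None: ...


def handle_asset_event(asset_id: str, publish: bool,
                       hub: ContentHub, index: SearchIndex) -> str:
    """Index the asset on publish, remove it on unpublish."""
    if not publish:
        index.delete(asset_id)
        return "removed"

    # Reuse an existing public link; create one only when none exists.
    link = hub.get_public_link(asset_id) or hub.create_public_link(asset_id)

    fields = dict(hub.get_metadata(asset_id))
    fields["url"] = link  # assumed attribute name for the public link
    index.upsert(asset_id, fields)
    return "indexed"
```

Passing the two clients in as arguments keeps the function easy to test and lets the same logic run inside a serverless function or a local script.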

The connector will need to focus on two specific areas:

1. Event Hook & Trigger Configuration

  • Set up a trigger and an action in Content Hub to call the connector when assets move into or out of the criteria.
  • Configure Content Hub to trigger these webhooks or events for every relevant content update.
  • Configure Search Push API according to the API documentation.

2. Data Preparation & Transformation

  • Ensure all required attributes (ID, title, description, image_url, content_type) are configured in Sitecore Search
  • Create and publish an API push source with Enable Incremental Updates turned on
  • Obtain an API key with the ingestion scope
  • Make a POST call for each locale to the Create Document endpoint with the source ID, domain ID, and locale, passing the document ID and attribute values
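The final step above might look like the sketch below, which only builds the request and does not send it. The URL path layout, the entity segment, the request body shape, and the use of the Authorization header for the API key are assumptions; confirm them, along with the base URL for your region, against the Sitecore Search API documentation before use.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_create_document_request(base_url: str, domain_id: str, source_id: str,
                                  entity: str, locale: str, api_key: str,
                                  document_id: str, fields: dict) -> Request:
    """Build (but do not send) a POST request for one document in one locale."""
    # Assumed path layout; verify against the API documentation.
    path = (
        f"/ingestion/v1/domains/{quote(domain_id, safe='')}"
        f"/sources/{quote(source_id, safe='')}"
        f"/entities/{quote(entity, safe='')}/documents"
    )
    url = f"{base_url.rstrip('/')}{path}?{urlencode({'locale': locale})}"

    # Assumed body shape: the document ID plus its attribute values.
    body = {"document": {"id": document_id, "fields": fields}}

    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

A connector would call this once per locale and send each request with urllib.request.urlopen or an HTTP client of its choice, handling non-success responses as appropriate.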

For large documents (e.g., PDFs), implement logic within the Content Hub event handler to extract and trim the text content so that it fits within the Sitecore Search size limit. Strategies to consider:

  • Extract only the first N characters/words or relevant summary.
  • Strip images and unneeded binary content.
  • Consider running OCR or text summarization if necessary.
  • Log and handle scenarios where content size still exceeds limits after trimming.

Content Hub’s Media Processing configuration automatically generates a downloadExtractedContent rendition for certain file types, including .doc, .docx, .pdf, .pptx, .txt, and .xlsx. This rendition is a plain text file containing the document's extracted text, which can be sent to Sitecore Search to improve ingestion and searchability.

Since Sitecore Search has a 256KB payload limit, larger documents must be processed before submission. To avoid errors, it's best to strip unnecessary whitespace and truncate the text. A practical limit is 200KB, allowing room for metadata and other required fields.
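A minimal sketch of that clean-up step, assuming the extracted text is already available as a string. The function name and the default budget constant are illustrative. The limit is measured in UTF-8 bytes rather than characters, because the payload limit applies to the encoded request.

```python
import re

MAX_TEXT_BYTES = 200 * 1024  # practical budget below the 256KB payload limit


def prepare_text(text: str, max_bytes: int = MAX_TEXT_BYTES) -> str:
    """Collapse whitespace and truncate to at most max_bytes of UTF-8."""
    # Replace every run of whitespace (spaces, tabs, newlines) with one space.
    collapsed = re.sub(r"\s+", " ", text).strip()

    encoded = collapsed.encode("utf-8")
    if len(encoded) <= max_bytes:
        return collapsed

    # Cut at the byte budget; errors="ignore" drops a multi-byte character
    # that was split in half at the boundary instead of raising an error.
    return encoded[:max_bytes].decode("utf-8", errors="ignore").rstrip()
```

Truncating on a byte boundary can end mid-word; if that matters for display, the result can be cut back further to the last space.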

Insights

The process of integrating content and structured data into Sitecore Search follows the same principles as for documents, with one key difference: public links are not required.

Content

Like documents, content items need clear publishing criteria. The best approach is to use Sitecore Experience Edge, where content conditions are defined, and the connector is triggered when the PublishState changes. If Experience Edge is not in use, the FinalLifecycleStatus of the content can be used instead.

Once the connector is triggered, it will either add or remove content from Sitecore Search. Since Sitecore Search enforces a 256KB payload limit, it’s often practical to send only an abstract or summary rather than the full content. The full article or page can then be retrieved via a Sitecore Search crawler from its published web location, ensuring a balance between search performance and content depth.

Textual relevance by attribute in Sitecore Search lets you assign weights (boosts) to specific attributes so that matches in higher-weighted fields rank above matches in lower-weighted ones.

Why weight shorter fields more heavily?

  • Precision of intent: Short fields like title, tags, or category are concise summaries. A match in these fields almost always indicates strong relevance to the user’s intent.
  • Signal-to-noise ratio: Long fields (e.g., full description or body text) contain many words, some peripheral to the core topic. Matching within those large blocks is more likely to be incidental, diluting relevance.
  • Performance and stability: Boosted short-field matches reduce reliance on term frequency across verbose content, yielding more consistent rankings.

Best practice:

  • Assign highest boost to title and tags
  • Mid-level boost to short metadata (e.g., category, documentType)
  • Lower (or zero) boost to long-form fields (description, body)

This ensures that succinct, highly indicative fields drive result order, while longer fields still contribute to recall without overpowering the ranking.
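To see the effect of such weights, consider the toy scoring function below. It is only an illustration of the principle and not how Sitecore Search computes relevance; the weight values are arbitrary assumptions, and in practice the boosts are configured in Sitecore Search itself rather than in code.

```python
# Illustrative weights only: short, indicative fields get the highest boost.
FIELD_WEIGHTS = {
    "title": 3.0,
    "tags": 3.0,
    "category": 2.0,
    "documentType": 2.0,
    "description": 0.5,
    "body": 0.5,
}


def score(document: dict, term: str) -> float:
    """Sum the weights of every field whose text contains the search term."""
    needle = term.lower()
    return sum(
        weight
        for name, weight in FIELD_WEIGHTS.items()
        if needle in str(document.get(name, "")).lower()
    )


manual = {"title": "Pump installation manual", "body": "Step-by-step guide."}
newsletter = {"title": "Spring newsletter", "body": "We also launched a new pump."}

# The title match outranks the incidental mention in long-form body text.
ranked = sorted([newsletter, manual], key=lambda d: score(d, "pump"), reverse=True)
```

Here the manual ranks first because its match is in the title, while the newsletter still appears in the results thanks to the lower-weighted body match.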

Structured Data

Structured data - such as product information, technical specifications, or other metadata - is typically not displayed in full within websites or mobile applications, but it is critical for search and filtering. As with content and documents, a clear publishing workflow must be defined, with FinalLifecycleStatus being the most appropriate trigger for determining when structured data should be pushed to or removed from Sitecore Search.

Overall Approach

Content from Sitecore Content Hub can be made searchable in Sitecore Search by pushing data via a connector rather than relying on crawling. This ensures that updates are reflected promptly and accurately.

Within Content Hub, an action is configured to call the connector whenever an asset enters or exits the defined publishing criteria. This action is linked to two triggers:

  1. One for publishing, ensuring relevant content is indexed.
  2. One for unpublishing, ensuring outdated or removed content is no longer searchable.

© Copyright 2025, Sitecore. All Rights Reserved
