Working with XPaths

How to control what content is indexed and used in search results

We made the Site Search 360 crawler as intelligent as possible when it comes to analyzing your website and picking the right title, image, and content for your search results.

Nonetheless, it might still be necessary to fine-tune your indexing rules by pointing the crawler directly to the desired content or by excluding unwanted pieces of information from being indexed and, therefore, used in the search.

This can be done via XPath expressions placed in the Site Search 360 control panel. You can set up the general rules on the Crawler settings page and the result grouping-specific rules on the Result Groups page (if you're using any).

Check out these steps:

  1. First, let's search for and install the Google Chrome extension called "XPath Helper". It will allow us to easily define XPaths right from your own site.

  2. Navigate to one of your website's pages. Press the XPath Helper icon in the top right corner of your browser to open the black overlay which will reveal the currently selected XPath expression.

  3. Now we want to extract the main content. After opening the XPath Helper, hold the [Shift] key and hover your mouse over your website's elements.

    You will see how the extension highlights them in yellow while displaying the XPath query in the black overlay box. As you move your mouse, this XPath query will change. Try to get all the content you are targeting highlighted in yellow.

    The Result half of the black overlay box allows you to preview the targeted content.

  4. Tweak your XPath expression by shortening it. There are two ways of shortening an XPath query: you can remove steps from the end to select a parent element that contains more child nodes, or you can keep the tail and cut the head off to make it match more generally. When shortening from the front, make sure your XPath still starts with //.

  5. Copy the XPath query over to the Site Search 360 control panel and place it under Data Structuring -> Content Extraction in the appropriate XPath section: Title XPaths, Image XPaths, Include and Exclude Content XPaths.

  6. Press the "Test" button and enter your webpage URL to test the XPath query. If everything is fine, you will see the extracted content, headline, or image URL below. You can also use Index Single URL to check everything that will be extracted from this page at once.
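
To make the shortening advice in step 4 more concrete, here is a minimal sketch in Python. The page markup and the paths are made up for illustration, and the standard library's ElementTree is only a stand-in for the crawler: it supports a subset of XPath and expects relative paths, so an expression you would enter as //article/p in the control panel is written as .//article/p here.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, written as well-formed XML so ElementTree can parse it.
PAGE = """<html>
  <body>
    <div id="page">
      <article class="post">
        <h1>Title</h1>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
      </article>
      <div class="sidebar"><p>Related links</p></div>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)


def texts(path):
    """Return the whitespace-collapsed text of every element matched by path."""
    return [" ".join("".join(el.itertext()).split()) for el in root.findall(path)]


# A long, specific path such as a helper tool might report: one paragraph only.
full = texts("./body/div/article/p[1]")

# Shortened from the end: selects the whole <article> with all of its child nodes.
tail_cut = texts("./body/div/article")

# Shortened from the front: any <p> directly inside any <article>, wherever it sits.
head_cut = texts(".//article/p")

print(full)
print(tail_cut)
print(head_cut)
```

Cutting the tail widens the match to the parent element, while cutting the head keeps the same target but no longer depends on the exact nesting above it, which tends to hold up better when page layouts differ.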

Default XPaths and common strategies for your search results

You can use XPath expressions (one per line) for:

  1. Title XPaths pointing to the main title of the page. Default is //h1, i.e. the crawler takes your <h1> heading. Other common scenarios include //title, to pick up the page title tag content, or sometimes //h2. Change it according to your site structure.

    Title Regular expression allows you to apply a regular expression condition on the extracted titles, if you need even more control. For example, you might have your brand or company name repeated in every page title: <title>Working with XPaths – Site Search 360</title>

    To use only the "Working with XPaths" part as a search result title, use //title as your Title XPath and add ([^–])+ as the Title Regular expression. The "– Site Search 360" part will then be cut off.

  2. Image XPaths pointing to the main picture on your page. These images, if available, are automatically shown as search result thumbnails. Leave this field empty if our default crawler settings work well for your site, or adjust it to point to a specific image instead. For example: //img[@id='main']/@src

    If your images are lazy-loaded, try something similar to the following pattern: //div[@class='product-detail-images']//img/@data-src

    You can also tell the crawler to ignore all images by toggling "Extract Images" off. Alt texts and captions can be indexed separately.

  3. Default Image XPath pointing to the default image to be used when no other image is found. For example, //img[@id='logo']/@src

  4. Include Content XPaths pointing to the content blocks that should be indexed. One XPath per line. Leave empty if everything should be indexed.

  5. Exclude Content XPaths pointing to the content blocks that should be ignored by the crawler. One XPath per line. Leave empty if nothing should be excluded.

  6. Search Snippet XPath (located under Search Settings -> Search Snippet -> Use content behind search snippet XPath) pointing to the content that you want to display in the search results. By default, we show the content around the terms matching the search query.

    Another common strategy is to use your page meta descriptions instead. That's why //meta[@name="description"]/@content is pre-filled for you. To start showing meta descriptions in your search snippets, go to Search Settings and change the Search Snippet Source.
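
The title and snippet rules above can be tried out locally before changing any settings. The following is a rough sketch in Python, not a description of how the Site Search 360 crawler works internally: the sample page is hypothetical, ElementTree stands in for a full XPath engine (it cannot select attributes with /@content, so the attribute is read in a second step), and it is an assumption that only the first regular expression match is kept and then trimmed.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page, written as well-formed XML so the standard library can parse it.
PAGE = """<html>
  <head>
    <title>Working with XPaths – Site Search 360</title>
    <meta name="description" content="How to control what content is indexed."/>
  </head>
  <body><h1>Working with XPaths</h1></body>
</html>"""

# The expression from the title example: one or more characters that are not an en dash.
TITLE_REGEX = r"([^–])+"


def extract_title(root, title_regex=None):
    """Read the <title> text; if a regex is given, keep only its first match (assumed behavior)."""
    raw = root.findtext(".//title", default="")
    if not title_regex:
        return raw.strip()
    match = re.search(title_regex, raw)
    return match.group(0).strip() if match else raw.strip()


def extract_snippet(root):
    """Rough equivalent of //meta[@name="description"]/@content."""
    meta = root.find(".//meta[@name='description']")
    return meta.get("content", "") if meta is not None else ""


root = ET.fromstring(PAGE)
print(extract_title(root))               # title as written in the page
print(extract_title(root, TITLE_REGEX))  # title with the brand suffix cut off
print(extract_snippet(root))             # meta description used as snippet text
```

Whatever results you get locally, it is still worth confirming them with the "Test" button in the control panel, since that shows what the crawler itself extracts.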