c# search lucene indexing sitecore

How do you configure Lucene in Sitecore to only index the latest version of an item on the master db?


I recognise this is a moot point on the web database, so this question applies to the master db...

I have a custom index set up in Sitecore 6.4.1 as follows:

<index id="search_content_US" type="Sitecore.Search.Index, Sitecore.Kernel">
    <param desc="name">$(id)</param>
    <param desc="folder">_search_content_US</param>
    <Analyzer ref="search/analyzer" />
    <locations hint="list:AddCrawler">
        <search_content_home type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
            <Database>master</Database>
            <Root>/sitecore/content/usa home</Root>
            <Tags>home content</Tags>
        </search_content_home>
    </locations>
</index>

I query the index like this (I am using techphoria414's SortableIndexSearchContext from this answer: How to sort/filter using the new Sitecore.Search API):

private SearchHits GetSearchResults(SortableIndexSearchContext searchContext, string searchTerm)
{
    CombinedQuery query = new CombinedQuery();
    query.Add(new FullTextQuery(searchTerm), QueryOccurance.Must);
    return searchContext.Search(query, Sort.RELEVANCE);
}

...

SearchHits hits = GetSearchResults(searchContext, searchTerm);

hits is a collection of search hits from my index. When I iterate through hits I can see that there are many duplicates of the same items in Sitecore, 1 per version of the item.

I then do the following to get a SearchResultCollection:

SearchResultCollection results = hits.FetchResults(0, hits.Length);

This combines all of the duplicates into a single SearchResult object. This object represents 1 version of a particular item, and has a property called SubResults which is a collection of SearchResults that represent all of the other item versions.

Here's my problem:

The version of the item represented by the SearchResult is NOT the current published version of the item! It appears to be a randomly selected version (whichever the search method hit first in the index). The latest version is included in the SubResults collection, however.

E.g.:

SearchResult
 |
 |- Version 8 // main result
 ...
 |- SubResults
      |
      |- Version 9 // latest version
      |- Version 3
      |- Version 5
      ... // all versions in random order

How do I prevent this from happening on the master db? Either by preventing Lucene from indexing old versions of items, or by doing some manipulation of the result set to get the latest version from the SubResults?

As an aside, why does Lucene bother to index old versions of items anyway? Surely this is pointless for searching content on your website as the old versions are not visible?


Solution

  • You can implement a custom crawler that overrides IndexVersion:

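    using Sitecore.Data.Items;
    using Sitecore.Search;
    using Sitecore.Search.Crawlers;

    public class IndexCrawler : DatabaseCrawler
    {
        protected override void IndexVersion(Item item, Item latestVersion, IndexUpdateContext context)
        {
            // Skip every version that is not the latest one, so it never reaches the index.
            if (item.Versions.Count > 0 && item.Version.Number != latestVersion.Version.Number)
                return;

            // Only the latest version is handed to the default indexing logic.
            base.IndexVersion(item, latestVersion, context);
        }
    }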
    

    This ensures that only the latest version of an item gets into your index, so it is the only version that can be pulled out of that index.

    You would also need to update your configuration file so that the crawler entry under the index's locations uses this class instead of Sitecore.Search.Crawlers.DatabaseCrawler.
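
    As a rough sketch of that configuration change, the index definition from the question might end up looking like the fragment below. The namespace MyProject.Search and the assembly name MyProject are placeholders only (they are not from the original post); substitute wherever the IndexCrawler class actually lives. Everything else is copied unchanged from the question's config.

    ```xml
    <index id="search_content_US" type="Sitecore.Search.Index, Sitecore.Kernel">
        <param desc="name">$(id)</param>
        <param desc="folder">_search_content_US</param>
        <Analyzer ref="search/analyzer" />
        <locations hint="list:AddCrawler">
            <!-- Hypothetical type name: point this at your own IndexCrawler class and assembly -->
            <search_content_home type="MyProject.Search.IndexCrawler, MyProject">
                <Database>master</Database>
                <Root>/sitecore/content/usa home</Root>
                <Tags>home content</Tags>
            </search_content_home>
        </locations>
    </index>
    ```

    After changing the crawler type, the index most likely needs a full rebuild so that documents already stored for older versions are dropped.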