How to highlight only results of PrefixQuery in Lucene and not whole words?


I'm fairly new to Lucene and may be doing something wrong, so please correct me if that is the case. I've been searching for an answer for a few days now and I'm not sure where to go from here.

The goal is to use Lucene.NET to search user names by prefix (like StartsWith) and highlight only the matched part. For instance, if I search for abc in the list ['a', 'ab', 'abc', 'abcd', 'abcde'], it should return just the last three, in the form ['<b>abc</b>', '<b>abc</b>d', '<b>abc</b>de'].
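
To pin down the expected behaviour independently of Lucene, here is a minimal sketch in plain C#. `StartsWithQuery` and `WrapPrefix` are hypothetical helper names, not Lucene.NET APIs, and the case-insensitive ordinal comparison is an assumption.

```csharp
using System;
using System.Linq;

// Hypothetical helper: true when the name begins with the query (case-insensitive).
static bool StartsWithQuery(string name, string query) =>
    name.StartsWith(query, StringComparison.OrdinalIgnoreCase);

// Hypothetical helper: wraps the first `length` characters in <b> tags.
// Slicing the original text keeps its casing intact.
static string WrapPrefix(string name, int length) =>
    $"<b>{name.Substring(0, length)}</b>{name.Substring(length)}";

var names = new[] { "a", "ab", "abc", "abcd", "abcde" };
var query = "abc";

var highlighted = names
    .Where(n => StartsWithQuery(n, query))
    .Select(n => WrapPrefix(n, query.Length))
    .ToList();

Console.WriteLine(string.Join(", ", highlighted));
// <b>abc</b>, <b>abc</b>d, <b>abc</b>de
```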

Here is how I approached this.

First the index creation:

using var indexDir = FSDirectory.Open(Path.Combine(IndexDirectory, IndexName));
using var standardAnalyzer = new StandardAnalyzer(CurrentVersion);

var indexConfig = new IndexWriterConfig(CurrentVersion, standardAnalyzer);
indexConfig.OpenMode = OpenMode.CREATE_OR_APPEND;

using var indexWriter = new IndexWriter(indexDir, indexConfig);
if (indexWriter.NumDocs == 0)
{
    //fill the index with Documents
}

The documents are created like this:

static Document BuildClientDocument(int id, string surname, string name)
{
    var document = new Document()
    {
        new StringField("Id", id.ToString(), Field.Store.YES),

        new TextField("Surname", surname, Field.Store.YES),
        new TextField("Surname_sort", surname.ToLower(), Field.Store.NO),

        new TextField("Name", name, Field.Store.YES),
        new TextField("Name_sort", name.ToLower(), Field.Store.NO),
    };
    
    return document;
}

The search is done like this:

using var multiReader = new MultiReader(indexWriter.GetReader(true)); //the plan was to use multiple indexes per entity types
var indexSearcher = new IndexSearcher(multiReader);

var queryString = "abc"; //just as a sample
var queryWords = queryString.SplitWords();

var query = new BooleanQuery();
queryWords
    .Process((word, index) =>
    {
        var boolean = new BooleanQuery()
        {
            { new PrefixQuery(new Term("Surname", word)) { Boost = 100 }, Occur.SHOULD }, //surnames are most important to match
            { new PrefixQuery(new Term("Name", word)) { Boost = 50 }, Occur.SHOULD }, //names are less important
        };
        boolean.Boost = (queryWords.Count() - index); //first words in a search query are more important than others
        
        query.Add(boolean, Occur.MUST);
    })
;

var topDocs = indexSearcher.Search(query, 50, new Sort( //sort by relevance and then in lexicographical order
    SortField.FIELD_SCORE,
    new SortField("Surname_sort", SortFieldType.STRING),
    new SortField("Name_sort", SortFieldType.STRING)
));

And highlighting:

var htmlFormatter = new SimpleHTMLFormatter();
var queryScorer = new QueryScorer(query);
var highlighter = new Highlighter(htmlFormatter, queryScorer);
foreach (var found in topDocs.ScoreDocs)
{
    var document = indexSearcher.Doc(found.Doc);
    var surname = document.Get("Surname"); //just for simplicity
    var surnameFragment = highlighter.GetBestFragment(standardAnalyzer, "Surname", surname);
    Console.WriteLine(surnameFragment);
}

The problem is that the highlighter returns results like this:

<b>abc</b>
<b>abcd</b>
<b>abcde</b>
<b>abcdef</b>

So it highlights entire words even though I was searching for prefixes. Explain returned NON-MATCH for every hit, so I'm not sure it is helpful here.

Is it possible to highlight only the part that was searched for, as in my example above?


Solution

  • After searching a bit more, I came to the conclusion that for such highlighting to work, the index generation has to be changed so that terms are split into parts and offsets are calculated per part. Otherwise the highlighter can only highlight whole words (tokens).
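
    To see why token offsets limit the granularity, here is a rough model in plain C#. It is not Lucene's implementation: `Tokenize` and `HighlightTokens` are hypothetical names that only mimic the idea that an analyzer emits whole-word tokens with start and end offsets, and that a highlighter wraps every token it considers a match.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;

    // Hypothetical whitespace tokenizer: yields each word with its character
    // offsets, roughly the information a highlighter gets from an analyzer.
    static List<(string Text, int Start, int End)> Tokenize(string input)
    {
        var tokens = new List<(string Text, int Start, int End)>();
        var i = 0;
        while (i < input.Length)
        {
            if (char.IsWhiteSpace(input[i])) { i++; continue; }
            var start = i;
            while (i < input.Length && !char.IsWhiteSpace(input[i])) i++;
            tokens.Add((input.Substring(start, i - start), start, i));
        }
        return tokens;
    }

    // Wraps every token that starts with the prefix. Only token boundaries are
    // known here, so the whole token is wrapped, never just the matching part.
    static string HighlightTokens(string input, string prefix)
    {
        var result = new StringBuilder();
        var last = 0;
        foreach (var token in Tokenize(input))
        {
            result.Append(input, last, token.Start - last); // text between tokens
            var matches = token.Text.StartsWith(prefix, StringComparison.OrdinalIgnoreCase);
            result.Append(matches ? $"<b>{token.Text}</b>" : token.Text);
            last = token.End;
        }
        result.Append(input, last, input.Length - last); // trailing text
        return result.ToString();
    }

    Console.WriteLine(HighlightTokens("a ab abc abcd abcde", "abc"));
    // a ab <b>abc</b> <b>abcd</b> <b>abcde</b>
    ```

    In this model the only way to wrap just `abc` is to have tokens whose offsets end right after the prefix, or to do the highlighting on the raw string outside the token stream.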

    So based on this I've managed to build a simple highlighter of my own.

    public class Highlighter
    {
        private const string TempStartToken = "\x02";
        private const string TempEndToken = "\x03";
    
        private const string SearchPatternTemplate = $"[{TempStartToken}{TempEndToken}]*{{0}}";
        private const string ReplacePattern = $"{TempStartToken}$&{TempEndToken}";
    
        private readonly ConcurrentDictionary<HighlightKey, Regex> _regexPatternsCache = new();
    
        private static string GetHighlightTypeTemplate(HighlightType highlightType) =>
            highlightType switch
            {
                HighlightType.Starts => "^{0}",
                HighlightType.Contains => "{0}",
                HighlightType.Ends => "{0}$",
                HighlightType.Equals => "^{0}$",
                _ => throw new ArgumentException($"Unsupported {nameof(HighlightType)}: '{highlightType}'", nameof(highlightType)),
            };
    
        public string Highlight(string text, IReadOnlySet<string> words, string startToken, string endToken, HighlightType highlightType)
        {
            foreach (var word in words)
            {
                var key = new HighlightKey
                {
                    Word = word,
                    HighlightType = highlightType,
                };
    
                var regex = _regexPatternsCache.GetOrAdd(key, _ =>
                {
                    var parts = word.Select(w => string.Format(SearchPatternTemplate, Regex.Escape(w.ToString())));
                    var pattern = string.Concat(parts);
                    var highlightPattern = string.Format(GetHighlightTypeTemplate(highlightType), pattern);
    
                    return new Regex(highlightPattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);
    
                });
                
                text = regex.Replace(text, ReplacePattern);
            }
    
            return text
                .Replace(TempStartToken, startToken)
                .Replace(TempEndToken, endToken)
            ;
        }
    
        private record HighlightKey
        {
            public string Word { get; init; }
            public HighlightType HighlightType { get; init; }
        }
    }
    
    public enum HighlightType
    {
        Starts,
        Contains,
        Ends,
        Equals,
    }
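
    The `[\x02\x03]*` group that `SearchPatternTemplate` puts in front of every character is what lets several search words be applied one after another: a later word still matches even when an earlier pass has already inserted temporary markers in the middle of it. The sketch below rebuilds the pattern outside the class to show this. `BuildPattern` is a hypothetical helper that mirrors the lambda passed to `GetOrAdd`, fixed to the `Starts` template.

    ```csharp
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    const string Start = "\x02"; // temporary start marker, as in the class above
    const string End = "\x03";   // temporary end marker

    // Hypothetical helper: builds the "starts with" pattern, allowing stray
    // markers in front of every character of the word.
    string BuildPattern(string word) =>
        "^" + string.Concat(word.Select(c => $"[{Start}{End}]*{Regex.Escape(c.ToString())}"));

    var text = "abcd";
    foreach (var word in new[] { "ab", "abc" })
    {
        // "$&" re-inserts the whole match between the two markers.
        text = Regex.Replace(text, BuildPattern(word), $"{Start}$&{End}", RegexOptions.IgnoreCase);
    }

    // Swap the temporary markers for the real tags only at the very end.
    var html = text.Replace(Start, "<b>").Replace(End, "</b>");
    Console.WriteLine(html);
    // <b><b>ab</b>c</b>d
    ```

    Without the marker-tolerant group, the second word `abc` would no longer match once `ab` had been wrapped. The cost is nested tags whenever search words overlap.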
    

    Use it like this:

    var queries = new[] { "abc" }.ToHashSet();
    var search = "a ab abc abcd abcde";
    
    var highlighter = new Highlighter();
    var outputs = search
        .Split((string[])null, StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
        .Select(w => highlighter.Highlight(w, queries, "<b>", "</b>", HighlightType.Starts))
    ;
    
    // Dump() and Util.RawHtml() are LINQPad helpers; use Console.WriteLine elsewhere.
    var result = string.Join(" ", outputs).Dump();
    Util.RawHtml(result).Dump();
    

    Output looks like this (the raw string first, then the rendered HTML, where the matched prefixes appear in bold):

    a ab <b>abc</b> <b>abc</b>d <b>abc</b>de
    
    a ab abc abcd abcde

    I'm open to any better solutions.