How to highlight the longest Solr token


I'm trying to highlight the exact search term from a query, but the highlighted term is coming back as the shortest token from my tokenized field. For example, a query of "Entr" highlights only "Ent" in "Entry". I'd like the highlighting to cover "Entr" instead.

This is the simplest query that matches on every instance of the term in the answer: q=Title_Tokens:Entr&hl=on&hl.fl=Title_Tokens&hl.useFastVectorHighlighter=true
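For anyone reproducing this, here is a minimal sketch of how that request could be assembled in Python. The host, port, and core name (`mycore`) are assumptions for illustration and are not from my setup; only the query parameters come from the query above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: host, port, and core name are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"

# Parameters exactly as used in the question.
params = {
    "q": "Title_Tokens:Entr",
    "hl": "on",
    "hl.fl": "Title_Tokens",
    "hl.useFastVectorHighlighter": "true",
}


def build_url(base, query_params):
    """Return the select URL with the parameters percent-encoded."""
    return base + "?" + urlencode(query_params)


url = build_url(SOLR_SELECT_URL, params)

# Round-trip check: decoding the query string gives back the same parameters.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["q"])
```

Encoding the parameters through `urlencode` keeps the colon in `Title_Tokens:Entr` from being misread by anything between the client and Solr.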

Removing hl.useFastVectorHighlighter highlights the entire term, but only once per result, and in some cases nothing is highlighted at all.

I've tried adding hl.q, hl.highlightMultiTerm, hl.usePhraseHighlighter, and several other parameters, but I can only get every instance of the shortest token or the first instance of the search term.

The field I'm trying to highlight on is Title_Tokens, which is copied from a string.

<field name="RawTitle" type="string" required="true" />
<field name="Title_Tokens" type="Tokenized_Title" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
<copyField source="RawTitle" dest="Title_Tokens" />

<fieldType name="Tokenized_Title" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" maxGramSize="15" minGramSize="3"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

When analyzing my field for "entr", I see the tokens "ent", "entr", and "ntr". To me, it looks like the first token to match is highlighted, but I want to prioritize the longest match. Is that what is happening, or am I doing something else incorrectly?
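To show where those three tokens come from, here is a small Python model of n-gram tokenization with minGramSize=3 and maxGramSize=15. It is a simplified sketch, not Lucene's implementation: the `ngrams` helper is hypothetical, and the lowercasing stands in for LowerCaseFilterFactory.

```python
def ngrams(text, min_gram=3, max_gram=15):
    """List (gram, start, end) tuples ordered by start offset, then length.

    Simplified stand-in for an n-gram tokenizer followed by a lowercase
    filter; it is not the Lucene code.
    """
    text = text.lower()
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # longer grams from this offset would run off the end
            grams.append((text[start:end], start, end))
    return grams


query_tokens = [gram for gram, _, _ in ngrams("Entr")]
print(query_tokens)  # ['ent', 'entr', 'ntr']
```

The shortest gram at offset 0 comes first in this ordering, which is consistent with "ent" being the token that ends up highlighted.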

I also considered using EdgeNGramTokenizerFactory to match from the back of the word, but that would stop matches in the middle of the word.


Solution

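  • The field type needed a query analyzer. Indexing was working correctly, but with only an index analyzer defined, the search term was also run through the n-gram tokenizer at query time, so it matched on every gram of the term and the first matching token was highlighted. With a separate query analyzer that keeps the search term as a single token, only the longest gram is matched. (minGramSize is also lowered to 1 in the configuration below.)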

    <fieldType name="Tokenized_Title" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
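The effect of the query analyzer can be sketched with a small Python model. This is an illustration under simplifying assumptions, not Solr's matching or highlighting code: `ngrams` and `matched_spans` are hypothetical helpers, and `str.split` stands in for StandardTokenizerFactory, which is only reasonable for a single-word query like this one.

```python
def ngrams(text, min_gram, max_gram):
    """List (gram, start, end) tuples for a lowercased string."""
    text = text.lower()
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break
            grams.append((text[start:end], start, end))
    return grams


def matched_spans(title, query_terms, min_gram, max_gram=15):
    """Return the indexed grams of `title` that equal any query term."""
    terms = set(query_terms)
    return [hit for hit in ngrams(title, min_gram, max_gram) if hit[0] in terms]


title = "Entry"

# Without a query analyzer: the query is n-grammed like the indexed text.
fallback_terms = [gram for gram, _, _ in ngrams("Entr", 3, 15)]
fallback_hits = matched_spans(title, fallback_terms, min_gram=3)

# With the query analyzer: the query stays one lowercased token.
query_terms = "Entr".lower().split()
query_hits = matched_spans(title, query_terms, min_gram=1)

print(fallback_hits)  # [('ent', 0, 3), ('entr', 0, 4), ('ntr', 1, 4)]
print(query_hits)     # [('entr', 0, 4)]
```

In the first case three overlapping spans compete for the highlight; in the second, the only candidate is the full search term at offsets 0 to 4.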