Solr highlighting: hl.simple.pre/post sometimes doesn't appear


With Solr, I am trying to highlight some text using the hl.formatter option with hl.simple.pre/post.

My problem is that the hl.simple.pre/post markup sometimes doesn't appear in the highlighting results, and I don't understand why.

For example, I call this URL:

http://localhost:8080/solr/Employees/select?q=lastName:anthan&fl=lastName&wt=json&indent=true&hl=true&hl.fl=lastName&hl.simple.pre=<em>&hl.simple.post=</em>

I get:

 ..."highlighting": {
    "NB0094418": {
      "lastName": [
        "Yogan<em>anthan</em>" => OK
      ]
    },
    "NB0104046": {
      "lastName": [
        "Vijayakanthan" => not OK, I want Vijayak<em>anthan</em>
      ]
    },
    "NB0144981": {
      "lastName": [
        "Parmananthan" => not OK, I want Parman<em>anthan</em>
      ]
    },...

Does anyone have an idea why I am seeing this behavior?

My configuration:

schema.xml

<fieldType name="nameType" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="50" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
    </analyzer>

    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
    </analyzer>
</fieldType>

...
<fields>
    <field name="lastName" type="nameType" indexed="true" stored="true" required="true" />
</fields>

solrconfig.xml

<requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
        <str name="echoParams">explicit</str>
    </lst>
</requestHandler>

...

<searchComponent class="solr.HighlightComponent" name="highlight">
    <highlighting>
        <fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter">
            <lst name="defaults">
                <int name="hl.fragsize">100</int>
            </lst>
        </fragmenter>

        <fragmenter name="regex" class="solr.highlight.RegexFragmenter">
            <lst name="defaults">
                <int name="hl.fragsize">70</int>
                <float name="hl.regex.slop">0.5</float>
                <str name="hl.regex.pattern">[-\w ,/\n\&quot;&apos;]{20,200}</str>
            </lst>
        </fragmenter>

        <formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
            <lst name="defaults">
                <str name="hl.simple.pre"><![CDATA[<em>]]></str>
                <str name="hl.simple.post"><![CDATA[</em>]]></str>
            </lst>
        </formatter>

        <encoder name="html" default="true" class="solr.highlight.HtmlEncoder" />

        <fragListBuilder name="simple" default="true" class="solr.highlight.SimpleFragListBuilder" />
        <fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder" />
        <fragmentsBuilder name="default" default="true" class="solr.highlight.ScoreOrderFragmentsBuilder">
        </fragmentsBuilder>

        <fragmentsBuilder name="colored" class="solr.highlight.ScoreOrderFragmentsBuilder">
            <lst name="defaults">
                <str name="hl.tag.pre"><![CDATA[
                <b style="background:yellow">,<b style="background:lawgreen">,
                <b style="background:aquamarine">,<b style="background:magenta">,
                <b style="background:palegreen">,<b style="background:coral">,
                <b style="background:wheat">,<b style="background:khaki">,
                <b style="background:lime">,<b style="background:deepskyblue">]]></str>
                <str name="hl.tag.post"><![CDATA[</b>]]></str>
            </lst>
        </fragmentsBuilder>
    </highlighting>
</searchComponent>

Solution

  • I was dealing with a very similar problem until yesterday. I tried many different fixes iteratively, so some details of what I ended up with may not be necessary, but I'll describe what eventually worked. Short answer: I think the highlighter is failing to find the term position information it needs on longer fields.

    First, the symptoms I was seeing: sometimes the search term would be highlighted, and sometimes the entire field would show up in the highlighting section without any highlight markup. The pattern turned out to depend on both the length of the field (more precisely, the length of the token that was ngrammed) and the length of the search term: the longer the field, the shorter the search term that could be highlighted successfully.

    It wasn't 1-to-1, though. For a field with 11 or fewer characters, highlighting worked fine in all cases. For a field with 12 characters, no ngram longer than 9 characters was highlighted. For a field with 15 characters, ngrams longer than 7 characters were not highlighted. For fields longer than 18 characters, ngrams longer than 6 characters were not highlighted; for fields longer than 21 characters, ngrams longer than 5 were not highlighted; and for fields longer than 24 characters, nothing longer than 4 characters was highlighted.

    (From the examples you have above, the specific sizes you are seeing are not exactly the same, but I do notice that the names in the documents where highlighting did not work are longer than the one where it did.)
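
To get a feel for how quickly the number of indexed tokens grows with the length of the ngrammed token, here is a minimal Java sketch. It assumes the whole field value is treated as one token and that every ngram of each length from minGramSize to maxGramSize is emitted, as in the question's index analyzer settings (2 and 50). NGramCountSketch and countNGrams are hypothetical names for illustration, not Solr or Lucene API. It only shows the scale involved and does not by itself explain the missing highlights.

```java
public class NGramCountSketch {

    // Number of ngrams with lengths minGram..maxGram in a token of the given length.
    public static int countNGrams(int length, int minGram, int maxGram) {
        int count = 0;
        for (int n = minGram; n <= Math.min(maxGram, length); n++) {
            count += length - n + 1; // one ngram of size n per possible start offset
        }
        return count;
    }

    public static void main(String[] args) {
        // Names taken from the question; 2 and 50 are its minGramSize/maxGramSize.
        String[] names = {"Yogananthan", "Parmananthan", "Vijayakanthan"};
        for (String name : names) {
            System.out.println(name + " (" + name.length() + " chars): "
                    + countNGrams(name.length(), 2, 50) + " ngrams");
        }
    }
}
```

Under those assumptions the count grows roughly with the square of the token length, so even a couple of extra characters adds a noticeable number of overlapping tokens for the highlighter to work through.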

    So, here's what ended up working:

    1. I switched from using WhitespaceTokenizer and NGramFilterFactory to using NGramTokenizerFactory instead. (You are already using this, and I'll have more later on a difficulty this raised for me.) This wasn't sufficient to solve the problem, though, because the term positions still weren't being stored.
    2. I started using the FastVectorHighlighter. This forced some changes in how my schema fields were indexed (including storing the term vectors, positions and offsets), and I also had to change my pre- and post-tag configuration from hl.simple.pre to hl.tag.pre (and similarly for *post).

    Once I had made these changes, the highlighting started working consistently. This had the side-effect, though, of removing the behavior I had been getting from the WhitespaceTokenizer. If I had a field that contained the phrase "this is a test" I was ending up with ngrams that included "s is a", "a tes", etc., and I really just wanted the ngrams of the individual words, not of the whole phrase. There is a note in the NGramTokenizer JavaDocs that you can override NGramTokenizer.isTokenChar() to provide pre-tokenizing, but I couldn't find an example of this on the web. I'll include one below.

    End result:

    WhitespaceSplittingNGramTokenizer.java:

    package info.jwismar.solr.plugin;
    
    import java.io.Reader;
    
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.util.AttributeSource.AttributeFactory;
    import org.apache.lucene.util.Version;
    
    public class WhitespaceSplittingNGramTokenizer extends NGramTokenizer {
    
        public WhitespaceSplittingNGramTokenizer(Version version, Reader input, int minGram, int maxGram) {
            super(version, input, minGram, maxGram);
        }
    
        public WhitespaceSplittingNGramTokenizer(Version version, AttributeFactory factory, Reader input, int minGram,
                int maxGram) {
            super(version, factory, input, minGram, maxGram);
        }
    
        public WhitespaceSplittingNGramTokenizer(Version version, Reader input) {
            super(version, input);
        }
    
        // Treat whitespace as a token boundary so that ngrams never span two words.
        @Override
        protected boolean isTokenChar(int chr) {
            return !Character.isWhitespace(chr);
        }
    }
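
To make the effect of that isTokenChar() override concrete, here is a Lucene-free sketch in plain Java that compares ngramming a whole phrase with ngramming each whitespace-separated word on its own. NGramSplitSketch, ngrams and ngramsPerWord are hypothetical names for illustration only; the real tokenizer above does this work inside Lucene, and its token order and offsets may differ from this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSplitSketch {

    // All ngrams of text with lengths minGram..maxGram, ordered by start offset.
    public static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int n = minGram; n <= maxGram && start + n <= text.length(); n++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    // Split on whitespace first, then ngram each word separately.
    public static List<String> ngramsPerWord(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            grams.addAll(ngrams(word, minGram, maxGram));
        }
        return grams;
    }

    public static void main(String[] args) {
        String phrase = "this is a test";
        // Whole-phrase ngrams include fragments that cross word boundaries.
        System.out.println(ngrams(phrase, 2, 30).contains("s is a"));
        // Per-word ngrams never contain a space.
        System.out.println(ngramsPerWord(phrase, 2, 30).contains("s is a"));
        System.out.println(ngramsPerWord(phrase, 2, 30));
    }
}
```

Note that in this sketch a word shorter than minGram (such as "a" with a minimum of 2) produces no ngrams at all.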
    

    WhitespaceSplittingNGramTokenizerFactory.java:

    package info.jwismar.solr.plugin;
    
    import java.io.Reader;
    import java.util.Map;
    
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.util.TokenizerFactory;
    import org.apache.lucene.util.AttributeSource.AttributeFactory;
    
    public class WhitespaceSplittingNGramTokenizerFactory extends TokenizerFactory {
    
        private final int maxGramSize;
        private final int minGramSize;
    
        /** Creates a new WhitespaceSplittingNGramTokenizerFactory */
        public WhitespaceSplittingNGramTokenizerFactory(Map<String, String> args) {
            super(args);
            minGramSize = getInt(args, "minGramSize", NGramTokenizer.DEFAULT_MIN_NGRAM_SIZE);
            maxGramSize = getInt(args, "maxGramSize", NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE);
            if (!args.isEmpty()) {
                throw new IllegalArgumentException("Unknown parameters: " + args);
            }
        }
    
        @Override
        public Tokenizer create(AttributeFactory factory, Reader reader) {
            return new WhitespaceSplittingNGramTokenizer(luceneMatchVersion, factory, reader, minGramSize, maxGramSize);
        }
    }
    

    These need to be packaged up into a .jar and installed someplace where Solr can find them. One option is to add a lib directive in solrconfig.xml to tell Solr where to look. (I called mine solr-ngram-plugin.jar and installed it in /opt/solr-ngram-plugin/.)

    Inside solrconfig.xml:

    <lib path="/opt/solr-ngram-plugin/solr-ngram-plugin.jar" />
    

    schema.xml (field type definition):

    <fieldType name="any_token_ngram" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="info.jwismar.solr.plugin.WhitespaceSplittingNGramTokenizerFactory" maxGramSize="30" minGramSize="2"/>
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.PatternReplaceFilterFactory"
                pattern="^(.{30})(.*)?" replacement="$1" replace="all" />
        </analyzer>
    </fieldType>
    

    schema.xml (field definitions):

    <fields>
        <field name="property_address_full" type="string" indexed="false" stored="true" />
        <field name="property_address_full_any_ngram" type="any_token_ngram" indexed="true"
            stored="true" omitNorms="true" termVectors="true" termPositions="true"
            termOffsets="true"/>
    </fields>
    <copyField source="property_address_full" dest="property_address_full_any_ngram" />
    

    solrconfig.xml (request handler; you can pass these parameters in the normal select URL instead, if you prefer):

    <!-- request handler to return typeahead suggestions -->
    <requestHandler name="/suggest" class="solr.SearchHandler">
        <lst name="defaults">
            <str name="echoParams">explicit</str>
            <str name="defType">edismax</str>
            <str name="rows">10</str>
            <str name="mm">2</str>
            <str name="fl">*,score</str>
            <str name="qf">
                property_address_full^100.0
                property_address_full_any_ngram^10.0
            </str>
            <str name="sort">score desc</str>
            <str name="hl">true</str>
            <str name="hl.fl">property_address_full_any_ngram</str>
            <str name="hl.tag.pre">|-&gt;</str>
            <str name="hl.tag.post">&lt;-|</str>
            <str name="hl.fragsize">1000</str>
            <str name="hl.mergeContinuous">true</str>
            <str name="hl.useFastVectorHighlighter">true</str>
        </lst>
    </requestHandler>