
Solr document snippets not returned for a document that contains the query word


I am using Solr 3.6.2 to extract snippets from documents that I know contain a specific string. (First of all, is that a correct use of highlighting?) Unfortunately, the snippets I get back do not contain my query string, which is a simple, single word that is not a stop word.

For example, for document 123456, which I know contains "funmitflags", I run a query of the form:

id:123456 and content_en:funmitflags

and

fl=id&hl=true&hl.fl=content_en&hl.snippets=2&hl.alternateField=content_en&hl.maxAlternateFieldLength=400&hl.maxAnalyzedCharacters=2147483647&hl.fragsize=400&rows=100

(I set "content_en" as the alternate field so that I always get some snippet from the document. This field usually holds a large amount of text.) However, I usually get back the first 400 characters of the field instead of the passage that contains "funmitflags".
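
For anyone who wants to reproduce the request, here is a minimal sketch that assembles the parameters above into a URL using only the Java standard library. The endpoint `http://localhost:8983/solr/select` and the class and method names are assumptions for illustration, not part of my setup. The sketch writes the operator as uppercase `AND`, because the Lucene query parser only treats the uppercase form as a boolean operator.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class HighlightRequest {
    // Hypothetical endpoint; adjust host, port, and core to your installation.
    static final String BASE_URL = "http://localhost:8983/solr/select";

    // URL-encodes one key or value as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Collects the query and highlighting parameters from the question.
    static Map<String, String> highlightParams(String id, String term) {
        Map<String, String> p = new LinkedHashMap<String, String>();
        p.put("q", "id:" + id + " AND content_en:" + term);
        p.put("fl", "id");
        p.put("hl", "true");
        p.put("hl.fl", "content_en");
        p.put("hl.snippets", "2");
        p.put("hl.alternateField", "content_en");
        p.put("hl.maxAlternateFieldLength", "400");
        p.put("hl.maxAnalyzedCharacters", "2147483647");
        p.put("hl.fragsize", "400");
        p.put("rows", "100");
        return p;
    }

    // Joins the parameters into a single request URL.
    static String buildUrl(Map<String, String> params) {
        StringBuilder sb = new StringBuilder(BASE_URL);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(enc(e.getKey())).append('=').append(enc(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl(highlightParams("123456", "funmitflags")));
    }
}
```

Pasting the printed URL into a browser is a quick way to check the `highlighting` section of the response without going through client code.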

I can retrieve the document from the admin page; it is only the highlight that is wrong. This is a serious problem, because it affects about 75% of all queries.

In my schema.xml, "content_en" is defined with the type "text_en":

<field name="content_en" type="text_en" indexed="true" stored="true" />

I changed "text_en" from its original definition to the following:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" 
            generateWordParts="1" 
            generateNumberParts="1" 
            catenateWords="0" 
            catenateNumbers="0" 
            catenateAll="0" 
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" 
            generateWordParts="1" 
            generateNumberParts="1" 
            catenateWords="0" 
            catenateNumbers="0" 
            catenateAll="0" 
            splitOnCaseChange="1"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Even after reindexing, I get correct snippets with neither definition. Can someone point me in the right direction? Should I always get a snippet containing my search word?


Solution

  • Thanks to @arun's experiment, which eliminated half the possibilities, I found a solution.

    • As my texts are very large, I raised the limit on the number of tokens indexed per field in solrconfig.xml:

      <maxFieldLength>1000000</maxFieldLength>

    • To speed up highlighting, I started using the FastVectorHighlighter by adding

      solrQuery.set("hl.useFastVectorHighlighter", true);

      to my query. It seems to ignore my highlightSimplePre and highlightSimplePost settings, but that does not matter to me.

    I also had to enable the term vector options (termVectors, termPositions, termOffsets) on my content field:

      <field name="content_en" type="text_en" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
    
    • Of course, I reindexed after making these changes.
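
If you do care about the custom markers, here is a minimal sketch of the highlighting parameters for the FastVectorHighlighter. As far as I know, it reads `hl.tag.pre` and `hl.tag.post` rather than `hl.simple.pre` and `hl.simple.post`; please verify that against the documentation for your Solr version. The class and method names are hypothetical, and the sketch only builds a parameter map with the standard library, so it makes no claims about the SolrJ API beyond the `solrQuery.set` call shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FastVectorHighlightParams {
    // Builds the highlighting parameters for one field.
    // Assumption: the field is indexed with termVectors, termPositions,
    // and termOffsets enabled, as in the schema change above.
    static Map<String, String> build(String field, String pre, String post) {
        Map<String, String> p = new LinkedHashMap<String, String>();
        p.put("hl", "true");
        p.put("hl.fl", field);
        p.put("hl.useFastVectorHighlighter", "true");
        // Markers placed around each highlighted term (assumed FVH parameter names).
        p.put("hl.tag.pre", pre);
        p.put("hl.tag.post", post);
        return p;
    }

    public static void main(String[] args) {
        // Print each parameter as name=value.
        for (Map.Entry<String, String> e : build("content_en", "<b>", "</b>").entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Each entry can then be passed to the query in the same way as the flag above, for example `solrQuery.set(name, value)` for every pair in the map.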