I am using Solr 3.6.2 to extract snippets from documents that I know contain a specific string. (First of all, is that usage correct?) Unfortunately, I get snippets that do not contain my query string (a simple, single, non-stop word).
For example, for document 123456, which I know contains "funmitflags", I have a query of the type:
id:123456 and content_en:funmitflags
and
fl=id&hl=true&hl.fl=content_en&hl.snippets=2&hl.alternateField=content_en&hl.maxAlternateFieldLength=400&hl.maxAnalyzedCharacters=2147483647&hl.fragsize=400&rows=100
(I set "content_en" as the alternate field so that I always get some snippet from the document. This field usually holds a large amount of text.) However, I now usually get the first 400 characters back instead of the ones that contain my "funmitflags" word.
I can retrieve the document from the admin page; I just do not get a proper highlight. It is awkward, because I have this problem with about 75% of all queries.
In my schema.xml, "content_en" is defined with the type "text_en":
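To make the request above easier to read, here is a minimal Java sketch that assembles the same parameters into a query string using only the standard library. The class and method names are hypothetical, and sending the query as the `q` parameter is an assumption about how the request is issued. As far as I understand, Solr returns the first hl.maxAlternateFieldLength characters of hl.alternateField only when it cannot generate a snippet, so getting the first 400 characters back suggests that no highlight was found at all.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class HighlightRequest {
    /** Parameters copied verbatim from the question, kept in insertion order. */
    static Map<String, String> params() {
        Map<String, String> p = new LinkedHashMap<String, String>();
        p.put("q", "id:123456 and content_en:funmitflags"); // "q" is an assumption
        p.put("fl", "id");
        p.put("hl", "true");
        p.put("hl.fl", "content_en");
        p.put("hl.snippets", "2");
        p.put("hl.alternateField", "content_en");     // fallback when no snippet exists
        p.put("hl.maxAlternateFieldLength", "400");   // length of that fallback
        p.put("hl.maxAnalyzedCharacters", "2147483647");
        p.put("hl.fragsize", "400");
        p.put("rows", "100");
        return p;
    }

    /** Joins the parameters as key=value pairs, URL-encoding keys and values. */
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }
}
```

Calling `HighlightRequest.toQueryString(HighlightRequest.params())` gives a string that can be appended to the select handler URL of the core being queried.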
<field name="content_en" type="text_en" indexed="true" stored="true" />
I changed "text_en" from the original definition, to the following:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
After reindexing, I still get no correct snippets in either case. Can someone point me in the right direction? Should I always get a snippet containing my search word?
Thanks to @arun's experiment, which ruled out half of the possibilities, I found a solution.
As my texts are very large, I set the following in solrconfig.xml:
<maxFieldLength>1000000</maxFieldLength>
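My understanding is that maxFieldLength caps the number of tokens indexed per field, so in a very large text a word that occurs past the cap never reaches the index. The toy Java sketch below is not Solr code: it only imitates that truncation with a plain whitespace split (no stop words, stemming or word delimiting), and the class and method names are made up for illustration. It may help explain why raising the limit mattered here, but I have not verified that this was the only cause.

```java
import java.util.Arrays;
import java.util.List;

class TokenLimitDemo {
    /** Returns the first maxFieldLength whitespace-separated tokens of the text. */
    static List<String> indexedTokens(String text, int maxFieldLength) {
        String[] tokens = text.trim().split("\\s+");
        int kept = Math.min(tokens.length, maxFieldLength);
        return Arrays.asList(tokens).subList(0, kept);
    }

    /** True if the term survives the token cap, i.e. would still be indexed. */
    static boolean isIndexed(String text, String term, int maxFieldLength) {
        return indexedTokens(text, maxFieldLength).contains(term);
    }
}
```

For a four-word text whose last word is "funmitflags", `isIndexed` is false with a cap of 3 and true with a cap of 4.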
To speed things up, I started using the FastVectorHighlighter by adding
solrQuery.set("hl.useFastVectorHighlighter", true);
to my query. It seems that this disabled my highlightSimplePre and highlightSimplePost settings, but who cares.
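On the lost pre/post markers: as far as I know, the FastVectorHighlighter reads its markers from hl.tag.pre and hl.tag.post rather than from hl.simple.pre and hl.simple.post, which would explain why the simple settings stop having an effect. I have not checked this against 3.6.2 specifically, so treat the following Java sketch as an assumption to verify. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FastVectorHighlightParams {
    /**
     * Builds highlighting parameters for the FastVectorHighlighter.
     * Assumption: hl.tag.pre / hl.tag.post are honoured by the Solr version in use.
     */
    static Map<String, String> withFastVectorHighlighter(String preTag, String postTag) {
        Map<String, String> p = new LinkedHashMap<String, String>();
        p.put("hl", "true");
        p.put("hl.fl", "content_en");
        p.put("hl.useFastVectorHighlighter", "true");
        p.put("hl.tag.pre", preTag);   // marker inserted before a highlighted term
        p.put("hl.tag.post", postTag); // marker inserted after a highlighted term
        return p;
    }
}
```

With SolrJ, each entry could then be passed to the query object via `solrQuery.set(key, value)`, the same call used above.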
Also, I had to add the term* options to my content field:
` <field name="content_en" type="text_en" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />`