I am trying to optimize highlighting in my SOLR instance, as it seems to slow down queries by two orders of magnitude. I have a tokenized field that is indexed and stored, with the following definition:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\+" replacement="%2B"/>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\+" replacement="%2B"/>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Term vectors, positions and offsets are also generated for the field:
<field name="Events" type="text_general" multiValued="true" stored="true" indexed="true" termVectors="true" termPositions="true" termOffsets="true"/>
For the highlight component I use the default SOLR config. The query I try uses FastVectorHighlighter but still takes ~1500ms, which is awfully long for ~1000 docs with 10-20 values stored in the field per doc. Here is the query:
q=Events:http\://mydomain.com/resource/term/906&fq=(Document_Code:[*+TO+*])&hl.requireFieldMatch=true&facet=true&hl.simple.pre=<b>&hl.fl=*&hl=true&rows=10&version=2&fl=uri,Document_Type,Document_Title,Modification_Date,Study&hl.snippets=1&hl.useFastVectorHighlighter=true
What I find curious is that in the solr admin stats a single query generates 9146 requests to HtmlFormatter and GapFragmenter. Any thoughts on why this might be happening and how the performance of the highlighter can be improved?
It appears that the problem is caused by "hl.fl=*", which makes the DefaultSolrHighlighter iterate over a relatively large number of fields (in my index) for each document returned (10 max in my case). The highlighting work therefore grows with (number of documents) x (number of fields) rather than with the number of documents alone. Here is the relevant code snippet:
for (int i = 0; i < docs.size(); i++) {
    int docId = iterator.nextDoc();
    Document doc = searcher.doc(docId, fset);
    NamedList docSummaries = new SimpleOrderedMap();
    for (String fieldName : fieldNames) {
        fieldName = fieldName.trim();
        if( useFastVectorHighlighter( params, schema, fieldName ) )
            doHighlightingByFastVectorHighlighter( fvh, fieldQuery, req, docSummaries, docId, doc, fieldName );
        else
            doHighlightingByHighlighter( query, req, docSummaries, docId, doc, fieldName );
    }
    String printId = schema.printableUniqueKey(doc);
    fragments.add(printId == null ? null : printId, docSummaries);
}
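To make the cost concrete, here is a minimal sketch that mirrors the two nested loops and counts how many per-field highlighting calls one request triggers. It is not Solr code: the class and method names are made up for illustration, and it assumes every field costs one highlighter call per document.

```java
public class HighlightCost {
    // Hypothetical model of the loop above: one highlighter call
    // per (document, field) pair on the current result page.
    static long highlightCalls(int docsOnPage, int fieldsToHighlight) {
        long calls = 0;
        for (int d = 0; d < docsOnPage; d++) {
            for (int f = 0; f < fieldsToHighlight; f++) {
                calls++; // stands in for doHighlightingBy...(...)
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        // rows=10 and 20 highlighted fields, as in my setup
        System.out.println(highlightCalls(10, 20));
    }
}
```

With rows=10, every field added to the highlight list adds ten more highlighter calls per request, so trimming the list pays off directly.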
Reducing the number of fields should improve the behaviour greatly. However, in my case I cannot reduce it below 20 fields, so I will check whether enabling the FastVectorHighlighter for all of them improves the overall performance.
I was also wondering whether we could reduce this list even further by using some info from the matching docs (which are already available at this point).
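One client-side way to shrink the list, sketched here under assumptions: since the query sets hl.requireFieldMatch=true, only fields the query actually names should be able to produce highlights, so hl.fl could be built from those names instead of "*". The helper below is hypothetical (not part of Solr or SolrJ) and uses a deliberately naive regex. It only recognises plain "Field:" prefixes and would be fooled by colons inside quoted phrases, so treat it as a starting point rather than a query parser.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HighlightFields {
    // "Field:" at the start of the query, or after whitespace, "(", "+" or "-".
    // An escaped colon such as "http\:" is not matched, because the name
    // must be followed directly by ":".
    private static final Pattern FIELD_PREFIX =
            Pattern.compile("(?:^|[\\s(+\\-])([A-Za-z_][A-Za-z0-9_]*):");

    // Collects the field names referenced in q, in order of first appearance.
    static Set<String> referencedFields(String q) {
        Set<String> fields = new LinkedHashSet<>();
        Matcher m = FIELD_PREFIX.matcher(q);
        while (m.find()) {
            fields.add(m.group(1));
        }
        return fields;
    }

    // Builds a comma-separated value to send as hl.fl instead of "*".
    static String hlFl(String q) {
        return String.join(",", referencedFields(q));
    }

    public static void main(String[] args) {
        String q = "Events:http\\://mydomain.com/resource/term/906";
        System.out.println("hl.fl=" + hlFl(q));
    }
}
```

For the query in my question this yields a single field, Events, so the highlighter would be invoked once per document instead of once per field per document.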
Update
Using FastVectorHighlighter for all fields (setting termVectors, termPositions and termOffsets to true for all tokenized fields) did indeed improve the highlighting speed by an order of magnitude, so that all queries now run in under 1s. The price is index size, which roughly quadrupled (from 500M to 2G). There is also a problem with how the fragments for multivalued fields are generated, but the performance gain is large enough to accept both drawbacks.