
Solr dynamic field blowing up the index size


I recently upgraded from Solr 5.0 to Solr 6.4.1. My app runs fine, but the index is far larger under Solr 6: about 15GB in Solr 5 versus 300GB in Solr 6 for the same data. I can't work out what causes such a huge difference.

I have identified the field that is blowing up the index size. It is defined as follows.

<dynamicField name="*_note" type="text_general" indexed="true" stored="true" multiValued="true"  />

<field name="textproperty" type="text_general" indexed="true" stored="false" multiValued="true"  />
<copyField source="*_note" dest="textproperty"/>

When this field is commented out, the index size drops to less than 10GB.

The field is of type text_general, which is defined as follows.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="((?m)[a-z]+)'s" replacement="$1s" />
        <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.KStemFilterFactory" /> 
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt" />
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="((?m)[a-z]+)'s" replacement="$1s" />
        <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.KStemFilterFactory" /> 
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt" />
      </analyzer>
  </fieldType>

A few things I did to debug this issue:

  • I have made sure the field type definition is the same as the one I used in Solr 5 and that it is valid in version 6. This field type takes a list of stopwords to ignore during indexing, and I supplied the same list we used in Solr 5. I have verified that the path to this file is correct and that it loads fine in the Solr admin UI. When I analyse these fields in the "Analysis" tab of the admin UI, I can see the stopwords being filtered out. However, when I query for some of these stopwords I still get results back, which makes me think the stopwords are probably being indexed after all.

Any idea what could increase the size of index by so much in solr 6?


Solution

  • For anyone facing a similar issue: in my case, the field that inflated the index so disproportionately used a field type ("text_general") for which omitNorms does not default to true. Setting omitNorms="true" explicitly on the field fixed the problem. The following link is my related question on the Solr mailing list.

    http://search-lucene.com/m/Solr/eHNlagIB7209f1w1?subj=Fwd+Solr+dynamic+field+blowing+up+the+index+size
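
A minimal sketch of what that change could look like against the schema from the question. It assumes the attribute was added to the *_note dynamic field; the answer does not say exactly which declaration was edited, or whether textproperty needed the same treatment.

```xml
<!-- Assumption: omitNorms="true" is set on the dynamic field shown in the question.
     All other attributes are copied unchanged from the original declaration. -->
<dynamicField name="*_note" type="text_general" indexed="true" stored="true" multiValued="true" omitNorms="true" />
```

Omitting norms disables length normalization for the field, so relevance scoring on it may change. A schema change like this generally only applies to documents indexed after the change, so expect to reindex before comparing index sizes.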