Search code examples
solrsolrcloudsynonymstemming

stop synonyms.txt file Solr from being stemmed


In synonyms.txt file I have an entry marine => saltwater,marine but both the words are getting stemmed to 'saltwat', 'marin' respectively inspite of being in protected words file. Is there a way to avoid it?

schema.xml

 <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" analyzer="org.apache.lucene.analysis.en.EnglishAnalyzer" />
    </analyzer>
  </fieldType>

synonyms.txt

marine => saltwater,marine

protwords.txt

saltwater
marine

now when I do the analysis in admin panel and query for saltwat then saltwat | marin comes up. which means that saltwater did get stemmed to saltwat in synonyms.txt file saltwat | marin


Solution

  • The solr analysis works in the same sequence you declare it inside your fieldType definition in schema. So, if you declare any Stem filter after the Synonyms filter, it will be applied after the synonyms changes. If you don't want this, the SynonymsFilter should be configured after the StemFilter, for example:

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.ASCIIFoldingFilterFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EnglishPossessiveFilterFactory"/>
          <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
          <filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
          <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
          <filter class="solr.ASCIIFoldingFilterFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EnglishPossessiveFilterFactory"/>
          <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
          <filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
          <filter class="solr.PorterStemFilterFactory"/>
          <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        </analyzer>
      </fieldType>
    

    I recommend you to check Solr Analysis tool in your Solr Admin to check what's going on with your field in both indexing and querying time.

    Please share your schema if you need more help.