Solr facets ignore stopwords at query time


I am using Solr 4.6.0 and am trying to get the most frequent terms grouped by year. Because my stopwords may change often, I do not apply them at indexing time. Instead, all dynamic word lists (stopwords, protwords and synonyms) are applied at query time. But although the stopword list includes terms like "of" and "the", they still appear in the result list (see Results).

Question: How can I get faceted, stopword-filtered results if I use the StopFilterFactory only at query time?

Additional information

If I use the StopFilterFactory at indexing time, everything works as expected: terms like "of" and "the" are filtered out when I run my query.

I have also tested the fieldtype text_en with the Solr admin analysis tool, and the results are as expected: "of" and "the" are filtered out. Does that mean the SearchHandler somehow does not call the right analyzer?

Query

http://ip:port/solr/collection1/select?q=*:*&rows=0&facet=true&facet.pivot=year,text

Results

[..]
<lst name="facet_pivot">
  <arr name="year,text">
    <lst>
      <str name="field">year</str>
      <int name="value">2009</int>
      <int name="count">139</int>
      <arr name="pivot">
        <lst>
          <str name="field">text</str>
          <str name="value">of</str>
          <int name="count">135</int>
        </lst>
        <lst>
          <str name="field">text</str>
          <str name="value">the</str>
          <int name="count">135</int>
        </lst>
        <lst>
          <str name="field">text</str>
          <str name="value">and</str>
          <int name="count">123</int>
[..]

Schema.xml

<field name="year" type="int" indexed="true" stored="true" />
<field name="text" type="text_en" indexed="true" stored="true" multiValued="true" />
[..]
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

Solution

  • Isn't this caused by your query?

    http://ip:port/solr/collection1/select?q=*:*&rows=0&facet=true&facet.pivot=year,text
    

    From what I can see, you're searching for everything, so the stopwords are returned as well. If the query is passed to the analyzer, its filters only see

    *:*

    as the query, so there is nothing for the stop filter to remove from the query string. Facet counts are also built from the terms already stored in the index, so a filter that runs only at query time never touches them.

    If you really want to search for everything but without any stopwords, you can try a negative query. For this you need a different configuration that doesn't filter stopwords from the query, so that you can add the stopwords manually as negative clauses. You are then searching for everything but leaving out the results that contain the negated terms.

    But an easier (and in my opinion better) way to get what you want is to use a copy field in the field configuration, although this will increase your index size. What we do with our Solr is that, aside from the normal field, we have language-specific fields like text_en, text_de, text_es etc. A language detector detects the language and copies the field to the appropriate language field, where the correct stopword filter runs.

    You can do the same in your schema.xml: create a new field, text_en_filtered, copy the content of the text field into it, and filter the stopwords there at index time. Then you can search (or facet) on that field, which no longer contains any stopwords.

    <!-- multiValued="true" because the source field "text" is multi-valued -->
    <field name="text_en_filtered" type="text_en_filtered" indexed="true" stored="false" multiValued="true"/>
    <copyField source="text" dest="text_en_filtered"/>
    <fieldType name="text_en_filtered" class="solr.TextField" positionIncrementGap="100">
        <!-- Analyzer with stopword filtering goes here -->
    </fieldType>
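
    As a rough sketch, the placeholder analyzer in the fieldType above could be filled in like this. The filter chain is an assumption modelled on the text_en type from the question, with the StopFilterFactory moved so that it also runs at index time. The stopword file path is taken from the question; the rest may need adjusting for your setup.

    ```xml
    <fieldType name="text_en_filtered" class="solr.TextField" positionIncrementGap="100">
        <!-- A single analyzer without a type attribute is used for both indexing and querying -->
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <!-- Stopwords are dropped before they reach the index, so they cannot show up as facet values -->
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EnglishPossessiveFilterFactory"/>
            <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
    </fieldType>
    ```

    With a field like this in place, the pivot in the original query would point at the new field, for example facet.pivot=year,text_en_filtered. The trade-off is that any change to the stopword list only takes effect for documents that are reindexed afterwards.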