java, solr, tokenize

Solr wildcard search incorrect result


I get unexpected results from wildcard queries. I am using Solr 6.6.0 with the edismax handler in the Solr admin UI. The query firstNames:James returns results as expected, but when I add a wildcard (firstNames:James*), no results are found.

The firstNames field uses the default text_en fieldType with its default tokenizers and filters. When I run the same pair of queries with firstNames:Stephen and firstNames:Stephen*, I get results both with and without the wildcard.

Below is my field definition in schema.xml:

  <field name="firstNames" type="text_en" multiValued="true" indexed="true" stored="true"/>
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

Solution

  • When you run a wildcard query, the analysis chain is not invoked (well, that's a small lie: it is, but only for the components that are MultiTermAware, which usually means that LowerCaseFilter is the only thing still active).

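    Since you have a stemming filter and the possessive filter attached, the trailing s on James is removed (the Porter stemmer treats it as a plural ending), so the token jame is what gets stored in the index. A plain firstNames:James query still matches because the query analyzer stems it to jame as well. With a wildcard, however, the analysis chain is generally skipped at query time, so the query term is not stemmed.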

    When you make the query firstNames:James*, you ask Solr to "find any document that contains a token starting with james". Since the stored token is jame, no token starts with james, and nothing matches.

    When you try this with Stephen instead, neither the stemming filter nor the possessive filter removes the end of the word, so Stephen* looks for any token starting with stephen, and since that token is present (nothing was changed), a match is returned.

    The solution depends on your use case. There is no need for a stemming or possessive filter on a name field, since neither really makes sense for names (instead you might apply your own logic to match similar-ish names). Another option is to use an NGramFilter instead, effectively generating a token for each prefix and infix of the token (for foo: f, fo, foo, o, oo).
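
    As a rough sketch of both options, the schema could look something like the following. The fieldType names text_name and text_name_ngram are made up for illustration, and the gram sizes are assumptions you would tune for your data; the tokenizer and filter factories are the stock Solr ones.

    ```xml
    <!-- Option 1 (hypothetical name "text_name"): no stemming or possessive
         filter, so "James" is indexed as "james" and James* can match it. -->
    <fieldType name="text_name" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Option 2 (hypothetical name "text_name_ngram"): n-grams are generated
         at index time only, so a plain query such as firstNames:jam matches
         without any wildcard. Gram sizes here are assumed values. -->
    <fieldType name="text_name_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Point the field at whichever type fits your use case. -->
    <field name="firstNames" type="text_name" multiValued="true" indexed="true" stored="true"/>
    ```

    Either way, existing documents have to be reindexed after the change, since the tokens already in the index were produced by the old analysis chain. With the n-gram variant, keep in mind that the index grows noticeably and that query terms longer than maxGramSize will not match.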