I am getting unexpected results from wildcard queries. I am using Solr 6.6.0 with the edismax handler in the Solr UI. The query firstNames:James returns results as expected, but when I add a wildcard (firstNames:James*) no results are found.
For the firstNames field I use the default text_en fieldType with its default tokenizers and filters. When I run the same pair of queries for Stephen (firstNames:Stephen and firstNames:Stephen*), I get results for both the wildcard and the non-wildcard search. Below is the field definition from my schema.xml:
<field name="firstNames" type="text_en" multiValued="true" indexed="true" stored="true"/>
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
When you run a wildcard query, the analysis chain is not invoked (well, that's a small lie - it is, but only the components that are MultiTermAware, which usually means that LowerCaseFilter is the only thing still active).
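To make this concrete: Solr also lets a field type declare the wildcard analysis explicitly with an analyzer of type "multiterm". The fragment below is only a sketch of roughly what ends up being applied to a wildcard term for text_en when no such analyzer is declared; the chain Solr derives implicitly may differ in detail between versions, so treat it as an illustration rather than the exact internal behaviour.

```xml
<!-- Sketch (assumption): approximately the analysis a wildcard term such as
     James* receives for text_en. It would sit inside the <fieldType> element,
     next to the "index" and "query" analyzers. -->
<analyzer type="multiterm">
  <!-- The term is kept as a single token, not split or stemmed -->
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <!-- Lowercasing still applies, so James* becomes james* -->
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```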
Since you have a stemming filter attached, the trailing s on James is removed (the possessive filter only strips a trailing 's, so it plays no part here). For a wildcard query this only happens at index time (remember, when you're using a wildcard, the analysis chain is generally skipped at query time), so the token jame is what is stored in the index.
When you make the query firstNames:James*, you ask Solr to "find any document that contains tokens that start with james" (the term is still lowercased). Since what was stored is the token jame, no token starts with james, and nothing matches.
When you try this with Stephen instead, neither the stemming filter nor the possessive filter removes the end of the word, so Stephen* looks for any token starting with stephen, and since that token is present (nothing got changed), a match is returned.
The solution depends on your use case. A name field has no real need for a stemming or possessive filter, since neither makes sense for names (instead you might apply your own logic to match similar-ish names), so you can remove them and reindex. Another option is to use an NGramFilter instead, effectively generating a token for each prefix and infix of the original token (for foo: f, fo, foo, o, oo).
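As a rough sketch of both options, the two field types below could replace text_en for the firstNames field. The type names text_name and text_name_ngram are hypothetical, and the gram sizes are illustrative values rather than recommendations; adjust them to your data.

```xml
<!-- Option 1 (hypothetical type name): tokenize and lowercase only.
     "James" is indexed as "james", so firstNames:James* can match it. -->
<fieldType name="text_name" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Option 2 (hypothetical type name): n-grams are generated at index time only.
     Query terms are lowercased but not split into grams, so a plain query
     such as firstNames:jam matches without a wildcard. Query terms longer
     than maxGramSize will not match any indexed gram. -->
<fieldType name="text_name_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Either way, the field's type attribute has to point at the new type, and existing documents must be reindexed before the change takes effect, because the tokens already in the index were produced by the old chain.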