I want to understand whether Sunspot, in standard mode, matches whole words or arbitrary sequences of characters in full-text search, and how to make it match character sequences.
For example, I have the following setup:
    class User < ActiveRecord::Base
      searchable do
        text :email
      end
    end
With one User whose e-mail is "[email protected]", the following query:
    search = User.search do
      fulltext 'matsinopoulos'
    end
does not return any results, whereas:
    search = User.search do
      fulltext '[email protected]'
    end
returns the user.
Is there a configuration setting that makes Sunspot match sequences of characters instead of whole words?
Or am I doing something wrong?
You need to edit the file solr/conf/schema.xml.
The standard entry:
    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
has to be changed to:
    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory"
                minGramSize="3"
                maxGramSize="30"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
A very nice reference on Solr configuration can be found here:
http://techbot.me/2011/01/full-text-search-in-in-rails-with-sunspot-and-solr/
Be aware, though, that for partial word matching this reference uses the EdgeNGramFilterFactory, which indexes only the beginnings of words. To make Solr match any part of a word, the NGramFilterFactory needs to be used.
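The difference between the two filters can be illustrated in plain Ruby. This is only a sketch of the idea, not Solr's actual implementation: the helper names edge_ngrams and ngrams and the sample word are made up for illustration, and the edge variant assumes the filter's default behaviour of working from the front of the token.

```ruby
# Illustrative only: plain-Ruby approximations of what the two Solr filters
# emit for a single token. The helper names are hypothetical and are not part
# of Sunspot or Solr.

# Edge n-grams: only prefixes of the token, from min up to max characters.
def edge_ngrams(token, min, max)
  upper = [max, token.length].min
  (min..upper).map { |len| token[0, len] }
end

# N-grams: every substring of the token whose length is between min and max.
def ngrams(token, min, max)
  upper = [max, token.length].min
  (min..upper).flat_map do |len|
    (0..token.length - len).map { |start| token[start, len] }
  end
end

word = "sunspot"
puts edge_ngrams(word, 3, 30).inspect          # prefixes only
puts ngrams(word, 3, 30).include?("spot")      # inner substring is present
puts edge_ngrams(word, 3, 30).include?("spot") # inner substring is absent
```

With edge n-grams a search for "spot" has nothing to match against, because only "sun", "suns", and so on are indexed; with full n-grams the inner substring is indexed as well.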
Note also that minGramSize is set to 3 and maxGramSize to 30, so search terms shorter than 3 or longer than 30 characters will not match anything.
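As a rough model of these limits, the check below treats a query term as matching only if it equals one of the indexed grams, that is, if it is a substring of the indexed token and its length falls within the configured range. The method name gram_match? and the sample token are assumptions for illustration; the sketch ignores tokenization of the query and is not how Solr evaluates queries internally.

```ruby
# Rough, hypothetical model of the schema above: a query term can match only
# if it equals one of the indexed n-grams, i.e. it is a substring of the
# token and its length lies between the minimum and maximum gram size.
MIN_GRAM = 3
MAX_GRAM = 30

def gram_match?(token, term, min = MIN_GRAM, max = MAX_GRAM)
  term = term.downcase # mirrors LowerCaseFilterFactory on the query side
  term.length.between?(min, max) && token.downcase.include?(term)
end

token = "fulltextsearch"
puts gram_match?(token, "text") # 4 characters, inside the allowed range
puts gram_match?(token, "ex")   # 2 characters, below minGramSize
```

If shorter search terms need to match, minGramSize can be lowered, at the cost of a larger index.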