How can I set Sunspot to search for sequences of characters instead of words?


I want to understand whether Sunspot, in its default configuration, matches whole words or arbitrary sequences of characters in full-text search, and how to make it match sequences.

For example, I have the following setup:

class User < ActiveRecord::Base
   searchable do
      text :email
   end
end

and a single User whose e-mail is "[email protected]".

The following query:

search = User.search do 
   fulltext 'matsinopoulos'
end

returns no results, whereas:

search = User.search do
   fulltext '[email protected]'
end

returns the user.

Is there a configuration setting that makes Sunspot match sequences of characters instead of words?

Or am I doing something wrong?


Solution

  • You need to edit the file:

    solr/conf/schema.xml
    

    The standard entry:

    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    

    has to be changed to:

    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory"
                minGramSize="3"
                maxGramSize="30"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    

    A very nice reference on Solr configuration can be found here:

    http://techbot.me/2011/01/full-text-search-in-in-rails-with-sunspot-and-solr/

    Note, however, that for partial-word matching this reference uses the EdgeNGramFilterFactory, which indexes only the beginnings of words. To make Solr match any part of a word, use the NGramFilterFactory instead.

    Note also that minGramSize is set to 3 and maxGramSize to 30, so only substrings of 3 to 30 characters are indexed. Query terms shorter than 3 or longer than 30 characters will therefore not match anything.
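
To make the difference between the two filters and the effect of the gram-size limits more concrete, here is a minimal Ruby sketch. It is not Solr's implementation, only a simplified model of which substrings each filter would emit for a single lowercased token. The helper names `ngrams` and `edge_ngrams` and the address `user@example.com` are hypothetical and used purely for illustration; the sketch also assumes the tokenizer keeps the whole address as one token.

```ruby
# Simplified model of an n-gram filter: every substring of the token
# whose length is between min and max (inclusive).
def ngrams(token, min, max)
  grams = []
  (min..max).each do |size|
    break if size > token.length
    (0..token.length - size).each { |start| grams << token[start, size] }
  end
  grams
end

# Simplified model of an edge n-gram filter: only prefixes of the token
# whose length is between min and max (inclusive).
def edge_ngrams(token, min, max)
  upper = [max, token.length].min
  (min..upper).map { |size| token[0, size] }
end

token = "user@example.com" # hypothetical address, assumed to be one token

indexed = ngrams(token, 3, 30)      # what the "index" analyzer would store
edge    = edge_ngrams(token, 3, 30) # what an edge n-gram filter would store

puts indexed.include?("example") # true:  inner substrings are indexed
puts edge.include?("example")    # false: only prefixes are indexed
puts edge.include?("user")       # true:  a prefix of the token
puts indexed.include?("ex")      # false: shorter than minGramSize
```

Because the query analyzer in the configuration above does not split the query term into grams, a query matches only if the whole lowercased term appears among the indexed grams, which is why the length limits apply to the query term itself.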