solr, sunspot, sunspot-rails, sunspot-solr

How to make Solr search by short words?


I've got an item that says "4k display" and when I search for "4k display" that item does not seem to be prioritized and other items with "display" (without 4k) come up.

If I search for "4k" nothing shows up.

What in the config should I change to remedy this?

Update: This is what the text field type looks like, likely set up by the sunspot gem.

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!--<filter class="solr.StandardFilterFactory"/>-->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!--<filter class="solr.KStemFilterFactory"/>-->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="7"/>
  </analyzer>
</fieldType>

The minGramSize looks like the culprit?


Solution

  • So let's walk through your analysis chain. First comes the StandardTokenizer, which splits the text on word boundaries such as whitespace. So "4k display" is split into two tokens:

    4k,display

    Next is the LowerCaseFilter, which lowercases the tokens. Nothing changes here because the text is already lowercase, so at the end of this step you still have the same two tokens:

    4k,display

    Now comes the NGramFilterFactory, which breaks each token into character n-grams. For example, with no size limits a token "abcd" would produce these grams:

    a,ab,abc,abcd,b, bc,bcd,c,cd,d
    

    But the filter in your field type also sets these options:

    minGramSize="3" maxGramSize="7"

    This means only grams with a minimum length of 3 and a maximum length of 7 are kept. So in the above example you will only see:

    abc,abcd,bcd

    With me so far?

    Now let's apply this to your case. After the lowercase filter we had two tokens:

    4k,display

    Applying NGram to both, before any size limits, would produce the following:

    4,4k,k,d,di,dis,disp,displ,displa,display,i,is,isp and so on. You get the idea.

    But since minGramSize is 3, the grams 4, k and 4k are all too short and are dropped, so nothing from "4k" reaches your index. That is why you cannot search for 4k: it was never in the index.

    Your index only has the grams of 3 to 7 characters taken from "display", for example:

    dis,disp,displ,displa,display
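
    To make the effect of minGramSize concrete, here is a sketch of the same field type with the minimum lowered to 2. This is only an illustration of the setting, not a recommendation: with a minimum of 2 the gram "4k" would be kept in the index, but so would every two-character gram of every other token (di, is, sp, ...), which makes the index larger and matching looser. The value 2 is my assumption; everything else is copied from your config.

    ```xml
    <!-- Sketch only: same chain as in the question, with minGramSize lowered from 3 to 2 -->
    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- With minGramSize="2", the two-character token "4k" yields the gram "4k" -->
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="7"/>
      </analyzer>
    </fieldType>
    ```

    As with any change to an index-time analyzer, existing documents would need to be reindexed before the new grams show up.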

    To fix this, you first need to decide how you want to search your data.

    Do you really need the NGramFilterFactory?

    For example, if you just want exact token matches, so that a query for "4k display" only returns results containing "4k", "display" or both, then you need to change your analysis chain.

    In that case, comment out the NGram filter in your analysis chain, reindex, and try querying again.
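
    As a rough sketch, assuming the field type is the one shown in your question and nothing else in the schema depends on it, the change could look like this:

    ```xml
    <!-- Sketch only: the "text" field type from the question with the NGram filter disabled -->
    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!--<filter class="solr.StandardFilterFactory"/>-->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!--<filter class="solr.KStemFilterFactory"/>-->
        <!-- Disabled so that whole tokens such as "4k" and "display" are indexed unchanged -->
        <!--<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="7"/>-->
      </analyzer>
    </fieldType>
    ```

    With this chain the index holds the whole tokens 4k and display, so a query for "4k" can match. The trade-off is that partial input such as "disp" would no longer match "display", because no grams are generated any more.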