solr, lucene, solr4, stemming

How to do an exact search on a field that uses KeywordTokenizer and a stemming filter


I want to do an exact match on a field which is stemmed. For example, my data contains the value "Babysitters at work", and the field is defined as:

<fieldType name="string_ci_stem" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"/>
    </analyzer>
</fieldType>

The value gets indexed as "babysitters at work" instead of "babysit at work". From what I have seen, Solr stems only the last word of the sentence when KeywordTokenizer is used.

Is there a way to index "Babysitters at work" as "babysit at work", such that:

  • "babysit at work" returns the result
  • "babysit work" does not return the result

Are there any other schema.xml definitions that would help to achieve this?

Any help will be appreciated.

Edit : Updated the question.


Solution

  • KeywordTokenizerFactory is not designed for this use: it indexes the whole input as a single token, without splitting the text into separate tokens such as "Babysitters", "at", "work". That is also why only the end of the value appears to be stemmed: the stemmer sees the entire string as one word. You'll get the per-word stemming you want with solr.StandardTokenizerFactory instead of solr.KeywordTokenizerFactory. More info here: https://cwiki.apache.org/confluence/display/solr/Tokenizers

    Then, if you want to run a single-term (exact) query, you'll have to concatenate the emitted tokens back into one token. I don't know whether a filter like this is available in Solr, but it should be fairly easy to write your own based on this thread: http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-concatenation-filter-td3711094.html

    1. Babysitters at work -> StandardTokenizer -> "babysitters" "at" "work"
    2. "babysitters" "at" "work" -> stemming -> "babysit" "at" "work"
    3. "babysit" "at" "work" -> Your Concatenate Filter -> "babysit at work"
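
    The three steps above could be wired into schema.xml roughly as follows. This is only a sketch: the field type name "text_stem_concat" and the class com.example.ConcatenateFilterFactory are hypothetical placeholders, and the concatenating filter itself is assumed to be a custom plugin that you write and put on Solr's classpath.

    ```xml
    <!-- Sketch only: "text_stem_concat" and com.example.ConcatenateFilterFactory are
         hypothetical names; the concatenating filter must be supplied as a custom plugin. -->
    <fieldType name="text_stem_concat" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <!-- An analyzer without a type attribute is used for both indexing and querying -->
        <analyzer>
            <!-- 1. Split the input into word tokens -->
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <!-- 2. Stem each token individually -->
            <filter class="solr.SnowballPorterFilterFactory" language="English"/>
            <!-- 3. Hypothetical custom filter: join the stemmed tokens back into one token -->
            <filter class="com.example.ConcatenateFilterFactory"/>
        </analyzer>
    </fieldType>
    ```

    Because the same chain runs at index and query time, a query such as "babysit work" should end up as a different single token than the indexed value and so should not match, while a query that stems to the same words in the same order should.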