solr, wildcard, fuzzy-search

Fuzzy search a part of the whole text in Solr


I have the following field declaration for my Solr index:

<field name="description" type="text_ci" indexed="true" multiValued="false" required="true"/>

Field type:

<fieldType name="text_ci" class="solr.TextField" omitNorms="true" sortMissingLast="true">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType> 

In this index I have documents whose description value looks like "Accomodation in {city}" (each one has a different city).

I want to run a fuzzy search so that a misspelled query such as acomodation~2 still returns results, but I find this difficult because "accomodation" is only a part of the indexed text.

I am thinking of using NGramFilter to tokenize the input, but I am not sure whether this is the right approach or how to implement it.

Do you know what I can do?


Solution

  • Lucene supports fuzzy searches based on the Levenshtein distance (edit distance) algorithm. To run a fuzzy search, append the tilde symbol, "~", to a single-word term.

    I don't see a need for NGramFilter here.
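
    To make the edit-distance idea concrete, here is a minimal Python sketch of the classic Levenshtein distance. It is only an illustration, not Lucene's implementation (which works on automata and is not reproduced here); the function name levenshtein and the sample city "paris" are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes, or substitutions turning a into b."""
    # previous[j] is the distance between the processed prefix of a and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning i chars of a into an empty string takes i deletes
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching char
            ))
        previous = current
    return previous[-1]


# The misspelled query is one edit away from the indexed word...
print(levenshtein("acomodation", "accomodation"))           # 1
# ...but far away from the whole (hypothetical) description value.
print(levenshtein("acomodation", "accomodation in paris"))  # 10
```

    The second call shows why a fuzzy query on one word cannot match when the whole description is stored as a single term: the distance is far above the small edit distances a fuzzy query allows.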

    The ~ operator runs a fuzzy search. Add it after each term you want matched fuzzily; you can optionally follow it with a maximum edit distance, as below.

    {FIELD_NAME}:{TERM}~{EDIT_DISTANCE}
    

    Your request will look like this:

    http://localhost:8983/solr/FuzzySearchExample/select?indent=on&q=desc:Samsu~&wt=json&fl=id,desc
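
    If you build the request from code, a small helper can assemble the fuzzy term and the URL. This is a sketch using only Python's standard library; the helper names fuzzy_term and select_url are made up for this example, the core name is taken from the request above, and the field name description comes from the question. The 0 to 2 range check reflects the maximum edit distance Lucene's fuzzy queries support.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def fuzzy_term(field, term, max_edits=None):
    """Build FIELD:TERM~ or FIELD:TERM~N (hypothetical helper for this example)."""
    if max_edits is None:
        return f"{field}:{term}~"
    if max_edits not in (0, 1, 2):
        raise ValueError("edit distance must be 0, 1 or 2")
    return f"{field}:{term}~{max_edits}"


def select_url(core_url, query, fields):
    """Assemble a /select URL with the same parameters as the request above."""
    params = {"indent": "on", "q": query, "wt": "json", "fl": ",".join(fields)}
    return f"{core_url}/select?{urlencode(params)}"


url = select_url(
    "http://localhost:8983/solr/FuzzySearchExample",
    fuzzy_term("description", "acomodation", 2),
    ["id", "description"],
)
print(url)

# Decoding the query string gives back the fuzzy term unchanged.
print(parse_qs(urlsplit(url).query)["q"][0])  # description:acomodation~2
```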
    

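    I used the field type below. Unlike yours, its index analyzer uses StandardTokenizerFactory, so each word of the description is indexed as a separate term that a single-word fuzzy query can match.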

    <fieldType name="text_ci" class="solr.TextField" omitNorms="true" sortMissingLast="true">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
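
    The effect of the index-time tokenizer can be imitated in a few lines of Python. This is a rough simulation under stated assumptions, not Solr code: keyword_terms stands in for KeywordTokenizerFactory plus lowercasing, word_terms approximates StandardTokenizerFactory with a plain whitespace split, fuzzy_hits is a hypothetical stand-in for a fuzzy term query, and "Paris" is a made-up city.

```python
def edit_distance(a, b):
    """Compact Levenshtein distance, used only for this simulation."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new_row = [i]
        for j, cb in enumerate(b, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1]


def keyword_terms(text):
    # Whole value becomes one lowercased term, like the field type in the question.
    return [text.lower()]


def word_terms(text):
    # One lowercased term per word; a whitespace split is a simplification.
    return text.lower().split()


def fuzzy_hits(query, terms, max_edits=2):
    # Keep the indexed terms within max_edits of the query term.
    return [t for t in terms if edit_distance(query.lower(), t) <= max_edits]


doc = "Accomodation in Paris"
print(fuzzy_hits("acomodation", keyword_terms(doc)))  # []
print(fuzzy_hits("acomodation", word_terms(doc)))     # ['accomodation']
```

    With one term per word the misspelled query finds the document; with the whole value as a single term it does not.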
    

    I get the response below for acomodation~2 or acomodation~1:

    [Screenshot of the Solr query page]

    And I get the response below for acomodation (without the ~ operator):

    [Screenshot of the Solr query page, second query]