
Solr Fuzzy search (max 2 edits)


I am using Solr 6.0.0.

My collection uses the data-driven configuration, and most of the configuration is standard.

I have a document in Solr with

name:"aquickbrownfox"

Now if I do a fuzzy search like:

name:aquickbrownfo~0.7 OR name:aquickbrownf~0.7

It lists out the record in the results.

But if I do a search like:

name:aquickbrown~0.7

It does not list the record.

Does this have something to do with maxEdits in solrconfig.xml, which is set to 2?

I tried increasing it, but then I could not create a collection with that configuration. It failed with this error:

ERROR: Error CREATEing SolrCore 'my-search': Unable to create core [my-search] Caused by: Invalid maxEdits

A maximum of 2 edits seems to be a serious limitation. I wonder what the point is of passing a fractional value after the ~ operator.

My use case:

I have a contact database, and I need to detect duplicates based on three parameters: Name, Email and Phone. I rely on Solr's fuzzy search for this. Email and Phone are relatively easy to handle with simple assumptions, but Name is trickier. I plan to run a fuzzy search for each word in the Name, and I expected the optional parameter after ~ to work without the maxEdits distance limitation.


Solution

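  • The documentation no longer suggests using a fractional value after the tilde. The optional parameter is now an integer number of edits between 0 and 2, and the older floating-point syntax is deprecated. See http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Fuzzy_Searches for more information.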

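    However, you are correct that a fuzzy search allows at most 2 edits to the search string. That explains your results: aquickbrownfo and aquickbrownf are 1 and 2 edits away from aquickbrownfox, so they match, while aquickbrown is 3 edits away and does not. I would guess this limitation strikes a balance between efficiency and usefulness.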

    The maxEdits parameter in solrconfig.xml applies to the DirectSpellChecker configuration, and doesn't affect your searching, unless you're using the spell checker.

    For your use case, your best approach may be to index the name field twice, using different field configurations: one using a simple set of analyzers and filters (e.g. StandardTokenizerFactory, StandardFilterFactory, LowerCaseFilterFactory), and the other using a phonetic matcher such as the Beider-Morse filter. You can use the first field for fuzzy searches, and the second to find names that are spelled differently but sound the same as the name being checked.
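
To make the 2-edit limit concrete, here is a minimal Python sketch that computes the edit distance for the three queries in the question and builds a per-word fuzzy query string. The helper names (levenshtein, fuzzy_name_query) and the field name "name" are illustrative assumptions, not part of Solr. The sketch uses plain Levenshtein distance, whereas Lucene by default also counts a transposition of two adjacent characters as a single edit. It also assumes the name contains only letters and spaces, so no query-syntax escaping is done.

```python
MAX_EDITS = 2  # largest edit distance a Lucene/Solr fuzzy query accepts


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_name_query(name: str, field: str = "name", edits: int = 2) -> str:
    """Build one fuzzy clause per word (hypothetical helper).

    The requested edit count is clamped to the 0..MAX_EDITS range.
    Words are lowercased on the assumption that the field is indexed
    with a lowercase filter.
    """
    edits = max(0, min(edits, MAX_EDITS))
    clauses = [f"{field}:{word.lower()}~{edits}" for word in name.split()]
    return " AND ".join(clauses)


if __name__ == "__main__":
    indexed = "aquickbrownfox"
    for query in ("aquickbrownfo", "aquickbrownf", "aquickbrown"):
        distance = levenshtein(indexed, query)
        verdict = "can match" if distance <= MAX_EDITS else "cannot match"
        print(f"{query}: {distance} edit(s), {verdict}")
    print(fuzzy_name_query("John Smith"))
```

Whether to join the clauses with AND or OR depends on how strict the duplicate check should be. AND is used here only as one possible choice.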