solr lucene levenshtein-distance fuzzy-search

Fuzzy search with distance 1 does not work for other languages in Solr

I have documents with fields name_en, name_de, name_fr, etc. They contain the word cutter in English and mutter in German. If I run a fuzzy search with name_en:cuter~1 (with only one t), it works fine, but if I search for name_de:muter~1, it does not return any results.

However, it works with fuzzy distance 2: name_de:muter~2 works correctly and returns mutter. The languages have different analyzers in schema.xml, so that is presumably the difference. But it is still not clear why distance 1 does not work for German.

Here is the index analyzer config for German:

<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.ManagedStopFilterFactory" managed="de" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.ShingleFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
  <filter class="solr.GermanStemFilterFactory" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>

Could someone explain why the required distance is 2 and not 1? As far as I can tell, the distance between mutter and muter is 1, not 2.

Solution

This happens because mutter is truncated by the German stemmer and gets indexed as mutt, whereas cutter appears to be left untouched by most English stemmers (tested with the Porter and Snowball/Porter2 algorithms, known to be the most aggressive). The fuzzy query is matched against the indexed terms, not the original words:

The edit distance for cuter to match cutter is 1 (insert one t).
The edit distance for muter to match mutt is 2 (replace e with t, then delete r).

In order to make the fuzzy search work as expected, you need to preserve the original (unstemmed) tokens in the analysis chain so that they get indexed too and thus can be matched properly by the distance algorithm at query time.

A simple solution is to use the KeywordRepeatFilterFactory, placed before the stemmer, so that the unstemmed tokens are preserved and indexed at the same position as the stemmed ones. Otherwise, you would have to index the text into a separate field whose field type does no stemming.

You might also have the same kind of issues with wildcard queries, for the same reason, and the solutions would be the same.

NB: I noticed you are using a shingle filter. It is important to place the keyword repeater after the shingle filter, so that the repeated unigrams can be stemmed and the repeated shingles removed by the duplicate filter. Otherwise, shingles would be built from repeated keywords.
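
Putting the placement described above together, the German index analyzer could look roughly like the sketch below. It reuses the filters from the question and only adds solr.KeywordRepeatFilterFactory. Putting it right after the KeywordMarkerFilterFactory is an assumption on my part; what matters is that it comes after the shingle filter and before the stemmer. You need to reindex after changing the chain.

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.ManagedStopFilterFactory" managed="de" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.ShingleFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
  <!-- Emits each token twice at the same position: one copy flagged as a
       keyword (skipped by the stemmer), one copy left free to be stemmed -->
  <filter class="solr.KeywordRepeatFilterFactory" />
  <filter class="solr.GermanStemFilterFactory" />
  <!-- Drops the second copy when stemming left the token unchanged -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
```

With this chain, mutter should be indexed both as mutter and as mutt, so name_de:muter~1 can match the unstemmed term at distance 1.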