Can someone explain me how this function of the library fuzzywuzzy in Python works? I know how the Levenshtein distance works but I don't understand how the ratio is computed.
b = fuzz.token_sort_ratio('controlled', 'comparative')
The result is 38
As you probably already know the Levenshtein distance is the minimum amount of insertions / deletions / substitutions to convert one sequence into another sequence. It can be normalized as dist / max_dist
, where max_dist
is the maximum distance possible given the two sequence lengths. In the case of the Levenshtein distance this results in the normalization dist / max(len(s1), len(s2))
. In addition a normalized similarity can be calculated by inverting this: 1 - normalized distance
.
>>> from rapidfuzz.distance import Levenshtein
>>> Levenshtein.distance('controlled', 'comparative')
8
>>> Levenshtein.similarity('controlled', 'comparative')
3
>>> Levenshtein.normalized_distance('controlled', 'comparative')
0.7272727272727273
>>> Levenshtein.normalized_similarity('controlled', 'comparative')
0.2727272727272727
The Indel distance is the minimum amount of insertions / deletions to convert one sequence into another sequence. So it behaves similar to the Levenshtein distance, but does not allow substitutions. Since any substitution can be replaced by a insertion + deletion this can be calculated e.g. by modifying the cost of substitutions in the Levenshtein distance to 2. The Indel distance can be normalized in a similar way to the Levenshtein distance, but uses a different max_dist
: dist / (len(s1) + len(s2))
.
>>> from rapidfuzz.distance import Indel
>>> Indel.distance('controlled', 'comparative')
13
>>> Indel.similarity('controlled', 'comparative')
8
>>> Indel.normalized_distance('controlled', 'comparative')
0.6190476190476191
>>> Indel.normalized_similarity('controlled', 'comparative')
0.38095238095238093
The ratio
in fuzzywuzzy
/thefuzz
/rapidfuzz
is the normalized indel similarity scaled to 100.
>>> from rapidfuzz.distance import Indel
>>> from rapidfuzz import fuzz
>>> Indel.normalized_similarity('controlled', 'comparative') * 100
38.095238095238095
>>> fuzz.ratio('controlled', 'comparative')
38.095238095238095
The only difference in fuzzywuzzy
/thefuzz
is, that results are rounded:
>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio('controlled', 'comparative')
38
token_sort_ratio
is a variant of ratio
, which sorts the words in both sequences before comparing them:
s1 = " ".join(sorted(s1.split()))
s2 = " ".join(sorted(s2.split()))
fuzz.ratio(s1, s2)
In your example token_sort_ratio
will have the same result as ratio
, since both sequences are already sorted.