Tags: python, levenshtein-distance

How does token_sort_ratio work?


Can someone explain to me how this function of the fuzzywuzzy library in Python works? I know how the Levenshtein distance works, but I don't understand how the ratio is computed.

from fuzzywuzzy import fuzz

b = fuzz.token_sort_ratio('controlled', 'comparative')

The result is 38


Solution

    Levenshtein distance

    As you probably already know, the Levenshtein distance is the minimum number of insertions / deletions / substitutions required to convert one sequence into another. It can be normalized as dist / max_dist, where max_dist is the maximum distance possible given the two sequence lengths. For the Levenshtein distance this gives the normalization dist / max(len(s1), len(s2)). A normalized similarity can then be obtained by inverting this: 1 - normalized_distance.

    >>> from rapidfuzz.distance import Levenshtein
    >>> Levenshtein.distance('controlled', 'comparative')
    8
    >>> Levenshtein.similarity('controlled', 'comparative')
    3
    >>> Levenshtein.normalized_distance('controlled', 'comparative')
    0.7272727272727273
    >>> Levenshtein.normalized_similarity('controlled', 'comparative')
    0.2727272727272727
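
    These normalized values can be reproduced by hand from the raw distance and the formula above:

    >>> dist = 8                                               # Levenshtein.distance('controlled', 'comparative')
    >>> max_dist = max(len('controlled'), len('comparative'))  # 11
    >>> dist / max_dist
    0.7272727272727273
    >>> 1 - dist / max_dist
    0.2727272727272727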
    

    Indel distance

    The Indel distance is the minimum number of insertions / deletions needed to convert one sequence into another. It behaves similarly to the Levenshtein distance, but does not allow substitutions. Since any substitution can be replaced by an insertion + a deletion, it can be calculated e.g. by setting the substitution cost in the Levenshtein distance to 2 (a quick check of this is sketched below). The Indel distance can be normalized in a similar way to the Levenshtein distance, but uses a different max_dist: dist / (len(s1) + len(s2)).

    >>> from rapidfuzz.distance import Indel
    >>> Indel.distance('controlled', 'comparative')
    13
    >>> Indel.similarity('controlled', 'comparative')
    8
    >>> Indel.normalized_distance('controlled', 'comparative')
    0.6190476190476191
    >>> Indel.normalized_similarity('controlled', 'comparative')
    0.38095238095238093
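
    As a quick sanity check of the statement above, the same numbers can be obtained via the optional weights=(insertion, deletion, substitution) argument of Levenshtein.distance, and the normalized values again follow from the raw distance and the two lengths (10 + 11 = 21):

    >>> from rapidfuzz.distance import Levenshtein
    >>> Levenshtein.distance('controlled', 'comparative', weights=(1, 1, 2))
    13
    >>> 13 / 21        # normalized distance
    0.6190476190476191
    >>> 8 / 21         # normalized similarity: (21 - 13) / 21
    0.38095238095238093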
    

    ratio

    The ratio in fuzzywuzzy/thefuzz/rapidfuzz is the normalized Indel similarity scaled to the range 0–100.

    >>> from rapidfuzz.distance import Indel
    >>> from rapidfuzz import fuzz
    >>> Indel.normalized_similarity('controlled', 'comparative') * 100
    38.095238095238095
    >>> fuzz.ratio('controlled', 'comparative')
    38.095238095238095
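
    In other words, the score from the question can be reproduced by hand from the Indel distance and the two string lengths:

    >>> # ratio = 100 * (1 - indel_distance / (len(s1) + len(s2)))
    >>> 100 * (1 - 13 / (10 + 11))
    38.095238095238095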
    

    The only difference in fuzzywuzzy/thefuzz is that the result is rounded:

    >>> from fuzzywuzzy import fuzz
    >>> fuzz.ratio('controlled', 'comparative')
    38
    

    token_sort_ratio

    token_sort_ratio is a variant of ratio that sorts the words in both sequences before comparing them:

    s1 = " ".join(sorted(s1.split()))
    s2 = " ".join(sorted(s2.split()))
    fuzz.ratio(s1, s2)
    

    In your example, token_sort_ratio gives the same result as ratio, since each input is a single word and is therefore already "sorted".
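
    To see a case where the sorting actually matters, take two made-up strings that contain the same words in a different order: the plain ratio is penalized by the different word order, while token_sort_ratio is not:

    >>> from rapidfuzz import fuzz
    >>> s1 = 'wuzzy fuzzy was a bear'
    >>> s2 = 'fuzzy wuzzy was a bear'
    >>> fuzz.token_sort_ratio(s1, s2)   # both sort to 'a bear fuzzy was wuzzy'
    100.0
    >>> fuzz.ratio(s1, s2) < 100        # the unsorted strings differ, so ratio stays below 100
    True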