Tags: python, levenshtein-distance

How does token_sort_ratio work?


Can someone explain to me how this function of the fuzzywuzzy library in Python works? I know how the Levenshtein distance works, but I don't understand how the ratio is computed.

from fuzzywuzzy import fuzz

b = fuzz.token_sort_ratio('controlled', 'comparative')

The result is 38


Solution

    Levenshtein distance

    As you probably already know, the Levenshtein distance is the minimum number of insertions / deletions / substitutions required to convert one sequence into another. It can be normalized as dist / max_dist, where max_dist is the maximum distance possible given the two sequence lengths. For the Levenshtein distance this gives the normalization dist / max(len(s1), len(s2)). A normalized similarity can then be obtained by inverting this: 1 - normalized_distance.

    >>> from rapidfuzz.distance import Levenshtein
    >>> Levenshtein.distance('controlled', 'comparative')
    8
    >>> Levenshtein.similarity('controlled', 'comparative')
    3
    >>> Levenshtein.normalized_distance('controlled', 'comparative')
    0.7272727272727273
    >>> Levenshtein.normalized_similarity('controlled', 'comparative')
    0.2727272727272727
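
    These normalized values can be reproduced by hand from the raw distance and the formula above:

    >>> dist = 8                                               # Levenshtein.distance('controlled', 'comparative')
    >>> max_dist = max(len('controlled'), len('comparative'))  # 11
    >>> dist / max_dist
    0.7272727272727273
    >>> 1 - dist / max_dist
    0.2727272727272727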
    

    Indel distance

    The Indel distance is the minimum number of insertions / deletions needed to convert one sequence into another. It behaves similarly to the Levenshtein distance, but does not allow substitutions. Since any substitution can be replaced by an insertion + a deletion, it can be calculated e.g. by setting the substitution cost in the Levenshtein distance to 2 (a quick check of this is sketched below). The Indel distance can be normalized in a similar way to the Levenshtein distance, but uses a different max_dist: dist / (len(s1) + len(s2)).

    >>> from rapidfuzz.distance import Indel
    >>> Indel.distance('controlled', 'comparative')
    13
    >>> Indel.similarity('controlled', 'comparative')
    8
    >>> Indel.normalized_distance('controlled', 'comparative')
    0.6190476190476191
    >>> Indel.normalized_similarity('controlled', 'comparative')
    0.38095238095238093
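
    As a quick sanity check of the statement above, the same numbers can be obtained via the optional weights=(insertion, deletion, substitution) argument of Levenshtein.distance, and the normalized values again follow from the raw distance and the two lengths (10 + 11 = 21):

    >>> from rapidfuzz.distance import Levenshtein
    >>> Levenshtein.distance('controlled', 'comparative', weights=(1, 1, 2))
    13
    >>> 13 / 21        # normalized distance
    0.6190476190476191
    >>> 8 / 21         # normalized similarity: (21 - 13) / 21
    0.38095238095238093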
    

    ratio

    The ratio in fuzzywuzzy/thefuzz/rapidfuzz is the normalized Indel similarity scaled to the range 0–100.

    >>> from rapidfuzz.distance import Indel
    >>> from rapidfuzz import fuzz
    >>> Indel.normalized_similarity('controlled', 'comparative') * 100
    38.095238095238095
    >>> fuzz.ratio('controlled', 'comparative')
    38.095238095238095
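
    In other words, the score from the question can be reproduced by hand from the Indel distance and the two string lengths:

    >>> # ratio = 100 * (1 - indel_distance / (len(s1) + len(s2)))
    >>> 100 * (1 - 13 / (10 + 11))
    38.095238095238095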
    

    The only difference in fuzzywuzzy/thefuzz is that the result is rounded:

    >>> from fuzzywuzzy import fuzz
    >>> fuzz.ratio('controlled', 'comparative')
    38
    

    token_sort_ratio

    token_sort_ratio is a variant of ratio that sorts the words in both sequences before comparing them:

    s1 = " ".join(sorted(s1.split()))
    s2 = " ".join(sorted(s2.split()))
    fuzz.ratio(s1, s2)
    

    In your example, token_sort_ratio gives the same result as ratio, since each input is a single word and is therefore already "sorted".
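
    To see a case where the sorting actually matters, take two made-up strings that contain the same words in a different order: the plain ratio is penalized by the different word order, while token_sort_ratio is not:

    >>> from rapidfuzz import fuzz
    >>> s1 = 'wuzzy fuzzy was a bear'
    >>> s2 = 'fuzzy wuzzy was a bear'
    >>> fuzz.token_sort_ratio(s1, s2)   # both sort to 'a bear fuzzy was wuzzy'
    100.0
    >>> fuzz.ratio(s1, s2) < 100        # the unsorted strings differ, so ratio stays below 100
    True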