Search code examples
pythonalgorithmcounterlevenshtein-distanceedit-distance

Adding exceptions to Levenshtein-Distance-like algorithm


I'm trying to compute how similar a sequence of up to 6 variables are. Currently I'm using a Collections Counter to return the frequency of different variables as my edit-distance.

By default, the distance in editing a variable (add/sub/change) is either 1 or 0. I'd like to change the distance depending on the variable and what value I set for that variable.

So I can say certain variables are similar to other variables, and provide a value for how similar they are. I also want to say certain variables are worth less or more distance than usual.

Here is my previous post as context: Modify Levenshtein-Distance to ignore order

Example:

# 'c' and 'k' are quite similar, so their distance from eachother is 0.5 instead of 1
>>> groups = {['c','k'] : 0.5}

# the letter 'e' is less significant, and 'x' is very significant
>>> exceptions = {'e': 0.3, 'x': 1.5}

>>> distance('woke', 'woc')
0.8

Explanation:

woke
k -> c = 1
woce
-e = 1
woc
Distance = 2

# With exceptions:
woke
k -> c = 0.5
woce
-e = 0.3
woc
Distance = 0.8

How could I achieve this? Would this be possible to implement with this Counter algorithm?

Current code (thank you David Eisenstat)

def distance(s1, s2):
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    return sum(abs(diff) for diff in cnt.values()) // 2 + \
        (abs(sum(cnt.values())) + 1) // 2

Solution

  • I ended up dividing the process into a few stages then iterating through the strings for each stage. I'm not sure if its as efficient as it could be but it works.

    Summing up what I was trying to achieve (in relation to Edit-distance algorithms)

    • Distance from one letter to another is 1. change j -> k = 1
    • 0 being no difference at all. e.g. change j -> j = 0
    • Similar letters can be worth less than 1 (specified by me) e.g. c and k sound the same, therefore c, k = 0.5, change c -> k = 0.5
    • Certain letters could be worth more or less (specified by me) e.g. x is uncommon so I want it to have more weight, x = 1.4, change x -> k = 1.4

    Created 2 dictionaries, 1 for similar letters, 1 for exceptions

    1. Populate Counter - Iterate through both strings
    2. Match similar items - Iterate string1, if in similar dict, iterate string2, if in similar dict
    3. Update Counter - remove similar items,
    4. Find Distance - add up absolute frequencies, account for difference in string length
    5. Include exceptions distance - Account for exception values based on frequency of letters