c# levenshtein-distance

Levenshtein compare strings without changing numbers


I'm looking for a method to find similar symbol names, where those names are often a combination of text and numbers, like "value1", "_value2", "test_5" etc.

To find similar names I tried using the Levenshtein distance, but for the algorithm the difference between "_value1" and ".value1" is the same as between "_value1" and "_value8": both are a single substitution. Is there a way to compare strings without allowing the numbers to change?
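
To make the problem concrete, here is a minimal sketch of the plain unit-cost Levenshtein distance. It is not the code from the linked page; the class name `PlainLevenshtein` is made up for illustration.

```csharp
using System;

// Illustrative class name; plain unit-cost Levenshtein distance.
public static class PlainLevenshtein
{
    public static int Compute(string s, string t)
    {
        int n = s.Length, m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Turning a prefix into the empty string (or back) costs one per character.
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                // Every mismatch costs 1, whether it involves a digit or not.
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[n, m];
    }
}
```

Both `PlainLevenshtein.Compute("_value1", ".value1")` and `PlainLevenshtein.Compute("_value1", "_value8")` return 1, which is exactly the behaviour I want to avoid.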

The code I'm currently using is from http://www.dotnetperls.com/levenshtein

Thanks in advance!


Solution

  • You can give any edit that involves a numeral a very high cost, like 200. This keeps a distance of 1 (similar) between "_text1" and ".text1", but gives a distance of 200 (very dissimilar) between "text1" and "text10".

    You would do this by changing step two of the linked code ...

    // Step 2
    d[0, 0] = 0;
    
    for (int i = 1; i <= n; i++)
    {
        if('0' <= s[i - 1] && s[i - 1] <= '9')
            d[i, 0] = d[i-1, 0] + 200;
        else
            d[i, 0] = d[i-1, 0] + 1;
    }
    
    
    for (int j = 1; j <= m; j++)
    {
        if('0' <= t[j - 1] && t[j - 1] <= '9')
            d[0, j] = d[0, j-1] + 200;
        else
            d[0, j] = d[0, j-1] + 1;
    }
    

    ... and five ...

    // Step 5
    int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
    // A mismatch that touches a digit on either side costs 200 instead of 1.
    if (('0' <= t[j - 1] && t[j - 1] <= '9') ||
        ('0' <= s[i - 1] && s[i - 1] <= '9'))
        cost *= 200;
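
Putting the pieces together, a self-contained version might look like the sketch below. The class name `DigitAwareLevenshtein`, the `DigitCost` constant and the helper methods are made up for illustration. One assumption goes beyond the two snippets above: the insertion and deletion terms inside the main loop are weighted as well. If they stay at `+ 1`, a digit can still be inserted for 1 ("text1" to "text10") or swapped by a delete plus an insert for 2, which would defeat the penalty.

```csharp
using System;

// Hypothetical helper class; names and the penalty value are illustrative.
public static class DigitAwareLevenshtein
{
    // Assumed penalty for any edit that touches an ASCII digit.
    private const int DigitCost = 200;

    private static bool IsDigit(char c) => c >= '0' && c <= '9';

    // Cost of inserting or deleting a single character.
    private static int IndelCost(char c) => IsDigit(c) ? DigitCost : 1;

    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // First column: delete every character of s.
        for (int i = 1; i <= n; i++)
            d[i, 0] = d[i - 1, 0] + IndelCost(s[i - 1]);

        // First row: insert every character of t.
        for (int j = 1; j <= m; j++)
            d[0, j] = d[0, j - 1] + IndelCost(t[j - 1]);

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                char a = s[i - 1];
                char b = t[j - 1];

                // Equal characters are free; a mismatch costs 1,
                // or DigitCost if either side is a digit.
                int substitute = 0;
                if (a != b)
                    substitute = (IsDigit(a) || IsDigit(b)) ? DigitCost : 1;

                int deletion = d[i - 1, j] + IndelCost(a);
                int insertion = d[i, j - 1] + IndelCost(b);
                int substitution = d[i - 1, j - 1] + substitute;

                d[i, j] = Math.Min(Math.Min(deletion, insertion), substitution);
            }
        }
        return d[n, m];
    }
}
```

With this sketch, `Compute("_text1", ".text1")` returns 1, while `Compute("text1", "text10")` and `Compute("_value1", "_value8")` both return 200. Any threshold below 200 then filters out name pairs whose numbers differ.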