Search code examples
c#algorithmlevenshtein-distancefuzzy

Damerau–Levenshtein distance algorithm, disable counting of delete


How can i disable counting of deletion, in this implementation of Damerau-Levenshtein distance algorithm, or if there is other algorithm already implemented please point me to it.

Example(disabled deletion counting):

string1: how are you?

string2: how oyu?

distance: 1 (for transposition, 4 deletes doesn't count)

And here is the algorithm:

    public static int DamerauLevenshteinDistance(string string1, string string2, int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;

        // Return trivial case - where one is empty
        if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
            return (string1 ?? "").Length + (string2 ?? "").Length;


        // Ensure string2 (inner cycle) is longer_transpositionRow
        if (string1.Length > string2.Length)
        {
            var tmp = string1;
            string1 = string2;
            string2 = tmp;
        }

        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            return string2.Length - string1.Length;

        var length1 = string1.Length;
        var length2 = string2.Length;

        var d = new int[length1 + 1, length2 + 1];

        for (var i = 0; i <= d.GetUpperBound(0); i++)
            d[i, 0] = i;

        for (var i = 0; i <= d.GetUpperBound(1); i++)
            d[0, i] = i;

        for (var i = 1; i <= d.GetUpperBound(0); i++)
        {
            var im1 = i - 1;
            var im2 = i - 2;
            var minDistance = threshold;
            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var jm1 = j - 1;
                var jm2 = j - 2;
                var cost = string1[im1] == string2[jm1] ? 0 : 1;

                var del = d[im1, j] + 1;
                var ins = d[i, jm1] + 1;
                var sub = d[im1, jm1] + cost;

                //Math.Min is slower than native code
                //d[i, j] = Math.Min(del, Math.Min(ins, sub));
                d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;

                if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
                    d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost);

                if (d[i, j] < minDistance)
                    minDistance = d[i, j];
            }

            if (minDistance > threshold)
                return int.MaxValue;
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold
            ? int.MaxValue
            : d[d.GetUpperBound(0), d.GetUpperBound(1)];
    }

Solution

  •  public static int DamerauLevenshteinDistance( string string1
                                                , string string2
                                                , int threshold)
    {
        // Return trivial case - where they are equal
        if (string1.Equals(string2))
            return 0;
    
        // Return trivial case - where one is empty
        // WRONG FOR YOUR NEEDS: 
        // if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2))
        //      return (string1 ?? "").Length + (string2 ?? "").Length;
    
        //DO IT THIS WAY:
        if (String.IsNullOrEmpty(string1))
            // First string is empty, so every character of 
            // String2 has been inserted:
            return (string2 ?? "").Length;
        if (String.IsNullOrEmpty(string2))
            // Second string is empty, so every character of string1 
            // has been deleted, but you dont count deletions:
            return 0;
    
        // DO NOT SWAP THE STRINGS IF YOU WANT TO DEAL WITH INSERTIONS
        // IN A DIFFERENT MANNER THEN WITH DELETIONS:
        // THE FOLLOWING IS WRONG FOR YOUR NEEDS:
        // // Ensure string2 (inner cycle) is longer_transpositionRow
        // if (string1.Length > string2.Length)
        // {
        //     var tmp = string1;
        //     string1 = string2;
        //     string2 = tmp;
        // }
    
        // Return trivial case - where string1 is contained within string2
        if (string2.Contains(string1))
            //all changes are insertions
            return string2.Length - string1.Length;
    
        // REVERSE CASE: STRING2 IS CONTAINED WITHIN STRING1
        if (string1.Contains(string2))
            //all changes are deletions which you don't count:
            return 0;
    
        var length1 = string1.Length;
        var length2 = string2.Length;
    
    
        // PAY ATTENTION TO THIS CHANGE!
        // length1+1 rows is way too much! You need only 3 rows (0, 1 and 2)
        // read my explanation below the code!
        // TOO MUCH ROWS: var d = new int[length1 + 1, length2 + 1];
        var d = new int[2, length2 + 1];
    
        // THIS INITIALIZATION COUNTS DELETIONS. YOU DONT WANT IT
        // or (var i = 0; i <= d.GetUpperBound(0); i++)
        //    d[i, 0] = i;
    
        // But you must initiate the first element of each row with 0:
        for (var i = 0; i <= 2; i++)
            d[i, 0] = 0;
    
    
        // This initialization counts insertions. You need it, but for
        // better consistency of code I call the variable j (not i):
        for (var j = 0; j <= d.GetUpperBound(1); j++)
            d[0, j] = j;
    
    
        // Now do the job:
        // for (var i = 1; i <= d.GetUpperBound(0); i++)
        for (var i = 1; i <= length1; i++)
        {
            //Here in this for-loop: add "%3" to evey term 
            // that is used as first index of d!
    
            var im1 = i - 1;
            var im2 = i - 2;
            var minDistance = threshold;
            for (var j = 1; j <= d.GetUpperBound(1); j++)
            {
                var jm1 = j - 1;
                var jm2 = j - 2;
                var cost = string1[im1] == string2[jm1] ? 0 : 1;
    
                // DON'T COUNT DELETIONS!  var del = d[im1, j] + 1;
                var ins = d[i % 3, jm1] + 1;
                var sub = d[im1 % 3, jm1] + cost;
    
                // Math.Min is slower than native code
                // d[i, j] = Math.Min(del, Math.Min(ins, sub));
                // DEL DOES NOT EXIST  
                // d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub;
                d[i % 3, j] = ins <= sub ? ins : sub;
    
                if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1])
                    d[i % 3, j] = Math.Min(d[i % 3, j], d[im2 % 3, jm2] + cost);
    
                if (d[i % 3, j] < minDistance)
                    minDistance = d[i % 3, j];
            }
    
            if (minDistance > threshold)
                return int.MaxValue;
        }
    
        return d[length1 % 3, d.GetUpperBound(1)] > threshold
            ? int.MaxValue
            : d[length1 % 3, d.GetUpperBound(1)];
    }
    

    here comes my explanation why you need only 3 rows:

    Look at this line:

    var d = new int[length1 + 1, length2 + 1];
    

    If one string has the length n and the other has the length m, then your code needs a space of (n+1)*(m+1) integers. Each Integer needs 4 Byte. This is waste of memory if your strings are long. If both strings are 35.000 byte long, you will need more than 4 GB of memory!

    In this code you calculate and write a new value for d[i,j]. And to do this, you read values from its upper neighbor (d[i,jm1]), from its left neighbor (d[im1,j]), from its upper-left neighbor (d[im1,jm1]) and finally from its double-upper-double-left neighbour (d[im2,jm2]). So you just need values from your actual row and 2 rows before.

    You never need values from any other row. So why do you want to store them? Three rows are enough, and my changes make shure, that you can work with this 3 rows without reading any wrong value at any time.