Search code examples
excelvbaduplicateslevenshtein-distance

Find near-duplicates of comma-separated lists using Levenshtein distance


This question based on the answer of my question yesterday.

To solve my problem, Jean-François Corbett suggested a Levenshtein distance approach. Then I found this code somewhere to get Levenshtein distance percentage.

Public Function GetLevenshteinPercentMatch( _
    ByVal string1 As String, ByVal string2 As String, _
    Optional Normalised As Boolean = False) As Single
Dim iLen As Integer
    If Normalised = False Then
        string1 = UCase$(WorksheetFunction.Trim(string1))
        string2 = UCase$(WorksheetFunction.Trim(string2))
    End If
    iLen = WorksheetFunction.Max(Len(string1), Len(string2))
    GetLevenshteinPercentMatch = (iLen - LevenshteinDistance(string1, string2)) / iLen
End Function

'********************************
'*** Compute Levenshtein Distance
'********************************

Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim N As Integer ' length of s
Dim i As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost

  ' Step 1
  N = Len(s)
  m = Len(t)
  If N = 0 Then
    LevenshteinDistance = m
    Exit Function
  End If
  If m = 0 Then
    LevenshteinDistance = N
    Exit Function
  End If
  ReDim d(0 To N, 0 To m) As Integer

  ' Step 2
  For i = 0 To N
    d(i, 0) = i
  Next i

  For j = 0 To m
    d(0, j) = j
  Next j

  ' Step 3

  For i = 1 To N
    s_i = Mid$(s, i, 1)
    ' Step 4
    For j = 1 To m
      t_j = Mid$(t, j, 1)
      ' Step 5
      If s_i = t_j Then
        cost = 0
      Else
        cost = 1
      End If
      ' Step 6
      d(i, j) = WorksheetFunction.Min( _
          d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)

    Next j
  Next i

  ' Step 7
  LevenshteinDistance = d(N, m)

End Function

What I have now is a code that finds exact duplicates in one column,

 Dim duplicate(), i As Long
    Dim delrange As Range, cell As Long
    Dim shtIn As Worksheet, Shtout As Worksheet
    Dim numofrows1
    dim numofrows2
    dim j as long

    Set shtIn = ThisWorkbook.Sheets("process")
    Set Shtout = ThisWorkbook.Sheets("output")

    x = 2
    y = 1

    Set delrange = shtIn.Range("h1:h30000")  'set your range here

    ReDim duplicate(0)
    'search duplicates in 2nd column
    For cell = 1 To delrange.Cells.Count
         If Application.CountIf(delrange, delrange(cell)) > 1 Then
             ReDim Preserve duplicate(i)
             duplicate(i) = delrange(cell).Address
             i = i + 1
         End If
    Next

    'print duplicates
    For i = UBound(duplicate) To LBound(duplicate) Step -1
    Shtout.Cells(x, 1).EntireRow.Value = shtIn.Range(duplicate(i)).EntireRow.Value
    x = x + 1
Next i

numofrows2 = Shtout.Cells(Shtout.Rows.Count, 1).End(xlUp).Row - 1

If Shtout.Cells(2, 1).Value = "" Then
    MsgBox ("No Duplicates Found!")
Else
    MsgBox (numofrows1 & " " & "Potential Duplicates Found")
End If

End Sub

I think that it will be nice if I can combine this two code, but Levenshtein distance is to compare 2 strings. So it can't work together.

I stuck here because I have no idea at all, every reference that I read all tell about comparing two string.

if the parameter this simple : detected as duplicate if the Levenshtein distance percentage is above 90%.

What I must change in this code?


Solution

  • I'm glad my earlier answer was useful to you. You didn't like having to represent each of your possible attributes by one-character symbols...

    Ok, as I try to signal to you in the comments there, it is possible to adapt the Levenshtein Distance algorithm to look not at each character in a string, but at each element of an array instead, and do comparisons based on that. In fact it's quite straightforward to make this change:

    Before 'Step 1, convert your comma-separated strings into arrays like this:

    Dim sSplit() As String
    Dim tSplit() As String
    sSplit = Split(s, ",")
    tSplit = Split(t, ",")
    

    Then replace these four lines of code

    N = Len(s)
    m = Len(t)
    s_i = Mid$(s, i, 1)
    t_j = Mid$(t, j, 1)
    

    with these

    N = UBound(sSplit) + 1
    m = UBound(tSplit) + 1
    s_i = sSplit(i - 1)
    t_j = tSplit(j - 1)
    

    The + 1 and - 1 are there because Split returns a zero-based array.

    Example usage:

    ?LevenshteinDistance("valros,helmet,42","valros,helmet,42")
     0 
    ?LevenshteinDistance("valros,helmet,42","knight,helmet")
     2 
    ?LevenshteinDistance("helmet,iron,knight","plain,helmet")
     3 
    

    Note that 0 means the two strings are identical. You don't need separate code to deal with this.

    With the above you should be able to complete your task.

    One more note: the Damerau–Levenshtein distance may be a more relevant algorithm for you than the Levenshtein distance. The difference is that in addition to insertion/deletion/substitution, the D-M distance also considers transposition of two adjacent characters. Up to you to decide.