This question based on the answer of my question yesterday.
To solve my problem, Jean-François Corbett suggested a Levenshtein distance approach. Then I found this code somewhere to get Levenshtein distance percentage.
Public Function GetLevenshteinPercentMatch( _
ByVal string1 As String, ByVal string2 As String, _
Optional Normalised As Boolean = False) As Single
Dim iLen As Integer
If Normalised = False Then
string1 = UCase$(WorksheetFunction.Trim(string1))
string2 = UCase$(WorksheetFunction.Trim(string2))
End If
iLen = WorksheetFunction.Max(Len(string1), Len(string2))
GetLevenshteinPercentMatch = (iLen - LevenshteinDistance(string1, string2)) / iLen
End Function
'********************************
'*** Compute Levenshtein Distance
'********************************
Public Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
Dim d() As Integer ' matrix
Dim m As Integer ' length of t
Dim N As Integer ' length of s
Dim i As Integer ' iterates through s
Dim j As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost
' Step 1
N = Len(s)
m = Len(t)
If N = 0 Then
LevenshteinDistance = m
Exit Function
End If
If m = 0 Then
LevenshteinDistance = N
Exit Function
End If
ReDim d(0 To N, 0 To m) As Integer
' Step 2
For i = 0 To N
d(i, 0) = i
Next i
For j = 0 To m
d(0, j) = j
Next j
' Step 3
For i = 1 To N
s_i = Mid$(s, i, 1)
' Step 4
For j = 1 To m
t_j = Mid$(t, j, 1)
' Step 5
If s_i = t_j Then
cost = 0
Else
cost = 1
End If
' Step 6
d(i, j) = WorksheetFunction.Min( _
d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
Next j
Next i
' Step 7
LevenshteinDistance = d(N, m)
End Function
What I have now is a code that finds exact duplicates in one column,
Dim duplicate(), i As Long
Dim delrange As Range, cell As Long
Dim shtIn As Worksheet, Shtout As Worksheet
Dim numofrows1
dim numofrows2
dim j as long
Set shtIn = ThisWorkbook.Sheets("process")
Set Shtout = ThisWorkbook.Sheets("output")
x = 2
y = 1
Set delrange = shtIn.Range("h1:h30000") 'set your range here
ReDim duplicate(0)
'search duplicates in 2nd column
For cell = 1 To delrange.Cells.Count
If Application.CountIf(delrange, delrange(cell)) > 1 Then
ReDim Preserve duplicate(i)
duplicate(i) = delrange(cell).Address
i = i + 1
End If
Next
'print duplicates
For i = UBound(duplicate) To LBound(duplicate) Step -1
Shtout.Cells(x, 1).EntireRow.Value = shtIn.Range(duplicate(i)).EntireRow.Value
x = x + 1
Next i
numofrows2 = Shtout.Cells(Shtout.Rows.Count, 1).End(xlUp).Row - 1
If Shtout.Cells(2, 1).Value = "" Then
MsgBox ("No Duplicates Found!")
Else
MsgBox (numofrows1 & " " & "Potential Duplicates Found")
End If
End Sub
I think that it will be nice if I can combine this two code, but Levenshtein distance is to compare 2 strings. So it can't work together.
I stuck here because I have no idea at all, every reference that I read all tell about comparing two string.
if the parameter this simple : detected as duplicate if the Levenshtein distance percentage is above 90%.
What I must change in this code?
I'm glad my earlier answer was useful to you. You didn't like having to represent each of your possible attributes by one-character symbols...
Ok, as I try to signal to you in the comments there, it is possible to adapt the Levenshtein Distance algorithm to look not at each character in a string, but at each element of an array instead, and do comparisons based on that. In fact it's quite straightforward to make this change:
Before 'Step 1
, convert your comma-separated strings into arrays like this:
Dim sSplit() As String
Dim tSplit() As String
sSplit = Split(s, ",")
tSplit = Split(t, ",")
Then replace these four lines of code
N = Len(s)
m = Len(t)
s_i = Mid$(s, i, 1)
t_j = Mid$(t, j, 1)
with these
N = UBound(sSplit) + 1
m = UBound(tSplit) + 1
s_i = sSplit(i - 1)
t_j = tSplit(j - 1)
The + 1
and - 1
are there because Split
returns a zero-based array.
Example usage:
?LevenshteinDistance("valros,helmet,42","valros,helmet,42")
0
?LevenshteinDistance("valros,helmet,42","knight,helmet")
2
?LevenshteinDistance("helmet,iron,knight","plain,helmet")
3
Note that 0
means the two strings are identical. You don't need separate code to deal with this.
With the above you should be able to complete your task.
One more note: the Damerau–Levenshtein distance may be a more relevant algorithm for you than the Levenshtein distance. The difference is that in addition to insertion/deletion/substitution, the D-M distance also considers transposition of two adjacent characters. Up to you to decide.