Search code examples
pythonexcelvbasimilarity

Text similarity analysis (Excel)


I have a list of items and I want to identify their similarity in relation to the other items in this list.

My desired output would be something along the lines of: enter image description here

The percentage shown in the similarity column is purely illustrative. I'm thinking that a test for similarity would be something along the lines of:

number of concurrent letters / by the total number of letters in the matched item

But would be keen to get opinions on that one.

Is this something which is reasonably doable on Excel? I'ts a small data set (140kb) containing only alphanumeric values.

Am also open to alternative ways of approaching this, as I haven't tackled anything like this before!

P.s. I've been learning Python for a few months now, so suggestions using Python would also be good!


Solution

  • Here is a solution using a VBA UDF:

    EDIT: Added a new optional argument named arg_lMinConsecutive which is used to determine the minimum number of consecutive characters that must match. Note the extra argument 2 in the below formulas which indicates that at least 2 consecutive characters must match.

    Public Function FuzzyMatch(ByVal arg_sText As String, _
                               ByVal arg_vList As Variant, _
                               ByVal arg_lOutput As Long, _
                               Optional ByVal arg_lMinConsecutive As Long = 1, _
                               Optional ByVal arg_bMatchCase As Boolean = True, _
                               Optional ByVal arg_bExactCount As Boolean = True) _
                    As Variant
    
        Dim dExactCounts As Object
        Dim aResults() As Variant
        Dim vList As Variant
        Dim vListItem As Variant
        Dim sLetter As String
        Dim dMaxMatch As Double
        Dim lMaxIndex As Long
        Dim lResultIndex As Long
        Dim lLastMatch As Long
        Dim i As Long
        Dim bMatch As Boolean
    
        If arg_lMinConsecutive <= 0 Then
            FuzzyMatch = CVErr(xlErrNum)
            Exit Function
        End If
    
        If arg_bExactCount = True Then Set dExactCounts = CreateObject("Scripting.Dictionary")
    
        If TypeName(arg_vList) = "Collection" Or TypeName(arg_vList) = "Range" Then
            ReDim aResults(1 To arg_vList.Count, 1 To 3)
            Set vList = arg_vList
        ElseIf IsArray(arg_vList) Then
            ReDim aResults(1 To UBound(arg_vList) - LBound(arg_vList) + 1, 1 To 3)
            vList = arg_vList
        Else
            ReDim vList(1 To 1)
            vList(1) = arg_vList
            ReDim aResults(1 To 1, 1 To 3)
        End If
    
        dMaxMatch = 0#
        lMaxIndex = 0
        lResultIndex = 0
    
        For Each vListItem In vList
            If vListItem <> arg_sText Then
                lLastMatch = -arg_lMinConsecutive
                lResultIndex = lResultIndex + 1
                aResults(lResultIndex, 3) = vListItem
                If arg_bExactCount Then dExactCounts.RemoveAll
                For i = 1 To Len(arg_sText) - arg_lMinConsecutive + 1
                    bMatch = False
                    sLetter = Mid(arg_sText, i, arg_lMinConsecutive)
                    If Not arg_bMatchCase Then sLetter = LCase(sLetter)
                    If arg_bExactCount Then dExactCounts(sLetter) = dExactCounts(sLetter) + 1
    
                    Select Case Abs(arg_bMatchCase) + Abs(arg_bExactCount) * 2
                        Case 0
                            'MatchCase is false and ExactCount is false
                            If InStr(1, vListItem, sLetter, vbTextCompare) > 0 Then bMatch = True
    
                        Case 1
                            'MatchCase is true and ExactCount is false
                            If InStr(1, vListItem, sLetter) > 0 Then bMatch = True
    
                        Case 2
                            'MatchCase is false and ExactCount is true
                            If Len(vListItem) - Len(Replace(vListItem, sLetter, vbNullString, Compare:=vbTextCompare)) >= dExactCounts(sLetter) Then bMatch = True
    
                        Case 3
                            'MatchCase is true and ExactCount is true
                            If Len(vListItem) - Len(Replace(vListItem, sLetter, vbNullString)) >= dExactCounts(sLetter) Then bMatch = True
    
                    End Select
    
                    If bMatch Then
                        aResults(lResultIndex, 1) = aResults(lResultIndex, 1) + WorksheetFunction.Min(arg_lMinConsecutive, i - lLastMatch)
                        lLastMatch = i
                    End If
                Next i
                If Len(vListItem) > 0 Then
                    aResults(lResultIndex, 2) = aResults(lResultIndex, 1) / Len(vListItem)
                    If aResults(lResultIndex, 2) > dMaxMatch Then
                        dMaxMatch = aResults(lResultIndex, 2)
                        lMaxIndex = lResultIndex
                    End If
                Else
                    aResults(lResultIndex, 2) = 0
                End If
            End If
        Next vListItem
    
        If dMaxMatch = 0# Then
            Select Case arg_lOutput
                Case 1:     FuzzyMatch = 0
                Case 2:     FuzzyMatch = vbNullString
                Case Else:  FuzzyMatch = CVErr(xlErrNum)
            End Select
        Else
            Select Case arg_lOutput
                Case 1:     FuzzyMatch = Application.Min(1, aResults(lMaxIndex, 2))
                Case 2:     FuzzyMatch = aResults(lMaxIndex, 3)
                Case Else:  FuzzyMatch = CVErr(xlErrNum)
            End Select
        End If
    
    End Function
    

    Using only the original data in columns A and B, you can use this UDF to get the desired results in columns C and D:

    enter image description here

    In cell C2 and copied down is this formula:

    =FuzzyMatch($B2,$B$2:$B$6,COLUMN(A2),2)
    

    In cell D2 and copied down is this formula:

    =IFERROR(INDEX(A:A,MATCH(FuzzyMatch($B2,$B$2:$B$6,COLUMN(B2),2),B:B,0)),"-")
    

    Note that they both use the FuzzyMatch UDF.