Search code examples
excelvbaduplicatessimilarity

Estimating duplication percentage between rows in Excel


I have an Excel (2010) data file with over 200 variables (columns), and over 1,000 records (rows), each identified with a unique id number. However, I'm suspicious that some of these records are fabricated, i.e., someone took an existing record, replicated it, and changed only a few numbers to make it a little different. Therefore, I need to produce a matrix that will show me the number/percent of "same values" between each record and all other records (e.g., record 1 and record 2 share 75 equal values, record 1 and record 3 share 57 equal values, record 2 and record 3 share 45 equal values, etc.). I have a few workarounds, but they take hours and don't produce a simple matrix. I don't care about the difference between the values - just whether they are equal or not. Any ideas will be greatly appreciated!


Solution

  • Don't know how well this will perform on a huge dataset but:

    Sub T()
    
        Dim d, m(), nR As Long, nC As Long, r As Long, r2 As Long, c As Long
        Dim v1, v2, i As Long
        d = Sheet1.Range("A1").CurrentRegion.Value
        nR = UBound(d, 1)
        nC = UBound(d, 2)
        ReDim m(1 To nR, 1 To nR)
    
        For r = 1 To nR
            For r2 = r To nR
                i = 0
                For c = 1 To nC
                    v1 = d(r, c): If IsError(v1) Then v1 = "Error!"
                    v2 = d(r2, c): If IsError(v2) Then v2 = "Error!"
                    If v1 = v2 Then i = i + 1
                Next c
                m(r2, r) = i
            Next r2
        Next r
    
        With Sheet2
            .Range("B2").Resize(nR, nR).Value = m
            'assuming your id's are in the first column...
            For r = 1 To nR
                .Cells(1 + r, 1) = d(r, 1)
                .Cells(r, r + 1) = d(r, 1)
            Next r
        End With
    
    End Sub