I have an Excel (2010) data file with over 200 variables (columns), and over 1,000 records (rows), each identified with a unique id number. However, I'm suspicious that some of these records are fabricated, i.e., someone took an existing record, replicated it, and changed only a few numbers to make it a little different. Therefore, I need to produce a matrix that will show me the number/percent of "same values" between each record and all other records (e.g., record 1 and record 2 share 75 equal values, record 1 and record 3 share 57 equal values, record 2 and record 3 share 45 equal values, etc.). I have a few workarounds, but they take hours and don't produce a simple matrix. I don't care about the difference between the values - just whether they are equal or not. Any ideas will be greatly appreciated!
Don't know how well this will perform on a huge dataset but:
Sub T()
Dim d, m(), nR As Long, nC As Long, r As Long, r2 As Long, c As Long
Dim v1, v2, i As Long
d = Sheet1.Range("A1").CurrentRegion.Value
nR = UBound(d, 1)
nC = UBound(d, 2)
ReDim m(1 To nR, 1 To nR)
For r = 1 To nR
For r2 = r To nR
i = 0
For c = 1 To nC
v1 = d(r, c): If IsError(v1) Then v1 = "Error!"
v2 = d(r2, c): If IsError(v2) Then v2 = "Error!"
If v1 = v2 Then i = i + 1
Next c
m(r2, r) = i
Next r2
Next r
With Sheet2
.Range("B2").Resize(nR, nR).Value = m
'assuming your id's are in the first column...
For r = 1 To nR
.Cells(1 + r, 1) = d(r, 1)
.Cells(r, r + 1) = d(r, 1)
Next r
End With
End Sub