r matrix linear-algebra distance similarity

Calculating a similarity matrix in R counting only shared columns of binary data


I'm working in R and trying to calculate the similarity/distance between the rows of a data.frame (each row is an item) according to shared membership in groups (columns). However, I don't want 0 values (i.e. not being a member of a group) to contribute to the similarity. What I want is somewhat like Manhattan distance, but with different handling of 0s.

For example, for this dataset:

Group1 Group2 Group3
0 0 0
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
1 1 1

I want a similarity matrix that looks like this:

1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 1
0 0 1 0 1 0 1 1
0 0 0 1 0 1 1 1
0 1 1 0 2 1 1 2
0 1 0 1 1 2 1 2
0 0 1 1 1 1 2 2
0 1 1 1 2 2 2 3

Note that the diagonal values aren't particularly important for my downstream applications, so alternative methods that give the same output as this but with a different diagonal are a fine solution for me.

Given the first matrix, here is some very slow code that calculates the second (similarity) matrix:

calc_simil <- function(x) {
  n <- nrow(x)
  out <- matrix(nrow = n, ncol = n)
  # Every ordered pair of row indices (i, j)
  combos <- expand.grid(seq_len(n), seq_len(n))
  for (myrow in seq_len(nrow(combos))) {
    i <- combos[myrow, 1]
    j <- combos[myrow, 2]
    temp <- x[c(i, j), ]
    # Count the columns where neither row is 0 and the two rows agree,
    # i.e. the columns where both rows are 1
    out[i, j] <- out[j, i] <-
      sum((1 - apply(temp, MARGIN = 2, function(col) any(col == 0))) *
          (1 - abs(temp[1, ] - temp[2, ])))
  }
  return(out)
}

I know there must be a more efficient way to do this, probably using some matrix multiplication wizardry, but I can't figure it out. I've also looked at various built-in methods to calculate distance, including some functions from R packages, but none seem to calculate this number of shared groups while ignoring shared absences from groups.

Anyone have any suggestions? Have I simply overlooked a common built-in distance method? Or is there some much faster way to calculate this distance/similarity?


Solution

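  • You can simply use tcrossprod, i.e. as.matrix(df) %*% t(as.matrix(df)). For 0/1 data, entry (i, j) of this product counts the groups in which both row i and row j are 1, so shared absences contribute nothing.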

    tcrossprod(as.matrix(df))
    
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
    [1,]    0    0    0    0    0    0    0    0
    [2,]    0    1    0    0    1    1    0    1
    [3,]    0    0    1    0    1    0    1    1
    [4,]    0    0    0    1    0    1    1    1
    [5,]    0    1    1    0    2    1    1    2
    [6,]    0    1    0    1    1    2    1    2
    [7,]    0    0    1    1    1    1    2    2
    [8,]    0    1    1    1    2    2    2    3
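
As a rough check of why the cross-product gives the requested counts, the sketch below rebuilds the example data from the question and compares tcrossprod against a direct count of shared memberships for every pair of rows. The names m, sim, ref, sim_offdiag and the helper shared_count are made up for this illustration and are not part of the original answer.

```r
# Example data from the question: 8 items (rows), 3 groups (columns)
df <- data.frame(
  Group1 = c(0, 1, 0, 0, 1, 1, 0, 1),
  Group2 = c(0, 0, 1, 0, 1, 0, 1, 1),
  Group3 = c(0, 0, 0, 1, 0, 1, 1, 1)
)
m <- as.matrix(df)

# sim[i, j] = sum over groups k of m[i, k] * m[j, k];
# each term is 1 only when both items belong to group k
sim <- tcrossprod(m)

# Reference: count shared memberships directly for every pair of rows
# (shared_count is a hypothetical helper, used only for this check)
shared_count <- function(i, j) sum(m[i, ] == 1 & m[j, ] == 1)
n <- nrow(m)
ref <- outer(seq_len(n), seq_len(n), Vectorize(shared_count))
stopifnot(all(sim == ref))

# The diagonal is each item's own group count
stopifnot(all(diag(sim) == rowSums(m)))

# Optional: zero the diagonal if self-similarity is not wanted
sim_offdiag <- sim
diag(sim_offdiag) <- 0
```

Since the question says the diagonal does not matter downstream, the last step is optional; it is shown only because a zeroed diagonal is sometimes more convenient when the matrix is later treated as pairwise similarities.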