Working in R. I'm trying to calculate the similarity/distance between rows of a data.frame (each row is an item) according to shared membership in groups (columns). However, I don't want 0 values (i.e. not being a member of a group) to contribute to the similarity. (What I want is something like Manhattan distance, but with 0's handled differently.)
For example, for this dataset:
Group1 | Group2 | Group3 |
---|---|---|
0 | 0 | 0 |
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
1 | 1 | 0 |
1 | 0 | 1 |
0 | 1 | 1 |
1 | 1 | 1 |
I want a similarity matrix that looks like this:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |
0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
0 | 1 | 1 | 0 | 2 | 1 | 1 | 2 |
0 | 1 | 0 | 1 | 1 | 2 | 1 | 2 |
0 | 0 | 1 | 1 | 1 | 1 | 2 | 2 |
0 | 1 | 1 | 1 | 2 | 2 | 2 | 3 |
Note that the diagonal values aren't particularly important for my downstream applications, so alternative methods that give the same output as this but with a different diagonal are a fine solution for me.
Given the first matrix, here is some very slow code that calculates the second (similarity) matrix:
calc_simil <- function(x) {
  out <- matrix(nrow = nrow(x), ncol = nrow(x))
  combos <- expand.grid(1:nrow(x), 1:nrow(x))
  for (myrow in 1:nrow(combos)) {
    i <- combos[myrow, 1]
    j <- combos[myrow, 2]
    temp <- x[c(i, j), ]
    # count columns where the two rows agree and neither value is 0
    out[i, j] <- out[j, i] <-
      sum((1 - apply(temp, MARGIN = 2, FUN = function(col) any(col == 0))) *
            (1 - abs(temp[1, ] - temp[2, ])))
  }
  return(out)
}
I know there must be a more efficient way to do this, probably using some matrix multiplication wizardry, but I can't figure it out. I've also looked at various built-in methods to calculate distance, including some functions from R packages, but none seem to calculate this number of shared groups while ignoring shared absences from groups.
Anyone have any suggestions? Have I simply overlooked a common built-in distance method? Or is there some much faster way to calculate this distance/similarity?
You can simply do a `tcrossprod`, i.e. `as.matrix(df) %*% t(as.matrix(df))`. For a 0/1 matrix, entry (i, j) is the dot product of rows i and j, which counts the columns where both rows are 1 — shared 0's contribute nothing to the sum.
tcrossprod(as.matrix(df))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0 0 0 0 0 0 0 0
[2,] 0 1 0 0 1 1 0 1
[3,] 0 0 1 0 1 0 1 1
[4,] 0 0 0 1 0 1 1 1
[5,] 0 1 1 0 2 1 1 2
[6,] 0 1 0 1 1 2 1 2
[7,] 0 0 1 1 1 1 2 2
[8,] 0 1 1 1 2 2 2 3
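As a quick sanity check, here is the membership matrix from the question built explicitly (the name `df` is just an assumed variable name) and fed to `tcrossprod`:

```r
# Membership matrix from the question: 8 items x 3 groups
df <- data.frame(
  Group1 = c(0, 1, 0, 0, 1, 1, 0, 1),
  Group2 = c(0, 0, 1, 0, 1, 0, 1, 1),
  Group3 = c(0, 0, 0, 1, 0, 1, 1, 1)
)

m <- as.matrix(df)
sim <- tcrossprod(m)  # equivalent to m %*% t(m), but faster

# Entry (i, j) counts groups where both item i and item j are members;
# shared absences contribute 0 * 0 = 0, so they are ignored.
sim[5, 8]  # rows 5 (1,1,0) and 8 (1,1,1) share 2 groups -> 2
```

The diagonal ends up holding each row's total group count (`rowSums(m)`), which differs from `calc_simil` only on the diagonal — acceptable per the question.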