Tags: r, distance, matching, similarity, metric

Simple matching similarity matrix for continuous, non-binary data?


Given the following data frame:

test <- structure(list(X1 = c(1L, 2L, 3L, 4L, 2L, 5L), X2 = c(2L, 3L, 
4L, 5L, 3L, 6L), X3 = c(3L, 4L, 4L, 5L, 3L, 2L), X4 = c(2L, 4L, 
6L, 5L, 3L, 8L), X5 = c(1L, 3L, 2L, 4L, 6L, 4L)), .Names = c("X1", 
"X2", "X3", "X4", "X5"), class = "data.frame", row.names = c(NA, 
-6L))

I want to create a 5 x 5 matrix of pairwise column similarities, where each entry is the number of rows in which two columns hold the same value, divided by the total number of rows. For instance, the entry for X4 and X3 should be 0.5, because the two columns match in 3 of the 6 rows.
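As a quick check of that X3/X4 figure, here is a minimal sketch in base R. The values are copied from the data above; the names `x3`, `x4`, `matches` and `ratio` are only illustrative.

```r
# Columns X3 and X4 from the example data
x3 <- c(3L, 4L, 4L, 5L, 3L, 2L)
x4 <- c(2L, 4L, 6L, 5L, 3L, 8L)

matches <- sum(x3 == x4)         # number of rows where both columns agree
ratio   <- matches / length(x3)  # share of matching rows

print(matches)  # 3
print(ratio)    # 0.5
```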

I have tried dist(test, method="simple matching") from the "proxy" package, but this method only works for binary data.


Solution

  • Using outer (again :-)

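    my.dist <- function(x) {
      n <- nrow(x)
      idx <- seq_len(ncol(x))
      # outer() passes whole index vectors to its function, so Vectorize()
      # is needed to compare one pair of columns (i, j) at a time
      d <- outer(idx, idx,
                 Vectorize(function(i, j) sum(x[[i]] == x[[j]]) / n))
      # label rows and columns with the column names of x
      dimnames(d) <- list(names(x), names(x))
      d
    }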
    
    my.dist(test)  # test is the data frame from the question
    #           X1        X2  X3  X4        X5
    # X1 1.0000000 0.0000000 0.0 0.0 0.3333333
    # X2 0.0000000 1.0000000 0.5 0.5 0.1666667
    # X3 0.0000000 0.5000000 1.0 0.5 0.0000000
    # X4 0.0000000 0.5000000 0.5 1.0 0.0000000
    # X5 0.3333333 0.1666667 0.0 0.0 1.0000000
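The matrix above is a similarity: it has 1 on the diagonal and larger values mean more matches. If a dissimilarity object is needed instead (for example as input to `hclust`), one possible sketch is to take 1 minus the similarity and wrap it with `as.dist`. That definition is an assumption, not something stated in the question. The nested `sapply` call is simply another way to get the same proportions as `my.dist`, and the data frame is re-created here as `x` so the block runs on its own.

```r
# Example data from the question, rebuilt so this block is self-contained
x <- data.frame(
  X1 = c(1L, 2L, 3L, 4L, 2L, 5L),
  X2 = c(2L, 3L, 4L, 5L, 3L, 6L),
  X3 = c(3L, 4L, 4L, 5L, 3L, 2L),
  X4 = c(2L, 4L, 6L, 5L, 3L, 8L),
  X5 = c(1L, 3L, 2L, 4L, 6L, 4L)
)

# Proportion of row-wise matches for every pair of columns;
# mean() of a logical vector is the share of TRUE values
sim <- sapply(x, function(a) sapply(x, function(b) mean(a == b)))

# Assumption: dissimilarity is defined as 1 - similarity
d <- as.dist(1 - sim)

print(sim)
print(d)
```

With this definition, identical columns have dissimilarity 0 and columns that never match have dissimilarity 1.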