Tags: r, dataframe, similarity

Comparison of variable to all other variables in data frame


I want to check how close one variable is to all of the other variables in a data frame. I want to do this by counting how many times they have the same value in the same row (i.e. the same observation). For instance, in the mtcars dataset, the variables gear and carb have 7 observations in which they have the same value in the same row (i.e. the same car).

I have tried the following, which yields a closeness_matrix. However, the results seem nonsensical. Any idea what is not working?

PS: I also tried to use mapply, which I guess would be faster, but it didn’t work, so I ended up with the nested loop.

MWE:

cols_ls <- colnames(mtcars)

closeness_matrix <- matrix(nrow = ncol(mtcars),
                            ncol = ncol(mtcars))

row.names(closeness_matrix) <- cols_ls; colnames(closeness_matrix) <- cols_ls


for (i in 1:length(cols_ls)){

  for (j in i:length(cols_ls)){

    closeness_matrix[i,j] <- sum(duplicated(mtcars[,c(i,j), with = FALSE])==TRUE)

  }
}

Solution

  • I guess the following will do it (but I'm sure there is a smarter way):

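    closenessFunc <- function(v1, M) {
      # For each column x of M, count the rows where x equals v1
      apply(M, 2, function(x, v2) {
        sum(x == v2)
      }, v2 = v1)
    }
    # Apply the comparison to every column of mtcars in turn
    apply(mtcars, MARGIN = 2, closenessFunc, M = mtcars)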
    

    output:

         mpg cyl disp hp drat wt qsec vs am gear carb
    mpg   32   0    0  0    0  0    0  0  0    0    0
    cyl    0  32    0  0    0  0    0  0  0    8    2
    disp   0   0   32  0    0  0    0  0  0    0    0
    hp     0   0    0 32    0  0    0  0  0    0    0
    drat   0   0    0  0   32  0    0  0  0    1    0
    wt     0   0    0  0    0 32    0  0  0    0    0
    qsec   0   0    0  0    0  0   32  0  0    0    0
    vs     0   0    0  0    0  0    0 32 19    0    7
    am     0   0    0  0    0  0    0 19 32    0    4
    gear   0   8    0  0    1  0    0  0  0   32    7
    carb   0   2    0  0    0  0    0  7  4    7   32
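
For comparison, here is a minimal sketch of two other ways to get the same counts. It assumes, as in the question, that "closeness" means the number of rows in which two columns hold equal values. The names closeness_loop, closeness_sapply and pair_matches are illustrative and do not come from the original post. The loop in the question most likely went wrong because duplicated() flags rows whose pair of values repeats an earlier row, which is not the same as the two columns being equal in that row. In addition, with = FALSE is data.table syntax, so on a plain data frame that call would be expected to fail.

```r
# Assumption: "closeness" = number of rows where two columns hold the same value.
cols_ls <- colnames(mtcars)

# 1) The nested loop from the question, comparing the two columns element-wise
#    instead of calling duplicated().
closeness_loop <- matrix(NA_integer_,
                         nrow = length(cols_ls), ncol = length(cols_ls),
                         dimnames = list(cols_ls, cols_ls))
for (i in seq_along(cols_ls)) {
  for (j in i:length(cols_ls)) {
    n_equal <- sum(mtcars[[i]] == mtcars[[j]])
    closeness_loop[i, j] <- n_equal
    closeness_loop[j, i] <- n_equal  # the count is symmetric, so fill both halves
  }
}

# 2) The same counts without an explicit loop.
#    pair_matches is a hypothetical helper, not part of any package.
pair_matches <- function(a, b) sum(a == b)
closeness_sapply <- sapply(mtcars, function(a) sapply(mtcars, pair_matches, b = a))

closeness_loop["gear", "carb"]  # 7, the value quoted in the question
```

Both versions index the data frame by column, so they avoid the conversion to a matrix that apply() performs. That does not matter for mtcars, where every column is numeric, but it could matter for a data frame with mixed column types.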