Search code examples
rintersect

Find length of intersect between all pairs of variables


I have a large dataset (about 130 columns) that looks like this:

data <- data.frame(AA = c("Apple", "Banana", "0", "Cherry", "0"),
                   AB = c("0", "0", "Apple", "Cherry", "0"),
                   AC = c("0", "0", "0", "0", "Cherry"),
                   AD = c("Cherry", "0", "0", "Banana", "0"))

Can anyone help me find the number of exact matches between all possible pairs of columns? The output should look something like this:

> output
  AA_AB AA_AC AA_AD AB_AC AB_AD AC_AD
      3     2     3     2     2     2

I have tried using length(intersect(data$AA, data$AB)) but I don't know how to apply this to the whole dataset within one command. From searching other queries on Stackoverflow, I also tried first creating a dataframe with all possible column pairs, but I couldn't figure out how to actually make my output match my original column names to then apply the intersect function.

N <- ncol(data)
combos <- expand.grid(1:N,1:N) %>%
  filter(!Var1==Var2)
combos <- combos[!duplicated(t(apply(combos[1:2], 1, sort))),]

Solution

  • You can use combn to create combination of column names and count common values between them using length and intersect.

    val <- combn(names(data), 2, function(x) 
                 length(intersect(data[[x[1]]], data[[x[2]]])))
    
    names(val) <- combn(names(data), 2, paste0, collapse =  '_')
    val
    #AA_AB AA_AC AA_AD AB_AC AB_AD AC_AD 
    #    3     2     3     2     2     2