I have a large dataset (about 130 columns) that looks like this:
data <- data.frame(AA = c("Apple", "Banana", "0", "Cherry", "0"),
AB = c("0", "0", "Apple", "Cherry", "0"),
AC = c("0", "0", "0", "0", "Cherry"),
AD = c("Cherry", "0", "0", "Banana", "0"))
Can anyone help me find the number of exact matches between all possible pairs of columns? The output should look something like this:
> output
AA_AB AA_AC AA_AD AB_AC AB_AD AC_AD
3 2 3 2 2 2
I have tried using length(intersect(data$AA, data$AB))
but I don't know how to apply this to the whole dataset within one command.
From searching other queries on Stackoverflow, I also tried first creating a dataframe with all possible column pairs, but I couldn't figure out how to actually make my output match my original column names to then apply the intersect
function.
N <- ncol(data)
combos <- expand.grid(1:N,1:N) %>%
filter(!Var1==Var2)
combos <- combos[!duplicated(t(apply(combos[1:2], 1, sort))),]
You can use combn
to create combination of column names and count common values between them using length
and intersect
.
val <- combn(names(data), 2, function(x)
length(intersect(data[[x[1]]], data[[x[2]]])))
names(val) <- combn(names(data), 2, paste0, collapse = '_')
val
#AA_AB AA_AC AA_AD AB_AC AB_AD AC_AD
# 3 2 3 2 2 2