I have a 5000 x 1000 matrix of characters in R, with each entry being a color (red, blue, yellow, green, etc.). I would like to compute the frequency of matching colors (character strings) in a pairwise fashion between each row of the matrix across all columns. Each of the 1000 columns represents a different iteration of the color labels, with no restriction on the number of different labels per column. For instance, the first column might have 8 different color labels, while the second column has 10, and the third has 11, etc. I am not interested in the labels themselves, only in how often a given pair of rows matches across the columns.
For example, my character matrix looks something like this (without the artificial regularly repeating color patterns):
colors <- sample(c("grey", "green", "blue", "pink", "brown", "purple", "cyan", "red", "yellow"), 8, replace = TRUE)
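# Recycle the 8 sampled colors to fill all 50 cells (avoids the data-length warning from matrix())
labels <- matrix(rep(colors, length.out = 50), nrow = 10, ncol = 5)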
labels
      [,1]     [,2]     [,3]     [,4]     [,5]
 [1,] "brown"  "purple" "yellow" "green"  "brown"
 [2,] "grey"   "red"    "brown"  "red"    "grey"
 [3,] "purple" "yellow" "green"  "brown"  "purple"
 [4,] "red"    "brown"  "red"    "grey"   "red"
 [5,] "yellow" "green"  "brown"  "purple" "yellow"
 [6,] "brown"  "red"    "grey"   "red"    "brown"
 [7,] "green"  "brown"  "purple" "yellow" "green"
 [8,] "red"    "grey"   "red"    "brown"  "red"
 [9,] "brown"  "purple" "yellow" "green"  "brown"
[10,] "grey"   "red"    "brown"  "red"    "grey"
I would like to use this to construct a 5000 x 5000 square, symmetric matrix that corresponds to the frequency of pairwise matches between rows. Each entry [i, j] (and also [j, i]) should be the frequency of a match between the ith and jth rows across all columns. For example, in the toy labels matrix above, row 1 matches row 6 in both the 1st and 5th columns but not the others, so I would want that matching frequency (2/5 = 0.4) to be the entries [1, 6] and [6, 1] of the "frequency matrix". The diagonal would be all 1's since each row always matches itself. Something like this output:
freq.mat
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    0    0    0    0  0.4    0    0    1     0
 [2,]    0    1    0    0  0.2  0.4    0    0    0     1
 [3,]    0    0    1    0    0    0    0  0.2    0     0
 [4,]    0    0    0    1    0    0  0.2  0.6    0     0
 [5,]    0  0.2    0    0    1    0    0    0    0   0.2
 [6,]  0.4  0.4    0    0    0    1    0    0  0.4   0.4
 [7,]    0    0    0  0.2    0    0    1    0    0     0
 [8,]    0    0  0.2  0.6    0    0    0    1    0     0
 [9,]    1    0    0    0    0  0.4    0    0    1     0
[10,]    0    1    0    0  0.2  0.4    0    0    0     1
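To make the definition concrete, a single entry of that matrix can be sketched as the mean of an element-wise comparison between two rows. The vectors row1 and row6 below are hypothetical names, typed in by hand from rows 1 and 6 of the toy labels matrix printed earlier; this only illustrates what one entry means, not how to build the whole matrix.

```r
# Rows 1 and 6 of the toy matrix, copied by hand (illustrative names, not from my real data)
row1 <- c("brown", "purple", "yellow", "green", "brown")
row6 <- c("brown", "red", "grey", "red", "brown")

# TRUE/FALSE per column; the mean of a logical vector is the share of TRUEs
pair_freq <- mean(row1 == row6)
pair_freq
# [1] 0.4
```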
I tried to apply a rowSums function as follows:
freq.mat <- apply(labels, 1, function(x) rowSums(x == labels))
diag(freq.mat) <- 1
freq.mat / 10
which generated an appropriately sized matrix, but the entries were not the pairwise row matching frequencies as I hoped. I also tinkered with some nested for loops, but could not make much progress and this also felt very "against the spirit" of R programming.
Could anyone kindly point me in the right direction? Thank you very much!
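You are comparing the wrong values. In x == labels, the row x is recycled down the columns of labels (R fills matrices column by column), so its entries are not lined up with the entries of each row. Transposing labels fixes the alignment, because each column of t(labels) is one original row, and colMeans then gives the share of matching columns directly: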
apply(labels, 1, function(x) colMeans(x == t(labels)))
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]  1.0  0.0  0.0  0.0  0.0  0.4  0.0  0.0  1.0   0.0
 [2,]  0.0  1.0  0.0  0.0  0.2  0.4  0.0  0.0  0.0   1.0
 [3,]  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.2  0.0   0.0
 [4,]  0.0  0.0  0.0  1.0  0.0  0.0  0.2  0.6  0.0   0.0
 [5,]  0.0  0.2  0.0  0.0  1.0  0.0  0.0  0.0  0.0   0.2
 [6,]  0.4  0.4  0.0  0.0  0.0  1.0  0.0  0.0  0.4   0.4
 [7,]  0.0  0.0  0.0  0.2  0.0  0.0  1.0  0.0  0.0   0.0
 [8,]  0.0  0.0  0.2  0.6  0.0  0.0  0.0  1.0  0.0   0.0
 [9,]  1.0  0.0  0.0  0.0  0.0  0.4  0.0  0.0  1.0   0.0
[10,]  0.0  1.0  0.0  0.0  0.2  0.4  0.0  0.0  0.0   1.0
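If you would rather loop over the 1000 columns than over the 5000 rows, one possible variant is to accumulate per-column match indicators with outer(). This is only a sketch: the helper name match_freq is made up for illustration, the toy matrix is rebuilt deterministically from the colors visible in the printed labels output in the question, and the function assumes there are no NA entries. Each outer() call allocates an n x n logical matrix (roughly 100 MB for n = 5000), and whether this is actually faster than the apply() version on your data would need benchmarking.

```r
# Toy data: the 8 colors read off the first column of the printed matrix,
# recycled to fill a 10 x 5 matrix (reproduces the example without sample()).
colors <- c("brown", "grey", "purple", "red", "yellow", "brown", "green", "red")
labels <- matrix(rep(colors, length.out = 50), nrow = 10, ncol = 5)

# Hypothetical helper: share of columns in which each pair of rows has the same label.
# Assumes m contains no NA values.
match_freq <- function(m) {
  counts <- matrix(0, nrow(m), nrow(m))
  for (j in seq_len(ncol(m))) {
    # outer() compares every row's label in column j with every other row's label;
    # adding the logical matrix counts TRUE as 1.
    counts <- counts + outer(m[, j], m[, j], "==")
  }
  counts / ncol(m)
}

freq <- match_freq(labels)

# Cross-check against the apply() one-liner above
ref <- apply(labels, 1, function(x) colMeans(x == t(labels)))
all.equal(freq, ref)
# [1] TRUE
```

The diagonal comes out as 1 automatically, since every row matches itself in every column, so no separate diag() assignment is needed.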