R: Count the frequency of pairwise matching strings between all rows of a matrix


I have a 5000 x 1000 matrix of characters in R, with each entry being a color (red, blue, yellow, green, etc.). I would like to compute the frequency of matching colors (character strings) in a pairwise fashion between each row of the matrix across all columns. Each of the 1000 columns presents a different iteration of the color labels with no restrictions on the number of different labels per column. For instance, the first column might have 8 different color labels, while the second column has 10, and the third has 11, etc. I am not interested in the labels themselves, only the frequency that a pair of rows matches or does not across every column.

For example, my character matrix looks something like this (without the artificial regularly repeating color patterns):

colors <- sample(c("grey", "green", "blue", "pink", "brown", "purple", "cyan", "red", "yellow"), 8, replace = TRUE)
labels <- matrix(rep(colors, length.out = 10 * 5), nrow = 10, ncol = 5)
labels
     [,1]     [,2]     [,3]     [,4]     [,5]    
 [1,] "brown"  "purple" "yellow" "green"  "brown" 
 [2,] "grey"   "red"    "brown"  "red"    "grey"  
 [3,] "purple" "yellow" "green"  "brown"  "purple"
 [4,] "red"    "brown"  "red"    "grey"   "red"   
 [5,] "yellow" "green"  "brown"  "purple" "yellow"
 [6,] "brown"  "red"    "grey"   "red"    "brown" 
 [7,] "green"  "brown"  "purple" "yellow" "green" 
 [8,] "red"    "grey"   "red"    "brown"  "red"   
 [9,] "brown"  "purple" "yellow" "green"  "brown" 
[10,] "grey"   "red"    "brown"  "red"    "grey"  

I would like to use this to construct a 5000 x 5000 square, symmetric matrix that corresponds to the frequency of pairwise matches between rows. Each entry [i, j] (and also [j, i]) should be the frequency of a match between the ith and jth rows across all columns. For example, in the toy labels matrix above, row 1 matches row 6 in both the 1st and 5th columns but not the others, so I would want that matching frequency (2/5 = 0.4) to be the entries [1, 6] and [6, 1] of the "frequency matrix". The diagonal would be all 1's since each row always matches itself. Something like this output:

freq.mat
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]  1.0  0.0  0.0  0.0  0.0  0.4  0.0  0.0  1.0   0.0
 [2,]  0.0  1.0  0.0  0.0  0.2  0.4  0.0  0.0  0.0   1.0
 [3,]  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.2  0.0   0.0
 [4,]  0.0  0.0  0.0  1.0  0.0  0.0  0.2  0.6  0.0   0.0
 [5,]  0.0  0.2  0.0  0.0  1.0  0.0  0.0  0.0  0.0   0.2
 [6,]  0.4  0.4  0.0  0.0  0.0  1.0  0.0  0.0  0.4   0.4
 [7,]  0.0  0.0  0.0  0.2  0.0  0.0  1.0  0.0  0.0   0.0
 [8,]  0.0  0.0  0.2  0.6  0.0  0.0  0.0  1.0  0.0   0.0
 [9,]  1.0  0.0  0.0  0.0  0.0  0.4  0.0  0.0  1.0   0.0
[10,]  0.0  1.0  0.0  0.0  0.2  0.4  0.0  0.0  0.0   1.0
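
A single entry of this matrix can be checked directly. The sketch below assumes the toy matrix is rebuilt from the eight colors visible in the first column of the printed output, since the sample() call above is unseeded and would not reproduce it.

```r
# Colors read off rows 1-8 of column 1 of the printed toy matrix (an assumption:
# the original sample() draw cannot be regenerated without a seed)
colors <- c("brown", "grey", "purple", "red", "yellow", "brown", "green", "red")

# Recycle the 8 colors column by column to fill the 10 x 5 matrix
labels <- matrix(rep(colors, length.out = 10 * 5), nrow = 10, ncol = 5)

# Fraction of columns in which rows 1 and 6 carry the same label
# (they agree in columns 1 and 5 only)
mean(labels[1, ] == labels[6, ])
```

Applying the same comparison to every pair of rows would fill the whole frequency matrix.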

I tried to apply a rowSums function as follows:

freq.mat <- apply(labels, 1, function(x) rowSums(x == labels))
diag(freq.mat) <- 1
freq.mat / 10

which generated an appropriately sized matrix, but the entries were not the pairwise row matching frequencies as I hoped. I also tinkered with some nested for loops, but could not make much progress and this also felt very "against the spirit" of R programming.

Could anyone kindly point me in the right direction? Thank you very much!


Solution

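  • You are comparing the wrong values. In x == labels, the row vector x is recycled down the columns of labels (R stores matrices column by column), so its elements do not line up with the columns of each row. Transposing labels turns every row into a column, so x aligns element by element with each column of t(labels), and colMeans then gives the fraction of matching columns for each row: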

    apply(labels, 1, function(x) colMeans(x == t(labels)))
    
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
     [1,]  1.0  0.0  0.0  0.0  0.0  0.4  0.0  0.0  1.0   0.0
     [2,]  0.0  1.0  0.0  0.0  0.2  0.4  0.0  0.0  0.0   1.0
     [3,]  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.2  0.0   0.0
     [4,]  0.0  0.0  0.0  1.0  0.0  0.0  0.2  0.6  0.0   0.0
     [5,]  0.0  0.2  0.0  0.0  1.0  0.0  0.0  0.0  0.0   0.2
     [6,]  0.4  0.4  0.0  0.0  0.0  1.0  0.0  0.0  0.4   0.4
     [7,]  0.0  0.0  0.0  0.2  0.0  0.0  1.0  0.0  0.0   0.0
     [8,]  0.0  0.0  0.2  0.6  0.0  0.0  0.0  1.0  0.0   0.0
     [9,]  1.0  0.0  0.0  0.0  0.0  0.4  0.0  0.0  1.0   0.0
    [10,]  0.0  1.0  0.0  0.0  0.2  0.4  0.0  0.0  0.0   1.0
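
For the full 5000 x 1000 matrix, one possible alternative is to loop over the columns and accumulate an n x n count of matches with outer(), so that each step is a single vectorized comparison. This is a sketch under stated assumptions, not a benchmarked recommendation: the helper name pairwise_match_freq is made up for illustration, and the toy matrix is rebuilt from the printed output in the question.

```r
# Hypothetical helper: fraction of columns in which each pair of rows matches
pairwise_match_freq <- function(labels) {
  n <- nrow(labels)
  counts <- matrix(0, nrow = n, ncol = n)
  for (j in seq_len(ncol(labels))) {
    # n x n logical matrix: TRUE where two rows share a label in column j
    counts <- counts + outer(labels[, j], labels[, j], "==")
  }
  # Convert match counts to frequencies
  counts / ncol(labels)
}

# Toy matrix rebuilt from the printed output (assumed, since sample() was unseeded)
colors <- c("brown", "grey", "purple", "red", "yellow", "brown", "green", "red")
labels <- matrix(rep(colors, length.out = 10 * 5), nrow = 10, ncol = 5)

freq.mat <- pairwise_match_freq(labels)

# Cross-check against the apply() approach shown above
reference <- apply(labels, 1, function(x) colMeans(x == t(labels)))
all.equal(freq.mat, reference)
```

With 5000 rows the result is a 5000 x 5000 double matrix, which takes about 200 MB, so whichever approach is used the output itself has to fit in memory.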