
Large data pairwise calculation in R


I have a very large data frame, data, with nearly 5 million rows, and a character vector, itemlist, with around 2000 items. I need to do a pairwise calculation on the items (call the two items in a pair A and B), so that I end up with a 2000*2000 matrix of values. The value I need for each pair is: (#ids that have both A and B) / min(#ids that have A, #ids that have B)

load("data.RData")  
    
n = length(itemlist) # n=1831
    
a = matrix(0, n, n)

rownames(a) <- colnames(a) <- itemlist

aa = sapply(itemlist, function(x) grepl(x, data$Item))

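# Fill the upper triangle: a[i, j] for every pair i < j
for(i in 1:(n - 1)) {
  for(j in (i + 1):n) {
    a1 <- aa[, i]       # rows that contain item i
    a2 <- aa[, j]       # rows that contain item j
    a3 <- a1 & a2       # rows that contain both items
    a[i, j] <- sum(a3) / min(sum(a1), sum(a2))
  }
  print(i)              # progress indicator
}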

result <- a

This code works, but it is extremely slow (it takes days). Can it be made much faster?


Solution

  • Here's a simple approach using base R (single thread). Let's start from a boolean matrix stored as 0/1 integers, where each row plays the role of one item:

    mat <- as.integer(rnorm(10*10) > 0) |>
      matrix(nrow = 10)
    
    ##>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    ##> [1,]    0    1    0    0    1    1    1    1    0     0
    ##> [2,]    0    0    0    0    0    1    1    0    0     0
    ##> [3,]    1    1    1    0    1    0    1    1    1     0
    ##> [4,]    0    1    1    0    1    0    1    1    1     0
    ##> [5,]    0    1    0    0    0    0    0    1    0     1
    ##> [6,]    1    0    0    0    0    0    1    1    1     0
    ##> [7,]    0    0    0    0    1    1    1    1    0     0
    ##> [8,]    0    1    0    1    1    1    1    0    1     1
    ##> [9,]    1    0    0    1    0    1    1    1    1     1
    ##>[10,]    0    1    1    0    1    0    0    0    0     1
    

    The value S[i,j], representing sum(mat[i,] & mat[j,]), is also given by the scalar product of mat[i,] and mat[j,]: because the entries are 0 or 1, each term of the product is 1 only where both rows are 1. Therefore, the whole matrix S can be obtained as the matrix product of mat and t(mat):

    S <- mat %*% t(mat) 
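
    As a quick sanity check, the sketch below compares the matrix product with the pair-by-pair count on a small random matrix. The seed and the name S_loop are only illustrative, and tcrossprod(mat) is shown as a base R equivalent of mat %*% t(mat):

```r
set.seed(1)  # arbitrary seed, only so the toy example is reproducible
mat <- matrix(as.integer(rnorm(10 * 10) > 0), nrow = 10)

# Co-occurrence counts for all pairs of rows in one matrix product
S <- mat %*% t(mat)

# Brute-force version of the same quantity, one pair of rows at a time
S_loop <- matrix(0, nrow(mat), nrow(mat))
for (i in seq_len(nrow(mat))) {
  for (j in seq_len(nrow(mat))) {
    S_loop[i, j] <- sum(mat[i, ] & mat[j, ])
  }
}

all.equal(S, S_loop)            # compares the product with the loop result
all.equal(S, tcrossprod(mat))   # tcrossprod(x) computes x %*% t(x)
```

    The diagonal of S holds the number of ones in each row, since every row co-occurs with itself.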
    

    The number C[i] of true values in each row mat[i,] can be calculated directly with rowSums(), which gives the same result as apply(mat, 1, sum) without an R-level loop:

    C <- rowSums(mat)
    

    Then, we obtain the matrix H where the element H[i,j] is the minimum of C[i] and C[j].

    H <- outer(C, C, "pmin")
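
    To see what outer() with pmin produces, here is a tiny example with three made-up counts (the values 5, 2 and 7 are hypothetical, not taken from the matrix above):

```r
C <- c(5, 2, 7)          # hypothetical counts for three items
H <- outer(C, C, pmin)   # H[i, j] = min(C[i], C[j])
H
#>      [,1] [,2] [,3]
#> [1,]    5    2    5
#> [2,]    2    2    2
#> [3,]    5    2    7
```

    The result is symmetric, and its diagonal is C itself.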
    

    Finally, we divide S by H element-wise to obtain the desired matrix:

    S/H
    ##>           [,1] [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
    ##> [1,] 1.0000000  1.0 0.8000000 0.8000000 0.6666667 0.5000000 1.0000000
    ##> [2,] 1.0000000  1.0 0.5000000 0.5000000 0.0000000 0.5000000 1.0000000
    ##> [3,] 0.8000000  0.5 1.0000000 1.0000000 0.6666667 1.0000000 0.7500000
    ##> [4,] 0.8000000  0.5 1.0000000 1.0000000 0.6666667 0.7500000 0.7500000
    ##> [5,] 0.6666667  0.0 0.6666667 0.6666667 1.0000000 0.3333333 0.3333333
    ##> [6,] 0.5000000  0.5 1.0000000 0.7500000 0.3333333 1.0000000 0.5000000
    ##> [7,] 1.0000000  1.0 0.7500000 0.7500000 0.3333333 0.5000000 1.0000000
    ##> [8,] 0.8000000  1.0 0.5714286 0.6666667 0.6666667 0.5000000 0.7500000
    ##> [9,] 0.6000000  1.0 0.5714286 0.5000000 0.6666667 1.0000000 0.7500000
    ##>[10,] 0.5000000  0.0 0.7500000 0.7500000 0.6666667 0.0000000 0.2500000
    ##>           [,8]      [,9]     [,10]
    ##> [1,] 0.8000000 0.6000000 0.5000000
    ##> [2,] 1.0000000 1.0000000 0.0000000
    ##> [3,] 0.5714286 0.5714286 0.7500000
    ##> [4,] 0.6666667 0.5000000 0.7500000
    ##> [5,] 0.6666667 0.6666667 0.6666667
    ##> [6,] 0.5000000 1.0000000 0.0000000
    ##> [7,] 0.7500000 0.7500000 0.2500000
    ##> [8,] 1.0000000 0.7142857 0.7500000
    ##> [9,] 0.7142857 1.0000000 0.2500000
    ##>[10,] 0.7500000 0.2500000 1.0000000
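
    In the question, the indicator matrix aa has the opposite orientation (one row per record, one column per item), so the same idea can be written with crossprod() and colSums(). The following is a minimal sketch on toy data; items_col is a hypothetical stand-in for data$Item, and the three item names are made up:

```r
# Hypothetical toy data standing in for data$Item and itemlist
items_col <- c("apple banana", "banana", "apple cherry", "cherry banana apple")
itemlist  <- c("apple", "banana", "cherry")

# Rows = records, columns = items (same layout as `aa` in the question)
aa <- sapply(itemlist, function(x) grepl(x, items_col))

S <- crossprod(aa + 0L)   # S[i, j] = number of records containing both items
C <- colSums(aa)          # C[i]    = number of records containing item i
result <- S / outer(C, C, pmin)
result
```

    Unlike the loop in the question, this fills the whole symmetric matrix, including a diagonal of 1. An item that never occurs has C[i] equal to 0, so its row and column come out as NaN.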