I have a very large data frame (data) containing nearly 5 million rows. I also have a character list (itemlist) containing around 2000 items. I need to do a pairwise calculation on the items (let's call one A and the other B), so that at the end I have a 2000*2000 matrix of values. The value I need is: (# of ids that have both A and B) / min(# of ids that have A, # of ids that have B)
load("data.RData")
n = length(itemlist) # n=1831
a = matrix(0, n, n)
rownames(a) <- colnames(a) <- itemlist
aa = sapply(itemlist, function(x) grepl(x, data$Item))
for(i in 1:1830) {
  for(j in (i+1):1831) {
    a1 <- aa[,i]
    a2 <- aa[,j]
    a3 <- a1 & a2
    a[i,j] <- sum(a3) / min(sum(a1), sum(a2))
  }
  print(i)
}
result <- a
This code works, but it is extremely slow (it takes days). Is there a way to make it much faster?
Here's a simple approach using base R (single thread). Let's start from a boolean matrix represented as 0,1:
mat <- as.integer(rnorm(10*10) > 0) |>
matrix(nrow = 10)
##> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##> [1,] 0 1 0 0 1 1 1 1 0 0
##> [2,] 0 0 0 0 0 1 1 0 0 0
##> [3,] 1 1 1 0 1 0 1 1 1 0
##> [4,] 0 1 1 0 1 0 1 1 1 0
##> [5,] 0 1 0 0 0 0 0 1 0 1
##> [6,] 1 0 0 0 0 0 1 1 1 0
##> [7,] 0 0 0 0 1 1 1 1 0 0
##> [8,] 0 1 0 1 1 1 1 0 1 1
##> [9,] 1 0 0 1 0 1 1 1 1 1
##>[10,] 0 1 1 0 1 0 0 0 0 1
The value S[i,j], representing sum(mat[i,] & mat[j,]), is also given by the scalar product of mat[i,] and mat[j,]. Therefore, the matrix S can be obtained by the matrix product of mat and t(mat):
S <- mat %*% t(mat)
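As a quick sanity check of that identity, here is a minimal sketch that compares the matrix product with an explicit pair-by-pair count. The seed is arbitrary and only there for reproducibility; S_loop is a name introduced for this illustration.

```r
set.seed(1)  # arbitrary seed (assumption), only for reproducibility
mat <- matrix(as.integer(rnorm(10 * 10) > 0), nrow = 10)

# Matrix product; tcrossprod(mat) computes the same thing
S <- mat %*% t(mat)

# Reference: count co-occurrences pair by pair, as in the question's loop
S_loop <- matrix(0, nrow(mat), nrow(mat))
for (i in seq_len(nrow(mat))) {
  for (j in seq_len(nrow(mat))) {
    S_loop[i, j] <- sum(mat[i, ] & mat[j, ])
  }
}

# Both approaches should agree on every entry
stopifnot(all(S == S_loop))
```

The single matrix product replaces the whole double loop, which is where most of the speed-up is expected to come from.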
The sum C[i] of true values in each row mat[i,] can be calculated straightforwardly:
C <- apply(mat, 1, sum)
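For what it's worth, the same counts can also be obtained with rowSums(), or read off the diagonal of the product matrix, since an item always co-occurs with itself. A small sketch on a made-up 3 x 3 matrix (the names C_apply, C_fast and C_diag are only for illustration):

```r
# Made-up 0/1 matrix: one row per item
mat <- matrix(c(1L, 0L, 1L,
                0L, 0L, 1L,
                1L, 1L, 1L), nrow = 3, byrow = TRUE)

C_apply <- apply(mat, 1, sum)       # as in the answer
C_fast  <- rowSums(mat)             # vectorised equivalent
C_diag  <- diag(mat %*% t(mat))     # diagonal of the co-occurrence matrix

# All three give the per-row counts 2, 1, 3
stopifnot(all(C_apply == C_fast), all(C_fast == C_diag))
```

On a large matrix, rowSums() is usually the cheaper choice compared with apply().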
Then, we obtain the matrix H where the element H[i,j] is the minimum of C[i] and C[j].
H <- outer(C, C, "pmin")
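To make the outer() step concrete, here is a tiny sketch with made-up counts. Each entry of H is the smaller of the two counts for that pair:

```r
# Made-up per-item counts
C <- c(2, 1, 3)

# H[i, j] = min(C[i], C[j]); pmin is the element-wise minimum
H <- outer(C, C, pmin)

H
##>      [,1] [,2] [,3]
##> [1,]    2    1    2
##> [2,]    1    1    1
##> [3,]    2    1    3
```

pmin is needed here rather than min, because outer() expects a vectorised function that returns one value per pair.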
Finally, we divide S by H to obtain the desired matrix:
S/H
##> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
##> [1,] 1.0000000 1.0 0.8000000 0.8000000 0.6666667 0.5000000 1.0000000
##> [2,] 1.0000000 1.0 0.5000000 0.5000000 0.0000000 0.5000000 1.0000000
##> [3,] 0.8000000 0.5 1.0000000 1.0000000 0.6666667 1.0000000 0.7500000
##> [4,] 0.8000000 0.5 1.0000000 1.0000000 0.6666667 0.7500000 0.7500000
##> [5,] 0.6666667 0.0 0.6666667 0.6666667 1.0000000 0.3333333 0.3333333
##> [6,] 0.5000000 0.5 1.0000000 0.7500000 0.3333333 1.0000000 0.5000000
##> [7,] 1.0000000 1.0 0.7500000 0.7500000 0.3333333 0.5000000 1.0000000
##> [8,] 0.8000000 1.0 0.5714286 0.6666667 0.6666667 0.5000000 0.7500000
##> [9,] 0.6000000 1.0 0.5714286 0.5000000 0.6666667 1.0000000 0.7500000
##>[10,] 0.5000000 0.0 0.7500000 0.7500000 0.6666667 0.0000000 0.2500000
##> [,8] [,9] [,10]
##> [1,] 0.8000000 0.6000000 0.5000000
##> [2,] 1.0000000 1.0000000 0.0000000
##> [3,] 0.5714286 0.5714286 0.7500000
##> [4,] 0.6666667 0.5000000 0.7500000
##> [5,] 0.6666667 0.6666667 0.6666667
##> [6,] 0.5000000 1.0000000 0.0000000
##> [7,] 0.7500000 0.7500000 0.2500000
##> [8,] 1.0000000 0.7142857 0.7500000
##> [9,] 0.7142857 1.0000000 0.2500000
##>[10,] 0.7500000 0.2500000 1.0000000
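Note that in the question the items are the columns of aa (one row per record), whereas the example above has one item per row. A minimal sketch of the same steps for that layout follows; overlap_matrix is a hypothetical helper name, and the small aa matrix and its item names are made up for illustration.

```r
# Hypothetical helper: aa is assumed to be a logical matrix with
# one row per record and one column per item, as in the question.
overlap_matrix <- function(aa) {
  x <- aa * 1                 # logical -> numeric 0/1 (creates a numeric copy)
  S <- crossprod(x)           # t(x) %*% x: S[i, j] = rows where items i and j are both TRUE
  C <- colSums(x)             # C[i] = rows where item i is TRUE
  H <- outer(C, C, pmin)      # H[i, j] = min(C[i], C[j])
  S / H                       # NaN (0 / 0) for items that never occur
}

# Made-up example: 6 records, 3 items
aa <- cbind(
  apple = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  pear  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  kiwi  = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
res <- overlap_matrix(aa)

# apple and pear share 2 records, each occurs 3 times: 2 / 3
stopifnot(isTRUE(all.equal(res["apple", "pear"], 2 / 3)))
```

The result is symmetric with 1 on the diagonal. To match the question's output, which only fills the upper triangle, something like res[lower.tri(res, diag = TRUE)] <- 0 could be applied afterwards. For data of the size described in the question, a dense numeric copy of aa may be too large for memory, so this is a sketch of the idea rather than a tuned solution.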