I have a very sparse n x p count matrix with only non-negative values and columns named y_1, ..., y_p (n = 2 million, p = 70).
Using R, I want to convert it into a p x p matrix that counts the number of times y_i and y_j both have a non-zero value on the same row.
Example:
ID a b c d e
1 1 0 1 0 0
2 0 1 1 0 0
3 0 0 1 1 0
4 1 1 0 0 0
and I want to obtain:
- a b c d e
a 2 1 1 0 0
b 1 2 1 0 0
c 1 1 3 1 0
d 0 0 1 1 0
e 0 0 0 0 0
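This is a simple matrix multiplication: for a 0/1 matrix m, entry (i, j) of t(m) %*% m is the number of rows in which columns i and j are both 1.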
t(m) %*% m
a b c d e
a 2 1 1 0 0
b 1 2 1 0 0
c 1 1 3 1 0
d 0 0 1 1 0
e 0 0 0 0 0
Using this data:
m = read.table(text = "ID a b c d e
1 1 0 1 0 0
2 0 1 1 0 0
3 0 0 1 1 0
4 1 1 0 0 0", header = TRUE)
m = as.matrix(m[, -1])
This relies on the original matrix containing only 1s and 0s. If it does not, you can create such a matrix with m = original_matrix > 0
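To illustrate why the 0/1 conversion matters, here is a minimal sketch on a made-up count matrix (the `counts` values are hypothetical, chosen to have the same non-zero pattern as the example in the question). It uses base R's crossprod(), which computes the same product as t(x) %*% x, and checks the result against a brute-force loop.

```r
# Hypothetical count matrix with entries larger than 1
counts <- matrix(c(3, 0, 2, 0, 0,
                   0, 1, 5, 0, 0,
                   0, 0, 1, 4, 0,
                   2, 7, 0, 0, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("a", "b", "c", "d", "e")))

# Convert to a 0/1 indicator matrix; adding 0 turns TRUE/FALSE into 1/0
ind <- (counts > 0) + 0

# Same result as t(ind) %*% ind
co <- crossprod(ind)

# Brute-force check: count rows where both columns are non-zero
p <- ncol(counts)
brute <- matrix(0, p, p, dimnames = list(colnames(counts), colnames(counts)))
for (i in seq_len(p)) for (j in seq_len(p))
  brute[i, j] <- sum(counts[, i] > 0 & counts[, j] > 0)

stopifnot(all(co == brute))
co
```

Calling crossprod(counts) on the raw counts instead would give sums of products (for example 13 for the a/a entry here) rather than co-occurrence counts.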
Here it is working on a matrix like the one you describe:
library(Matrix)
nr = 2e6
nc = 70
mm = Matrix(0, nrow = nr, ncol = nc, sparse = TRUE)
# make roughly three 1s per row (duplicated index pairs collapse into a single 1)
set.seed(47)
mm[cbind(sample(nr, size = 3 * nr, replace = TRUE), sample(nc, size = 3 * nr, replace = TRUE))] = 1
system.time({res = t(mm) %*% mm})
# user system elapsed
# 0.836 0.057 0.895
format(object.size(res), units = "Mb")
# [1] "0.1 Mb"
On my laptop the calculation takes less than a second and the result is about 0.1 Mb.
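If the real sparse matrix holds counts larger than 1, it can be reduced to 0/1 without converting it to a dense matrix first. The following is only a sketch under some assumptions: the sizes and the simulated `counts` matrix are made up, the matrix is assumed to be a dgCMatrix with no explicitly stored zeros, and overwriting the `x` slot is one way among several to binarize it. The sizes are kept small so the result can be checked against dense base R code.

```r
library(Matrix)

set.seed(1)
n <- 1000   # hypothetical sizes, small enough for a dense cross-check
p <- 5
i <- sample(n, 3 * n, replace = TRUE)
j <- sample(p, 3 * n, replace = TRUE)
x <- as.numeric(sample(1:4, 3 * n, replace = TRUE))

# sparseMatrix() sums x over duplicated (i, j) pairs, so stored entries are counts >= 1
counts <- sparseMatrix(i = i, j = j, x = x, dims = c(n, p))

# Overwrite every stored value with 1, keeping the sparsity pattern
ind <- counts
ind@x[] <- 1

# Sparse equivalent of t(ind) %*% ind
co <- crossprod(ind)

# Cross-check against a dense computation
dense <- as.matrix(counts)
ref <- crossprod((dense > 0) + 0)
stopifnot(all(as.matrix(co) == ref))
```

The diagonal of the result gives the number of rows in which each column is non-zero, which is a quick sanity check on real data.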