I have a large problem, and a more specific problem that I'm hoping will, once solved, solve the larger one. I would really appreciate it if anyone has any ideas for me to try.
Basically I have a huge sparse matrix (about 300k x 150k, originally a Term-Document matrix created with R's {tm} package) that is saved as a simple triplet matrix using the {slam} package. I'm running a function that loops through sets of terms and subsets the matrix based on those terms. Unfortunately, the subsetting process is prohibitively slow.
In trying to figure out how to subset more quickly, I stumbled on the data.table package, which performed very well in some tests I ran with it. However, when I try to convert my sparse matrix into a data.table, I get
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
I understand that this is because it tries to convert it to a standard matrix first, which R stores as a single vector, and 300k * 150k (4.5e10 cells) is well above .Machine$integer.max (2147483647).
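To make the overflow concrete, here is a minimal sketch of the arithmetic behind the error. It only assumes the dimensions quoted above; the variable names (nr, nc, cells_int, cells_dbl) are chosen for illustration.

```r
# Dimensions from the question, stored as R integers (32-bit signed).
nr <- 300000L
nc <- 150000L

# Largest value an R integer can hold.
.Machine$integer.max                  # 2147483647

# Integer multiplication overflows and yields NA (normally with a warning).
cells_int <- suppressWarnings(nr * nc)
is.na(cells_int)                      # TRUE

# The same product in double precision shows the real number of cells.
cells_dbl <- as.numeric(nr) * as.numeric(nc)
cells_dbl                             # 4.5e+10
cells_dbl > .Machine$integer.max      # TRUE
```

So the failure happens before any memory is allocated: the cell count itself cannot be represented as an integer.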
So my question: does anyone know how to convert a simple triplet matrix into a data.frame or data.table without converting it to a matrix first, thereby avoiding the integer overflow?
If not, does anyone a) have another workaround or b) have any advice on quickly subsetting huge sparse matrices and/or simple triplet matrices?
Below is a reproducible example to mess around with. On my machine, the loop, which subsets each of the first 10 columns, takes about 3 secs. Once we get into looping over hundreds of thousands of columns, that gets prohibitive quickly. Thanks in advance for the help:
require(slam)
# Build a 300k x 150k sparse matrix with 10 million random entries
STM <- simple_triplet_matrix(i = as.integer(runif(10000000,1,300000)),
                             j = as.integer(runif(10000000,1,150000)),
                             v = rep(rnorm(10), 1000000),
                             nrow = 300000,
                             ncol = 150000)
# Time the subsetting of the first 10 columns
start <- Sys.time()
for (i in 1:10) {
  vec <- as.matrix(STM[,i])
}
Sys.time() - start
Side note: if you try STMm <- as.matrix(STM), you get the same overflow error I showed above.
The STM object is actually just a list, so you can pull out its components directly and build the data.table from them:
library(data.table)
STM_DT <- data.table(i = STM$i, j = STM$j, v = STM$v)
This gives:
> STM_DT
               i      j           v
       1: 186598    756  0.34271080
       2: 278329  72334  2.03924976
       3: 178388  32708  1.03925605
       4: 260635 101424  0.05780086
       5: 169321 126202  1.00027529
      ---
 9999996:  96209  90019 -1.09341023
 9999997:  54467  16612 -2.08070273
 9999998: 179029  96906 -0.86197333
 9999999: 153017 148731  0.47765003
10000000: 104145 123291  0.24258613
The conversion is almost instantaneous.
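Once the triplets are in a data.table, one possible way to do the per-column subsetting from the question is to key the table on j and look columns up by value. The following is a sketch on a small stand-in matrix, not a benchmark: the sizes and the names small_stm, small_dt, col3 and dense_col3 are made up for illustration, and it assumes each (i, j) pair occurs at most once.

```r
library(slam)
library(data.table)

# Small stand-in for the large matrix (hypothetical sizes).
set.seed(1)
n_row <- 1000L
n_col <- 500L
n_nz  <- 20000L

# Draw distinct cells so that no (i, j) pair is duplicated.
cells <- sample.int(n_row * n_col, n_nz)
small_stm <- simple_triplet_matrix(
  i = (cells - 1L) %% n_row + 1L,
  j = (cells - 1L) %/% n_row + 1L,
  v = rnorm(n_nz),
  nrow = n_row, ncol = n_col
)

# Build the long-format table straight from the triplet components.
small_dt <- data.table(i = small_stm$i, j = small_stm$j, v = small_stm$v)

# Key on the column index so lookups by j use binary search.
setkey(small_dt, j)

# All non-zero entries of column 3.
col3 <- small_dt[.(3L), nomatch = NULL]

# Rebuild a dense vector for that one column only.
dense_col3 <- numeric(n_row)
dense_col3[col3$i] <- col3$v
```

Only the selected column is ever made dense, so the full 300k x 150k matrix never has to be materialised. Whether this is fast enough at full scale would need to be timed on the real data.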