Tags: r, dataframe, data.table, tm

Convert huge sparse matrix into data.table for faster subsetting in R


I have a large problem and a more specific one that I'm hoping will, once solved, solve the larger one. I would really appreciate any ideas to try.

Basically, I have a huge sparse matrix (about 300k x 150k, originally a term-document matrix created with R's {tm} package) that is stored as a simple triplet matrix using the {slam} package. I'm running a function that loops through sets of terms and subsets the matrix based on those terms. Unfortunately, the subsetting is prohibitively slow.

In trying to figure out how to subset more quickly, I stumbled on the data.table package, which performed very well in some tests I ran with it. However, when I try to convert my sparse matrix into a data.table, I get

Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow

I understand this happens because the conversion first builds a standard matrix, which R stores as a single vector, and 300k * 150k is well above .Machine$integer.max.
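The overflow is easy to reproduce in isolation. A minimal sketch in base R, using the dimensions above (the variable names are only for illustration):

```r
# Dimensions from the question, stored as R integers
nr <- 300000L
nc <- 150000L

# Largest value an R integer can hold
.Machine$integer.max

# Integer multiplication overflows and yields NA (with a warning, silenced here)
cells_int <- suppressWarnings(nr * nc)
is.na(cells_int)

# As doubles the product is representable: 4.5e10 cells,
# far more than a dense matrix of this shape could reasonably hold in memory
cells_dbl <- as.numeric(nr) * as.numeric(nc)
cells_dbl > .Machine$integer.max
```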

So my question: does anyone know how to convert a simple triplet matrix into a data.frame or data.table without converting it to a matrix first, thereby avoiding the integer overflow?

If not, does anyone a) have another workaround or b) have any advice on quickly subsetting huge sparse matrices and/or simple triplet matrices?

Below is a reproducible example to mess around with. On my machine, the loop, which subsets each of the first 10 columns, takes about 3 seconds. Once we get into looping over hundreds of thousands of iterations, that gets prohibitive quickly. Thanks in advance for the help:

require(slam)
STM <- simple_triplet_matrix(i = as.integer(runif(10000000,1,300000)), 
                  j = as.integer(runif(10000000,1,150000)),
                  v = rep(rnorm(10), 1000000),
                  nrow = 300000,
                  ncol = 150000)

start <- Sys.time()
for (i in 1:10) {
  vec <- as.matrix(STM[,i])
}
Sys.time() - start

Sidenote: notice that if you try STMm <- as.matrix(STM) you get the same overflow error I showed above.


Solution

  • The STM object is actually just a list, so you can pull its components out directly:

    library(data.table)
    STM_DT <- data.table(i = STM$i, j = STM$j, v = STM$v)

    This gives:

    > STM_DT
                   i      j           v
           1: 186598    756  0.34271080
           2: 278329  72334  2.03924976
           3: 178388  32708  1.03925605
           4: 260635 101424  0.05780086
           5: 169321 126202  1.00027529
          ---                          
     9999996:  96209  90019 -1.09341023
     9999997:  54467  16612 -2.08070273
     9999998: 179029  96906 -0.86197333
     9999999: 153017 148731  0.47765003
    10000000: 104145 123291  0.24258613
    

    The conversion is almost instantaneous, since no dense matrix is ever allocated.
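
To get back what the loop in the question computes, one option is to key the table on j and rebuild each column on demand. This is only a sketch on a small stand-in matrix: `dense_col` is a hypothetical helper, not part of {slam} or {data.table}, and it assumes each (i, j) pair appears at most once in the triplet data and a data.table version that supports `nomatch = NULL`.

```r
library(slam)
library(data.table)

set.seed(1)
# Small stand-in for the question's matrix (sizes chosen arbitrarily)
n_row <- 1000L
n_col <- 500L
n_nz  <- 20000L

# Sample distinct cells so that no (i, j) pair is duplicated
idx <- sample.int(n_row * n_col, n_nz)
STM <- simple_triplet_matrix(i = (idx - 1L) %% n_row + 1L,
                             j = (idx - 1L) %/% n_row + 1L,
                             v = rnorm(n_nz),
                             nrow = n_row, ncol = n_col)

STM_DT <- data.table(i = STM$i, j = STM$j, v = STM$v)
setkey(STM_DT, j)  # sort rows by column index so lookups on j use binary search

# Hypothetical helper: rebuild column `col` as a dense numeric vector
dense_col <- function(dt, col, n_row) {
  sub <- dt[.(col), nomatch = NULL]  # keyed lookup: rows whose j equals col
  out <- numeric(n_row)              # zeros for every cell that is not stored
  out[sub$i] <- sub$v
  out
}

# Check against the slam-based subsetting used in the question
vec_dt   <- dense_col(STM_DT, 3L, n_row)
vec_slam <- as.vector(as.matrix(STM[, 3L]))
all.equal(vec_dt, vec_slam)
```

Keying on j suits column lookups like the one in the question; for row lookups, setkey(STM_DT, i) would be the analogous choice.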