Search code examples
rsvdr-bigmemory

Problems with SVD of a large matrix using bigmemory and irlba


I am currently trying to implement a SVD of a very large matrix using bigmemory and irlba. As far as I understand I have to adjust the mult command in the irlba package, which I have done like this:

mult <- function(A, B, transpose=FALSE) {
  if(is.null(dim(B))) B <- cbind(B)
  if(transpose)
    return(cbind((t(B) %*% A)[]))
  cbind((A %*% B)[])
}

However, it does not work to run an SVD on a bigmatrix using irlba:

irlbaObject <- irlba(big, nv = 10, mult = mult)

For replicability here is an example of a big matrix I want to do a SVD on:

big <- file("big.txt", open = "a")
replicate(20, {
  x <- matrix(rnorm(100 * 100), nrow = 10)
  write.table(x, file  = 'big.txt', append = TRUE,
              row.names = FALSE, col.names = FALSE)
})

big <- read.big.matrix("big.txt", separated = FALSE,
                        type = "double",
                        backingfile = "big.bk",
                        backingpath = "/tmp",
                        descriptorfile = "big.desc")

This is the error message I get:

Error in A %*% B : requires numeric/complex matrix/vector arguments
Called from: cbind((A %*% B)[])

Does anyone have an idea how to avoid this error?


Solution

  • This should work:

    library(bigalgebra)
    library(irlba)
    
    ## --> CHANGES HERE <--
    setMethod("%*%", signature(x = "big.matrix", y = "numeric"),
              function(x, y) x %*% as.matrix(y))
    setMethod("%*%", signature(x = "numeric", y = "big.matrix"),
              function(x, y) t(x) %*% y)
    mult <- function(A, B) (A %*% B)[]
    
    # Repdata
    x <- matrix(rnorm(20 * 100 * 100), nrow = 20 * 10)
    big <- as.big.matrix(x)
    # Computation
    irlbaObject <- irlba(big, nv = 10, mult = mult)
    # Verification
    svd <- svd(x, nu = 10, nv = 10)
    plot(irlbaObject$u, svd$u)
    plot(irlbaObject$v, svd$v)
    

    Note 1: I think the algo in irlba has changed and now use only matrix-vector multiplications.

    Note 2: mult is a deprecated argument (it will disappear in next versions).

    Note 3: I'm not sure this solution will be fast. If you want a fast algorithm to compute partial SVD, try function big_randomSVD of package bigstatsr (disclaimer: I'm the author).