
Build bigsparser::SFBM matrix iteratively


I wish to build a matrix with 10^7 columns and 2500 rows. Since this is too large for my computer, I thought I could create the matrix iteratively. I would like to use the bigsparser package for storing the matrix on disk.

Here is how I create the first matrix:

library(bigsparser)
library(data.table)
library(Matrix)
nvars <- 10000000  # columns
ncons <- 10        # rows
n_nonzero <- round(0.02*nvars*ncons) # approximate; there may actually be fewer values after deduplication
set.seed(13)

# the first table
Amat <- data.frame(
    i=sample.int(ncons, n_nonzero, replace=TRUE),
    j=sample.int(nvars, n_nonzero, replace=TRUE),
    x=runif(n_nonzero)
)
setDT(Amat)
Amat <- unique(Amat, by=c("i", "j"))
AmatSparse <- sparseMatrix(
    i=Amat[,get("i")], j=Amat[,get("j")], x=Amat[,get("x")],
    dims=c(2500, 10^7L)
)
AmatSFBM <- as_SFBM(AmatSparse, backingfile="sparsemat", compact = FALSE)

As you can see, I know the dimensions of the final matrix beforehand and have set them accordingly.

Now I want to add more rows, like this:

for (iter in 2:250) {
    Amat <- data.frame(
        i=sample.int(ncons, n_nonzero, replace=TRUE),
        j=sample.int(nvars, n_nonzero, replace=TRUE),
        x=runif(n_nonzero)
    )
    setDT(Amat)
    Amat <- unique(Amat, by=c("i", "j"))
    Amat[,i:=i+(iter-1)*ncons]  # shift row indices into this iteration's block of ncons rows

    # this does not work:
    AmatSFBM[Amat[,get("i")], Amat[,get("j")]] <- Amat[,get("x")]
}

However, the [<- operator does not seem to work for SFBM objects.

Is there any way to build an SFBM object other than with as_SFBM from a sparse matrix? For example,

  • can I add two SFBM objects of the same dimensions, or
  • can I create an SFBM object from a CSV file or similar?

Both would be fine.


Solution

  • The SFBM class has a method $add_columns() that appends columns to an existing matrix, so you can grow it iteratively in column blocks. When you are memory constrained, it also helps to avoid unnecessary intermediate assignments. In the code below, I first write a function that generates the component sparse matrices, then create a starting matrix, and finally append the component matrices one at a time. I've limited the example to 9 iterations, but you can set it to 249 to get your full matrix with 10^7 columns.

    library(bigsparser)
    library(data.table)
    library(Matrix)
    
    set.seed(13)
    
    # Function to generate component matrix
    generate_sparse_mat <- \(nrow = 2500, ncol = 40000, n_nonzero = round(0.02*nrow*ncol)) {
      data.table(
        i = sample.int(nrow, n_nonzero, replace = TRUE),
        j = sample.int(ncol, n_nonzero, replace = TRUE),
        x = runif(n_nonzero)
      ) |>
        unique(by = c("i", "j")) |>
        as.list() |>
        c(dims = list(c(nrow, ncol))) |>
        do.call(what = sparseMatrix)
    }
    
    # Starting matrix
    mat <- generate_sparse_mat() |> 
      as_SFBM(compact = FALSE)
    
    # Iteratively add columns
    for (k in seq_len(9)) mat$add_columns(generate_sparse_mat(), offset_i = 0)
    
    mat
    #> A Sparse Filebacked Big Matrix with 2500 rows and 400000 columns.
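
The same pattern should also cover the CSV part of the question. The following is only a minimal sketch under a few assumptions: each CSV file holds one column block as (i, j, x) triplets with block-local column indices, every block has the same number of rows, and offset_i = 0 means "no row shift" as in the loop above. The file names, n_rows, cols_per_file and read_block() are hypothetical and exist only for this illustration; the small files are written to a temporary directory to stand in for data you already have on disk.

```r
library(bigsparser)
library(data.table)
library(Matrix)

set.seed(42)
n_rows <- 20L        # rows of the full matrix (hypothetical small size)
cols_per_file <- 5L  # columns stored in each CSV file (hypothetical)

# Write three small triplet files (i, j, x) to a temporary directory,
# standing in for CSV data that already exists on disk
csv_files <- file.path(tempdir(), sprintf("block_%d.csv", 1:3))
for (f in csv_files) {
  triplets <- unique(
    data.table(
      i = sample.int(n_rows, 30L, replace = TRUE),
      j = sample.int(cols_per_file, 30L, replace = TRUE),
      x = runif(30L)
    ),
    by = c("i", "j")
  )
  fwrite(triplets, f)
}

# Read one file and turn it into a sparse block with fixed dimensions
read_block <- function(path) {
  triplets <- fread(path)
  sparseMatrix(
    i = triplets$i, j = triplets$j, x = triplets$x,
    dims = c(n_rows, cols_per_file)
  )
}

# The first file creates the SFBM; the remaining files are appended as columns
from_csv <- as_SFBM(read_block(csv_files[1]), backingfile = tempfile(), compact = FALSE)
for (f in csv_files[-1]) from_csv$add_columns(read_block(f), offset_i = 0)

# Dimensions of the assembled matrix
c(from_csv$nrow, from_csv$ncol)
```

Only one block is held in memory at a time, so the peak memory use depends on the size of a single CSV file rather than on the full matrix.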