Search code examples
rperformancerowsparse-matrix

How to insert row in a (sparse) matrix efficiently?


I have a huge sparse matrix and need to insert a row at a specific row index. I essentially applied the technique suggested here, but that's very inefficient since it copies the original matrix.

Here's the code in a nutshell, the scope is to generate M2 efficiently (ideally, directly replacing M, so that M2 is not even needed):

library(Matrix)

M  = as(cbind(1:4,1:4),"sparseMatrix") # original sparse matrix
M # view
v  = c(2,4) # row indices where new rows should be inserted
M2 = M # copy matrix to keep sparse class
for(i in 1:length(v)) M2 = rbind(M2,rep(0, ncol(M2))) # insert new rows at end
M2[-v,] = M # get each row in v 2 times ## <- THIS takes very long
M2[v,] = 0 # populate the newly inserted rows (e.g. here: 0)
M2 # view

Ideally I'm looking for a function such as append for vectors or add_row for a tibble.

EDIT: I notice that not only is the procedure above inefficient, it's also not what I'm looking for, because it adds rows at indices 2 and 4 of the new (resulting) matrix M2, while I am looking for adding rows at indices 2 and 4 of the old (initial) matrix, i.e. the end result as of now is:

[1,] 1 1
[2,] . .
[3,] 2 2
[4,] . .
[5,] 3 3
[6,] 4 4

i.e. rows were added at indices 2 and 4 of the new matrix. What I am looking for is:

[1,] 1 1
[2,] . .
[3,] 2 2
[4,] 3 3
[5,] . .
[6,] 4 4

i.e. new rows have been added before indices 2 and 4 of the old matrix. In his +1 answer @Ben Bolker correctly interpreted this as such. :)


Solution

  • I think this should work reasonably efficiently: it manipulates the underlying row indices and dimensions of a matrix in sparse (dgCMatrix or dgTMatrix) format.

    add_rows <- function(M,v) {
        oldind <- seq(dim(M)[1])-1L   ## all rows (0-indexed)
        ## new row indices: count how many rows are inserted before a given row
        ## (maybe a clearer/better/more efficient way to do this?)
        newind <- oldind + as.integer(rowSums(outer(oldind,v,">=")))
        ## modify dimensions
        M@Dim <- M@Dim + c(length(v),0L)
        ## modify row indices (1L accounts for 0 vs 1-indexing)
        M@i <- newind[M@i+1L]
        M
    }
    

    The part most likely to be a (memory/computation) bottleneck is the computation of the new row indices. I did this by constructing an index of all row numbers. The version below only does the newind computation for occupied rows, at the cost of additional calls to unique() and match().

    add_rows_2 <- function(M,v) {
        oldind <- unique(M@i)
        ## new row indices
        newind <- oldind + as.integer(rowSums(outer(oldind,v,">=")))
        ## modify dimensions
        M@Dim <- M@Dim + c(length(v),0L)
        M@i <- newind[match(M@i,oldind)]
        M
    }