r sparse-matrix

Same sparse matrix, different object sizes


I was working on creating some adjacency matrices and stumbled on a weird issue.

I have a matrix full of 1s and 0s. I want to multiply its transpose by it (t(X) %*% X) and then run some further computations. Since the routine started to get really slow, I converted the matrix to a sparse format, which was noticeably faster.

However, the resulting sparse matrix is about twice as large depending on when I convert to the sparse format.

Here is a generic example that reproduces the issue:

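library(Matrix)  # provides Matrix() and the sparse matrix classes

set.seed(666)
nr = 10000
nc = 1000

bb = matrix(rnorm(nc * nr), ncol = nc, nrow = nr)
# Turn the random values into a 0/1 matrix
bb = apply(bb, 2, function(x) as.numeric(x > 0))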

# Slow and unintelligent method
op1  = t(bb) %*% bb
op1  = Matrix(op1, sparse = TRUE) 

# Fast method
B   = Matrix(bb, sparse = TRUE) 
op2 = t(B) %*% B

# weird
identical(op1, op2) # returns FALSE
object.size(op2)
#12005424 bytes
object.size(op1) # almost half the size
#6011632 bytes

# now it works...
ott1 = as.matrix(op1)
ott2 = as.matrix(op2)

identical(ott1, ott2) # returns TRUE

Then I got curious. Does anybody know why this happens?


Solution

  • The class of op1 is dsCMatrix, whereas op2 is a dgCMatrix. dsCMatrix is a class for symmetric matrices, so it only needs to store one triangle plus the diagonal (roughly half as much data as the full matrix). dgCMatrix is the general class and stores every entry.

    The Matrix() function, which converts a dense matrix to a sparse one, is smart enough to choose a symmetric class for symmetric matrices, hence the saving. You can see this in the code for the function Matrix, which explicitly performs the test isSym <- isSymmetric(data).

    %*%, on the other hand, is optimised for speed and does not perform this check, so its result stays in the general class even though its values are symmetric.
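A minimal sketch of how one might confirm this and recover the smaller representation, assuming the Matrix package is installed. The dimensions are much smaller than in the question so it runs quickly, and op2_sym is just an illustrative name. forceSymmetric() is a Matrix function that keeps one triangle and returns a symmetric class; it does not check symmetry itself, so it should only be used when the result is known to be symmetric, as t(B) %*% B is.

```r
library(Matrix)

set.seed(666)
# Small 0/1 matrix, built the same way as in the question
bb <- matrix(as.numeric(rnorm(200 * 20) > 0), nrow = 200, ncol = 20)

# Dense product first, then conversion: Matrix() tests for symmetry
op1 <- Matrix(t(bb) %*% bb, sparse = TRUE)

# Conversion first, then sparse product: %*% does not test for symmetry
B   <- Matrix(bb, sparse = TRUE)
op2 <- t(B) %*% B

# Compare the classes chosen for the two results
print(class(op1))
print(class(op2))

# The values of op2 are symmetric even though its class is general
print(isSymmetric(op2))

# Store only one triangle by switching to a symmetric class
op2_sym <- forceSymmetric(op2)
print(class(op2_sym))

# Sizes before and after the conversion
print(object.size(op2))
print(object.size(op2_sym))

# The numeric content is unchanged
print(all.equal(as.matrix(op1), as.matrix(op2_sym), check.attributes = FALSE))
```

If the size difference matters downstream, converting once with forceSymmetric() right after the product keeps the fast sparse multiplication while storing only half of the entries.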