r, sparse-matrix, similarity, text2vec

Why do I get two very different performance results when creating a Jaccard similarity matrix from two sparse matrices that seem to be the same?


I'm confounded by a strange performance issue when I try to create a Jaccard similarity matrix using sim2() from the text2vec package. I have a sparse matrix [210,000 x 500] for which I'd like to obtain the Jaccard similarity matrix mentioned above. When I use the matrix directly in the sim2 function, it runs for over 30 minutes and culminates in an error message.

This is the R script I use:

library(text2vec)
JaccSim <- sim2(my_sparse_mx, method = "jaccard", norm = "none")  # doesn't work

This is the error message I get after half an hour of running the script:

Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92.

However, when I subset another sparse matrix from the original matrix, using all the rows, and run the script, it takes only 3 minutes and the Jaccard similarity matrix (which is itself a sparse matrix) is generated successfully.

spmx_1 <- Matrix(my_sparse_mx[1:210000], sparse = TRUE)
JaccSim <- sim2(spmx_1, method = "jaccard", norm = "none") #works!

This one runs successfully. What is going on here? All I'm doing is subsetting my sparse matrix into another matrix (using all rows of the original matrix) and using that second sparse matrix instead.

To clarify, my_sparse_mx has 210,000 rows; I created it with that many rows using the following:

my_sparse_mx <- Matrix(0, nrow = 210000, ncol = 500, sparse = TRUE)

and then filled it up with 1's accordingly through some other process. Also, nrow(my_sparse_mx) still returns 210,000.

I'd like to know why this is happening.


Solution

  • The sim2 function calculates pairwise Jaccard similarity, which means the result matrix in your case will be 210,000 x 210,000. The sparsity of this result depends on the data, and in some cases it won't be a problem. I guess in your case it is quite dense and can't be handled by the underlying Matrix routines.

    Your subsetting as shown above is not correct: you missed a comma. my_sparse_mx[1:210000] selects just the first 210,000 elements of the matrix (as a vector), not all 210,000 rows; the row subset would be my_sparse_mx[1:210000, ].
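To see why a 210,000 x 210,000 result can blow up, a quick back-of-the-envelope estimate (this assumes, for illustration, that the result were stored densely as 8-byte doubles; the exact Cholmod limit that triggers the error depends on the internals of the Matrix package):

```r
n <- 210000
entries <- n^2              # 4.41e10 pairwise similarities
gib <- entries * 8 / 2^30   # size of a dense double matrix, in GiB
gib                         # roughly 328.6 GiB
```

Even a sparse result avoids that only if most pairs have zero overlap; with enough nonzero similarities, the intermediate sparse structures can still exceed what the underlying routines accept, hence the 'problem too large' error.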
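The indexing difference can be demonstrated with a small base-R matrix standing in for my_sparse_mx (the same single-index vs. row-index semantics apply to sparse Matrix objects):

```r
m <- matrix(0, nrow = 6, ncol = 4)

v   <- m[1:6]    # missing comma: the first 6 elements, returned as a plain vector
sub <- m[1:6, ]  # with comma: the first 6 rows, still a 6 x 4 matrix

dim(v)    # NULL (a vector, not a matrix)
dim(sub)  # 6 4
```

So spmx_1 in the question was not the 210,000 x 500 matrix the asker thought it was, which is why the second call behaved so differently.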