I'm confounded by a strange performance issue when I try to create a Jaccard similarity matrix using sim2() from the text2vec package. I have a sparse matrix [210,000 x 500] for which I'd like to obtain a Jaccard similarity matrix, as mentioned above. When I use the matrix directly in the sim2 function, it runs for over 30 minutes and culminates in an error message.
This is the R script I use:
library(text2vec)
JaccSim <- sim2(my_sparse_mx, method = "jaccard", norm = "none") # doesn't work
This is the error message I get after half an hour of running the script:
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92.
However, when I subset another sparse matrix from the original matrix (using, I thought, all the rows) and run the script on that, it takes only 3 minutes and the Jaccard similarity matrix (which is itself a sparse matrix) is generated successfully.
spmx_1 <- Matrix(my_sparse_mx[1:210000], sparse = TRUE)
JaccSim <- sim2(spmx_1, method = "jaccard", norm = "none") #works!
This one runs successfully. What is going on here? All I'm doing is subsetting my sparse matrix into another matrix (using all rows of the original, as far as I can tell) and using the second sparse matrix instead.
To clarify, my_sparse_mx has 210,000 rows. I created it with that many rows using the following:
my_sparse_mx <- Matrix(0, nrow = 210000, ncol = 500, sparse = TRUE)
and then filled it with 1's accordingly through some other process. Also, when I run nrow(my_sparse_mx) I still get 210,000.
I'd like to know why this is happening.
The sim2 function calculates pairwise Jaccard similarity, which means the result matrix in your case will be 210,000 x 210,000. The sparsity of this resulting matrix depends on the data, and in some cases it won't be a problem. I guess in your case it is quite dense and can't be handled by the underlying Matrix routines.
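One common workaround (a sketch of my own, not something from your script; the block size is an assumption you'd tune to your memory) is to compute the similarity in row blocks against the full matrix, so no single result exceeds block_size x 210,000:

```r
library(text2vec)
library(Matrix)

block_size <- 10000                      # hypothetical; adjust to available RAM
n <- nrow(my_sparse_mx)
starts <- seq(1, n, by = block_size)

sim_blocks <- lapply(starts, function(i) {
  j <- min(i + block_size - 1, n)
  # sim2(x, y, ...) computes similarities between rows of x and rows of y
  sim2(my_sparse_mx[i:j, , drop = FALSE], my_sparse_mx,
       method = "jaccard", norm = "none")
})
# each element of sim_blocks is a (block_size x 210000) sparse matrix
```

You can then filter or threshold each block before keeping it, which is usually the only way such a large pairwise result stays tractable.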
Your subsetting as shown above is not correct: you missed the comma, so you subset just the first 210,000 elements rather than all rows.
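To illustrate the difference (a sketch, reusing the variable names from your question):

```r
library(Matrix)

# With the comma: rows 1:210000, all 500 columns
spmx_rows <- my_sparse_mx[1:210000, ]            # 210000 x 500

# Without the comma: the first 210000 *elements* (column-major order),
# which Matrix() then turns into a single-column sparse matrix
spmx_1 <- Matrix(my_sparse_mx[1:210000], sparse = TRUE)
dim(spmx_1)                                      # 210000 x 1
```

So your "fast" run was computing Jaccard similarity over a one-column matrix, not over your real 500-column data.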