Tags: r, matrix, sparse-matrix, tm, cosine-similarity

Calculate Cosine Similarity between two documents in TermDocumentMatrix of tm Package in R


My task is to compare documents in a corpus by cosine similarity. I use the tm package and obtain the TermDocumentMatrix (in tf-idf form) tdm. The task should be as simple as stated here:

d <- dist(tdm, method="cosine")

or

cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))

However, the number of terms in my tdm is quite large, more than 120,000 (with around 50,000 documents). It is beyond the capability of R to handle such a matrix. My RStudio crashed several times.

My questions are: 1) how can I handle such a large matrix and get the pairwise (120,000 * 120,000) cosine similarity? 2) If that is impossible, how can I get the cosine similarity of just two documents at a time? Suppose I want the similarity between documents 10 and 21; then something like

sim10_21<-cosine_similarity(tdm, d1=10,d2=21)

If tdm were a plain matrix, I could do the calculation on tdm[,c(10,21)]. However, converting tdm to a matrix is exactly what I cannot handle. My question ultimately boils down to how to do matrix-like calculations on tdm.


Solution

  • A 120,000 x 120,000 matrix * 8 bytes (double float) = 115.2 gigabytes. This isn't necessarily beyond the capability of R, but you do need at least that much memory, regardless of what language you use. Realistically, you'll probably want to write to disk, either using a database such as SQL (e.g. the RSQLite package), or, if you plan to use only R in your analysis, the "ff" package for storing/accessing large matrices on disk.

    You could compute this iteratively, block by block, and multithread it to improve the speed of the calculation.

    To find the distance between two docs, you can use dist() from the proxy package (base stats::dist() does not support method = "cosine"), converting the two columns to dense rows first:

    library(proxy)
    dist(t(as.matrix(tdm[, 1])), t(as.matrix(tdm[, 2])), method = "cosine")
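For repeated pairwise lookups, the `cosine_similarity(tdm, d1=10, d2=21)` call the asker sketched can be approximated in base R. The helper below is illustrative (its name and signature are this sketch's own, not part of tm); it materializes only the two columns involved, so memory use stays tiny regardless of corpus size:

```r
# Illustrative helper (name/signature are not from tm): cosine similarity of
# two document columns of a TermDocumentMatrix, converting only those two
# columns to dense vectors.
cosine_pair <- function(tdm, d1, d2) {
  v1 <- as.vector(as.matrix(tdm[, d1]))  # dense copy of column d1 only
  v2 <- as.vector(as.matrix(tdm[, d2]))  # dense copy of column d2 only
  sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
}

# e.g. sim10_21 <- cosine_pair(tdm, 10, 21)
```

Because `as.matrix()` is applied after subsetting, the full term-document matrix is never densified.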
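The iterative approach mentioned above can be sketched as follows, staying in slam's sparse triplet form (which is what tm's TermDocumentMatrix uses, as in the question's own `crossprod_simple_triplet_matrix` line). The block size and output layout here are illustrative assumptions, not fixed choices:

```r
library(slam)

# Sketch: compute the document-by-document cosine similarity matrix in
# column blocks, writing each block to disk, so only a (block_size x n)
# slice is ever held in memory.
block_cosine <- function(tdm, block_size = 1000, out_dir = "sim_blocks") {
  dir.create(out_dir, showWarnings = FALSE)
  norms <- sqrt(col_sums(tdm^2))            # per-document Euclidean norms
  n <- ncol(tdm)
  for (s in seq(1, n, by = block_size)) {
    e <- min(s + block_size - 1, n)
    # cross-product of a column block against all columns (dense result,
    # but only block_size rows of it)
    block <- crossprod_simple_triplet_matrix(tdm[, s:e], tdm)
    sim <- block / (norms[s:e] %o% norms)   # normalize to cosine similarity
    saveRDS(sim, file.path(out_dir, sprintf("block_%d_%d.rds", s, e)))
  }
}
```

Each loop iteration is independent, so the loop body is also a natural unit to parallelize (e.g. with the parallel package).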
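The "ff" route mentioned above can look like this. A minimal sketch, with the dimension and file name purely illustrative (a real run would use the document count, roughly 50,000, giving a ~20 GB file on disk):

```r
library(ff)

# Disk-backed matrix: the full similarity matrix lives in a file, not RAM.
# n is kept small here for illustration only.
n <- 100
sim <- ff(vmode = "double", dim = c(n, n),
          filename = file.path(tempdir(), "cosine_sim.ff"))

# blocks computed in memory can then be assigned into the file-backed matrix:
sim[1:10, 1:10] <- diag(10)
```

Subsetting with `sim[rows, cols]` reads back only the requested block, so later analysis never needs the whole matrix in memory either.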