Tags: r, distance, sparse-matrix, bigdata, hierarchical-clustering

Hierarchical Clustering Large Sparse Distance Matrix R


I am trying to run hierarchical clustering with fastcluster on a very large set of distances, but I am running into a problem.

I have a very large CSV file of similarities between keywords: about 91 million rows (so a for loop takes too long in R) covering about 50,000 unique keywords. When I read it into a data.frame, it looks like this:

> df   
kwd1 kwd2 similarity  
a  b  1  
b  a  1  
c  a  2  
a  c  2 

It is a sparse list and I can convert it into a sparse matrix using sparseMatrix():

> myMatrix 
  a b c  
a . . .
b 1 . .
c 2 . .
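
A minimal sketch of that conversion for the example above, assuming the Matrix package and keeping only the lower triangle because every pair appears twice. The names `keywords`, `i`, `j` and `keep` are illustrative and not necessarily how the matrix above was built:

```r
library(Matrix)  # recommended package shipped with standard R installations

df <- data.frame(
  kwd1 = c("a", "b", "c", "a"),
  kwd2 = c("b", "a", "a", "c"),
  similarity = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Map each keyword to a row/column index
keywords <- sort(unique(c(df$kwd1, df$kwd2)))
i <- match(df$kwd1, keywords)
j <- match(df$kwd2, keywords)

# Each pair is listed in both directions; keep the lower-triangle copy only
keep <- i > j

myMatrix <- sparseMatrix(
  i = i[keep], j = j[keep], x = df$similarity[keep],
  dims = c(length(keywords), length(keywords)),
  dimnames = list(keywords, keywords)
)
myMatrix
```

Because `match()` is vectorized, no explicit loop over the rows is needed.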

However, when I try to turn it into a dist object with as.dist(), R fails with the error 'problem is too large'. I have read the other dist questions here, but the code suggested there does not work for the example data set above.

Thanks for any help!


Solution

  • Using a sparse matrix seems like a good idea at first, but I think there is a problem with that approach: your missing distances will be coded as 0s, not as NAs (see Creating (and Accessing) a Sparse Matrix with NA default entries). When clustering, a zero dissimilarity means something entirely different from a missing one.
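
    To make the zero-versus-NA issue concrete, here is a small sketch using the three-keyword example from the question. It assumes the Matrix package; the object names `m` and `dense` are illustrative:

    ```r
    library(Matrix)

    keywords <- c("a", "b", "c")
    m <- sparseMatrix(
      i = c(2, 3), j = c(1, 1), x = c(1, 2),
      dims = c(3, 3), dimnames = list(keywords, keywords)
    )

    # The pair (c, b) was never observed, yet it reads back as 0, not NA
    m["c", "b"]

    # Converting to a dense matrix keeps the 0, so as.dist() would treat
    # the unobserved pair as having zero dissimilarity
    dense <- as.matrix(m)
    dense["c", "b"]
    anyNA(dense)
    ```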

    What you need, then, is a dist object with NAs for your missing dissimilarities. Unfortunately, your problem is so large that this object alone would require too much memory:

    d <- dist(x = rep(NA_integer_, 50000))
    # Error: cannot allocate vector of size 9.3 Gb
    

    And that is only the input. Even on a 64-bit machine with a lot of memory, I am not sure the clustering algorithm itself would finish without choking or running indefinitely.
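
    For reference, the 9.3 Gb in the error message can be reproduced by hand. This is a back-of-the-envelope sketch, assuming a dist object stores one 8-byte double per pair of items and that R reports sizes in units of 1024^3 bytes:

    ```r
    n <- 50000                  # number of unique keywords
    n_pairs <- n * (n - 1) / 2  # entries in the lower triangle of a dist object
    bytes <- n_pairs * 8        # assuming 8 bytes per value (double precision)
    gib <- bytes / 2^30         # convert to units of 1024^3 bytes
    round(gib, 1)
    ```

    That is roughly 1.25 billion pairwise entries before any clustering work starts.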

    You should consider breaking your problem into smaller pieces.
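
    One possible way to do this, offered as a suggestion rather than the only option, is to split the keywords into connected components of the similarity graph: keywords in different components share no recorded similarity, so each component can be handled on its own. Below is a toy base R sketch with made-up data. The loop would be far too slow for 91 million rows, where a dedicated graph package would be the practical choice:

    ```r
    # Hypothetical toy data: two groups of keywords with no link between them
    df <- data.frame(
      kwd1 = c("a", "c", "d"),
      kwd2 = c("b", "a", "e"),
      similarity = c(1, 2, 3),
      stringsAsFactors = FALSE
    )

    keywords <- sort(unique(c(df$kwd1, df$kwd2)))
    i <- match(df$kwd1, keywords)
    j <- match(df$kwd2, keywords)

    # Union-find: parent[k] points towards the representative of k's group
    parent <- seq_along(keywords)
    find_root <- function(k) {
      while (parent[k] != k) k <- parent[k]
      k
    }
    for (e in seq_along(i)) {
      ri <- find_root(i[e])
      rj <- find_root(j[e])
      if (ri != rj) parent[rj] <- ri  # merge the two groups
    }

    # Component label for every keyword, then one piece per component
    comp <- vapply(seq_along(keywords), find_root, integer(1))
    pieces <- split(keywords, comp)
    df_pieces <- split(df, comp[i])  # rows of df grouped by component
    pieces
    ```

    Each element of `df_pieces` is a much smaller edge list. Whether the resulting pieces are small enough to cluster depends on how connected the real data is.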