Tags: r, cluster-analysis

dist() function in R: vector size limitation


I was trying to draw a hierarchical clustering of some samples (40 of them) over some features (genes). I have a big table with 500k rows and 41 columns (the first one is the name), and when I tried

d<-dist(as.matrix(file),method="euclidean")

I got this error

Error: cannot allocate vector of size 1101.1 Gb

How can I get around this limitation? I googled it and came across the ff package in R, but I don't quite understand whether it could solve my issue.

Thanks!


Solution

  • Generally speaking, hierarchical clustering is not the best approach for very large datasets.

    In your case, however, the problem is different. If you want to cluster samples, your data is oriented the wrong way. dist() computes distances between the rows of its input, so with genes as rows it tries to build a distance matrix over all 500k genes instead of the 40 samples. Observations (samples) should be the rows, and gene expression (or whatever kind of data you have) the columns.

    Let's assume you have data like this:

    data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))
    

    What you want to do is:

     # Create transposed data matrix
     data.matrix.t <- t(as.matrix(data))
    
     # Create distance matrix
     dists <- dist(data.matrix.t)
    
     # Clustering
     hcl <- hclust(dists)
    
     # Plot
     plot(hcl)
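
The difference the transpose makes can be checked with a little arithmetic. A dist object stores one double (8 bytes) per pair of rows, so n rows need n * (n - 1) / 2 * 8 bytes. The sketch below is only an illustration; dist_bytes is a hypothetical helper name, and the row counts are the rounded figures from the question.

```r
# Bytes needed for the distances dist() stores: one double per pair of rows.
# dist_bytes is a hypothetical helper used only for this illustration.
dist_bytes <- function(n_rows) {
  n_pairs <- n_rows * (n_rows - 1) / 2
  n_pairs * 8
}

genes_as_rows   <- dist_bytes(500000)  # original orientation: 500k rows
samples_as_rows <- dist_bytes(40)      # after t(): 40 rows

cat(sprintf("genes as rows:   %.1f GiB\n", genes_as_rows / 1024^3))
cat(sprintf("samples as rows: %.0f bytes\n", samples_as_rows))
```

With genes as rows the result is on the order of a terabyte, the same scale as the figure in the error message. With samples as rows it is a few kilobytes.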
    

    NOTE

    Keep in mind that Euclidean distances can be rather misleading for high-dimensional data such as this, where each sample is described by hundreds of thousands of features.
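
If Euclidean distance turns out to be a poor fit, one alternative that is often used for expression data is a correlation-based distance. The following is a minimal sketch under assumptions, not part of the original answer: it uses a small random stand-in matrix (expr, 1000 genes by 40 samples), 1 minus the Pearson correlation as the dissimilarity, and average linkage. Whether these choices suit your data is something to check yourself.

```r
set.seed(1)

# Small stand-in for the real table: 1000 genes (rows) x 40 samples (columns)
expr <- matrix(rnorm(1000 * 40), ncol = 40)
colnames(expr) <- paste0("sample", 1:40)

# cor() correlates columns, so the samples can stay in columns here
sample_cor <- cor(expr)                # 40 x 40 correlation matrix

# Turn correlation into a dissimilarity and wrap it as a dist object
cor_dist <- as.dist(1 - sample_cor)

# Cluster the 40 samples; average linkage is an assumption, not a recommendation
hcl_cor <- hclust(cor_dist, method = "average")
```

Calling plot(hcl_cor) then draws the dendrogram in the same way as for the Euclidean version.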