
R large distance matrix in vegan


I am running R 3.2.3 on a machine with 128 GB of RAM. I have a large matrix of 123028 rows x 168 columns. I would like to use a hierarchical clustering algorithm in R, so before I do that, I am trying to create a distance matrix in R using the vegdist() function in the vegan package with the method Bray-Curtis. I get an error about memory allocation:

df <- as.data.frame(matrix(rnorm(20668704), nrow = 123028))
library(vegan)
mydist <- vegdist(df)

Error in vegdist(df) : long vectors (argument 4) are not supported in .Fortran

If I use the pryr package to find out how much memory is needed for the distance matrix, I see that 121 GB are needed, which is less than the RAM that I have.

library(pryr)
mem_change(x <- 1:123028^2)

121 GB
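The same figure can be reproduced without allocating anything. This is a minimal sketch, assuming 8-byte doubles and decimal gigabytes; the variable names are only illustrative. It also shows that storing just the lower triangle, as a `dist` object does, would need roughly half of that.

```r
n <- 123028                      # rows in the data matrix
bytes_per_double <- 8            # assumption: standard 8-byte doubles

full_square <- n^2               # number of values in 1:123028^2
lower_tri   <- n * (n - 1) / 2   # number of values a "dist" object stores

full_gb <- full_square * bytes_per_double / 1e9   # decimal GB, full square
tri_gb  <- lower_tri   * bytes_per_double / 1e9   # decimal GB, lower triangle

print(c(full_gb = full_gb, tri_gb = tri_gb))
```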

I know there used to be a limit of 2 billion values for a single object in R, but I thought that limit disappeared in recent versions of R. Is there another memory limit I'm not aware of?

The bottom line is that I am wondering: What can I do about this error? Is it really because of memory limits or am I wrong about that? I would like to stay in R and use a clustering algorithm besides k-means, so I need to calculate a distance matrix.


Solution

  • R can handle long vectors just fine, but it seems that the distance matrix calculation is implemented in C or Fortran and interfaced with R through .C or .Fortran, which do not accept long vectors (i.e. vectors with length > 2^31 - 1) as arguments. See the documentation, which states:

    Note that the .C and .Fortran interfaces do not accept long vectors, so .Call (or similar) has to be used.

    Looking at the source code for the vegdist() function, it looks like your matrix is being converted into a vector and then passed to a function implemented in C to calculate the distances. The relevant lines of code:

    d <- .C("veg_distance", x = as.double(x), nr = N, nc = ncol(x), 
            d = double(N * (N - 1)/2), diag = as.integer(FALSE), 
            method = as.integer(method), NAOK = na.rm, PACKAGE = "vegan")$d
    

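    And therein lies your problem, though the culprit is the output rather than the input. Your matrix has only 20,668,704 elements, well below the limit, but the result buffer d = double(N * (N - 1)/2) holds 7,567,882,878 values for N = 123028. That is a long vector, and it is argument 4 of the call, exactly as the error message says, so .C rejects it. You will have to look for a different package to calculate your distance matrix (or implement one yourself).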
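If you go the "implement one yourself" route, one possible starting point is to compute the dissimilarities in row blocks using base R only. The sketch below is an assumption-laden illustration, not vegan's implementation: `bray_block` is a hypothetical helper, the Bray-Curtis formula used is sum(|x_j - x_k|) / sum(x_j + x_k), and the tiny matrix `x` stands in for your data. Bray-Curtis is meant for non-negative data, so the example uses non-negative values rather than rnorm().

```r
# Hypothetical helper (not part of vegan): Bray-Curtis dissimilarities
# between every row of block `a` and every row of block `b`.
bray_block <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  out <- matrix(NA_real_, nrow(a), nrow(b))
  for (j in seq_len(nrow(b))) {
    bj  <- b[j, ]
    num <- rowSums(abs(sweep(a, 2, bj, "-")))  # sum of absolute differences
    den <- rowSums(a) + sum(bj)                # sum of both rows' totals
    out[, j] <- num / den
  }
  out
}

# Small non-negative example standing in for the real data.
x <- matrix(c(1, 2, 3,
              4, 0, 2,
              0, 5, 1,
              2, 2, 2), nrow = 4, byrow = TRUE)

# Process the rows in blocks; here the blocks are simply bound together,
# but each one could instead be written to disk or summarised.
block_size <- 2
starts <- seq(1, nrow(x), by = block_size)
blocks <- lapply(starts, function(s) {
  rows <- s:min(s + block_size - 1, nrow(x))
  bray_block(x[rows, , drop = FALSE], x)
})
d_full <- do.call(rbind, blocks)
print(round(d_full, 3))
```

On a small input like this, the result can be compared against as.matrix(vegdist(x)) as a sanity check. Note that working in blocks only sidesteps the .C restriction; holding all pairwise values for 123028 rows at once still takes roughly 60 GB, so whether this is practical depends on what the clustering step needs.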