Tags: r, cluster-analysis, k-means, hierarchical-clustering

Clustering Variables in R and Memory Usage


I'm trying to compute clusters of some variables in R with the cluster library. The code goes like this:

d2 <- dist(ant, method = "euclidean")

The problem is that it shows this message:

Error: cannot allocate vector of size 123.5 Gb

It's impossible to have that amount of memory. My data frame has more than 180,000 rows and 12 columns. Any suggestions?
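For reference, the reported size follows directly from how `dist()` stores its result: the lower triangle of the pairwise distance matrix, i.e. n(n-1)/2 double-precision values. A quick back-of-the-envelope check (the 180,000 is a round number from the question, so the exact figure will differ slightly):

```r
# dist() allocates n*(n-1)/2 doubles (8 bytes each) for the lower triangle
# of the pairwise distance matrix.
n <- 180000
gib <- n * (n - 1) / 2 * 8 / 2^30
gib  # ~120.7 GiB; a few thousand more rows reaches the reported 123.5 Gb
```

The quadratic growth is the whole problem: doubling the rows quadruples the memory.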


Solution

    1. Choose an approach that does not require a pairwise distance matrix, which always needs O(n²) memory. Several such algorithms exist.

    2. Simplify your data first. For example, merge duplicates into weights, and use an algorithm/implementation that supports weighted points.

    3. Subsample. If you have this many points, you probably do not need all of them. Work with a subsample instead.
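For point 1, a minimal sketch: `stats::kmeans` clusters the data matrix directly, so memory stays proportional to the data rather than O(n²). The `ant` matrix below is random stand-in data, since the original data frame isn't available; `centers = 5` is an arbitrary choice.

```r
# k-means works on the raw data matrix -- no pairwise distance matrix needed.
set.seed(42)
ant <- matrix(rnorm(180000 * 12), ncol = 12)  # stand-in for the real data frame
km <- kmeans(ant, centers = 5, nstart = 5)
table(km$cluster)  # cluster sizes
```

Note that k-means implicitly uses Euclidean distance, matching the `dist()` call in the question.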
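For point 3, the cluster package the question already uses ships `clara` (Clustering Large Applications), which runs PAM on repeated subsamples instead of building a full distance matrix. A sketch with stand-in data (`k = 5`, `samples`, and `sampsize` are illustrative choices, not tuned values):

```r
# clara: PAM on subsamples, assigning the remaining points to the
# nearest medoid -- avoids the n x n distance matrix entirely.
library(cluster)
set.seed(42)
ant <- matrix(rnorm(180000 * 12), ncol = 12)  # stand-in for the real data frame
cl <- clara(ant, k = 5, metric = "euclidean", samples = 5, sampsize = 200)
cl$medoids  # 5 representative rows, one per cluster
```

Unlike a plain subsample-then-cluster pipeline, `clara` still assigns every row to a cluster, which is often what you actually want.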