Tags: r, csv, amazon-ec2, cluster-analysis, r-daisy

R cluster package daisy() error: long vectors (argument 11) are not supported in .C


I am trying to convert a data.frame with numeric, nominal, and NA values into a dissimilarity matrix using the daisy() function from the cluster package in R. My goal is to create a dissimilarity matrix before applying k-means clustering for customer segmentation. The data.frame has 133,153 rows and 36 columns. Here is my session info:

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform x86_64-w64-mingw32/x64 (64-bit) 

How can I fix the daisy() error?

Since the Windows computer has only 3 GB of RAM, I increased the virtual memory to 100 GB, hoping that would be enough to create the matrix. It didn't work; I still got memory-related errors. I have looked into other R packages for getting around the memory problem, but they don't help here: bigmemory with biganalytics only accepts numeric matrices, and clara and ff likewise only accept numeric matrices.

CRAN's cluster package suggests the Gower similarity coefficient as a distance measure for mixed data before applying k-means; the Gower coefficient handles numeric, nominal, and NA values.
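For illustration, here is a minimal sketch on a made-up mixed-type data.frame (not my actual Client1.csv), showing the kind of call I expect to work with the Gower coefficient:

library(cluster)
# Toy data: numeric columns containing NAs plus one nominal (factor) column
toy <- data.frame(spend   = c(120.5, 80.0, NA, 300.2),
                  segment = factor(c("A", "B", "B", "A")),
                  visits  = c(3, 5, 2, NA))
d_toy <- daisy(toy, metric = "gower")
as.matrix(d_toy)   # 4 x 4 matrix of pairwise dissimilarities in [0, 1]

This works fine on small data; the problem only appears at the full 133,153 rows.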

Store1 <- read.csv("/Users/scdavis6/Documents/Work/Client1.csv", header = FALSE)
df <- as.data.frame(Store1)   # read.csv already returns a data.frame
save(df, file = "df.Rda")
library(cluster)
# Treat columns 1-35 as ordinal/ratio variables for the Gower coefficient
daisy1 <- daisy(df, metric = "gower", type = list(ordratio = c(1:35)))
#Error in daisy(df, metric = "gower", type = list(ordratio = c(1:35))) :
#  long vectors (argument 11) are not supported in .C
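If I read the error correctly, the issue is the size of the result rather than RAM alone: with n = 133,153 rows, daisy() has to return n(n-1)/2 pairwise dissimilarities, which is far more than the 2^31 - 1 elements the .C interface can pass. A quick check:

n <- 133153
n * (n - 1) / 2                  # ~8.86e9 pairwise dissimilarities
2^31 - 1                         # ~2.15e9, the longest vector .C accepts
n * (n - 1) / 2 * 8 / 1024^3     # ~66 GB just to store them as doubles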

**EDIT:** I have RStudio linked to an Amazon Web Services (AWS) r3.8xlarge instance with 244 GB of memory and 32 vCPUs. I tried creating the daisy matrix on my own computer, but it did not have enough RAM.

**EDIT 2:** I used the clara() function to cluster the dataset.

# clara() clusters via 50 subsamples instead of the full n x n dissimilarity matrix
clara2 <- clara(df, 3, metric = "euclidean", stand = FALSE, samples = 50,
                rngR = FALSE, pamLike = TRUE)
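For reference, the usual way to inspect the returned object, assuming the standard components documented for clara objects in the cluster package:

clara2$medoids             # the 3 representative rows (medoids)
table(clara2$clustering)   # cluster sizes over the full data set
clara2$silinfo$avg.width   # average silhouette width on the best subsample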

Solution

  • Use algorithms that do not require O(n^2) memory if you have a lot of data. Swapping to disk will kill performance, so that is not a sensible option.

    Instead, try either to reduce your data set size or to use index acceleration to avoid the O(n^2) memory cost. (And it's not only O(n^2) memory but also O(n^2) distance computations, which will take a long time!) A rough sketch of the subsampling route is below.
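    For example, one way to keep the O(n^2) step small in R is to run daisy() and pam() on a subsample only, then assign the remaining rows afterwards. A rough sketch, assuming df and k = 3 as in the question (the 5,000-row sample size is arbitrary):

    library(cluster)
    set.seed(42)
    sub   <- df[sample(nrow(df), 5000), ]   # 5,000 of the 133,153 rows
    d_sub <- daisy(sub, metric = "gower")   # ~12.5 million dissimilarities
    fit   <- pam(d_sub, k = 3)              # k-medoids on the Gower dissimilarities
    sub[fit$id.med, ]                       # the 3 medoid rows
    table(fit$clustering)                   # cluster sizes within the sample

    Each remaining row can then be assigned to its nearest medoid, or you can stick with clara() as in EDIT 2 once the nominal columns are encoded numerically.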