Tags: r, plot, cluster-analysis, large-data

How to create cluster plots for large datasets in R


I use the CLARA algorithm from Kaufman and Rousseeuw to cluster a large dataset (N > 8*10^6) in R. The implementation lets the user control execution time, e.g. by limiting the sample size to n = 100 via the sampsize argument.

However, the plot() function in R appears to draw all data objects, which results in a very long processing time and very crowded plots (see the reproducible example below).

In theory it should be possible to plot only the best sample from CLARA instead of all N objects. Is there an implementation for this, or how can I work around this issue?

## generate 2.5 million objects, divided into 2 clusters
x <- rbind(cbind(rnorm(10^6,0,0.5), rnorm(10^6,0,0.5)),
           cbind(rnorm(1.5*10^6,5,0.5), rnorm(1.5*10^6,5,0.5)))

library("cluster")
# get the clustering solution
clara.x <- clara(x, k = 2, sampsize = 100)
# see medoids
clara.x$medoids

# plot the cluster solution
plot(clara.x)     # takes a long time and creates a crowded plot
clusplot(clara.x) # did not finish



Solution

  • First off, it seems that plot() on a clara object produces two plots, the first of which is identical to the one returned by clusplot(). If the former finished but the latter did not, my guess is that you are simply clogging up the plot history. If you write large plots to a PNG file instead, you won't run into this problem. They will still take a while to render, but they won't interfere with whatever else you are doing.
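
    A minimal sketch of the save-to-file idea, assuming a small stand-in dataset so that it runs quickly. The file name and image dimensions are arbitrary choices, not anything taken from the question:

    ```r
    library(cluster)

    set.seed(1)
    # Small stand-in for the large dataset (assumption: two well-separated clusters)
    x <- rbind(cbind(rnorm(2000, 0, 0.5), rnorm(2000, 0, 0.5)),
               cbind(rnorm(3000, 5, 0.5), rnorm(3000, 5, 0.5)))
    clara.x <- clara(x, k = 2, sampsize = 100)

    # Render into a PNG file instead of the interactive plot device
    out_file <- file.path(tempdir(), "clara_clusplot.png")  # hypothetical file name
    png(out_file, width = 1200, height = 900)
    clusplot(clara.x)
    invisible(dev.off())  # close the device so the file is actually written
    ```

    With the full dataset the rendering time stays the same; only the output target changes.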

    To reduce the number of plotted points, you can manually subset the list elements of clara.x; you just have to choose which points to plot. The example below uses only the best sample found by clara(). If you want to plot more points, you can pick additional indices with sample() or similar:

    # Manually shrinking clara object
    samp <- clara.x$sample
    clara.x$data <- clara.x$data[samp, ]
    clara.x$clustering <- clara.x$clustering[samp]
    clara.x$i.med <- match(clara.x$i.med, samp) # re-index medoids relative to samp
    
    # plot the cluster solution
    clusplot(clara.x)
    

    One subtlety is that the medoids must always be among the indices you choose to plot; otherwise the match() call on the 5th line above returns NA. To ensure this for any given samp, add the following after the 2nd line above:

    samp <- union(samp, clara.x$i.med)
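
    As a sketch of the sample() variant mentioned above, the following picks a random subset, forces the medoids into it, and shrinks a copy rather than clara.x itself. The stand-in data and the names n_plot and clara.small are illustrative assumptions, not part of the cluster package:

    ```r
    library(cluster)

    set.seed(1)
    # Small stand-in for the large dataset (assumption, for a quick run)
    x <- rbind(cbind(rnorm(2000, 0, 0.5), rnorm(2000, 0, 0.5)),
               cbind(rnorm(3000, 5, 0.5), rnorm(3000, 5, 0.5)))
    clara.x <- clara(x, k = 2, sampsize = 100)

    # Pick random rows to plot and make sure the medoids are among them
    n_plot <- 500                                  # hypothetical plot budget
    samp <- sample(nrow(clara.x$data), n_plot)
    samp <- union(samp, clara.x$i.med)

    # Shrink a copy so the original clustering object stays intact
    clara.small <- clara.x
    clara.small$data <- clara.x$data[samp, , drop = FALSE]
    clara.small$clustering <- clara.x$clustering[samp]
    clara.small$i.med <- match(clara.x$i.med, samp)  # medoid positions within samp

    clusplot(clara.small)
    ```

    Working on a copy means clara.x can still be used later for anything that needs the full clustering vector.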
    

    ADDENDUM: I just saw the first answer, which differs from mine: it suggests re-computing the clustering. A benefit of my approach is that it keeps the original clustering computation and only changes which points are plotted.