r, k-means

Simple approach to assigning clusters for new data after k-means clustering


I'm running k-means clustering on a data frame df1, and I'm looking for a simple way to compute the closest cluster center for each observation in a new data frame df2 (with the same variable names). Think of df1 as the training set and df2 as the testing set; I want to cluster on the training set and then assign each test point to the cluster whose center is nearest.

I know how to do this with the apply function and a few simple user-defined functions (previous posts on the topic have usually proposed something similar):

df1 <- data.frame(x=runif(100), y=runif(100))
df2 <- data.frame(x=runif(100), y=runif(100))
km <- kmeans(df1, centers=3)
closest.cluster <- function(x) {
  cluster.dist <- apply(km$centers, 1, function(y) sqrt(sum((x-y)^2)))
  return(which.min(cluster.dist)[1])
}
clusters2 <- apply(df2, 1, closest.cluster)

However, I'm preparing this clustering example for a course in which students will be unfamiliar with the apply function, so I would much prefer to assign the clusters to df2 with a built-in function. Are there any convenient built-in functions to find the closest cluster?


Solution

  • You could use the flexclust package, which implements a predict method for k-means:

    library("flexclust")
    data("Nclus")
    
    set.seed(1)
    dat <- as.data.frame(Nclus)
    ind <- sample(nrow(dat), 50)
    
    dat[["train"]] <- TRUE
    dat[["train"]][ind] <- FALSE
    
    cl1 <- kcca(dat[dat[["train"]]==TRUE, 1:2], k=4, family=kccaFamily("kmeans"))
    cl1
    #
    # call:
    # kcca(x = dat[dat[["train"]] == TRUE, 1:2], k = 4)
    #
    # cluster sizes:
    #
    #  1   2   3   4 
    #130 181  98  91 
    
    pred_train <- predict(cl1)
    pred_test <- predict(cl1, newdata=dat[dat[["train"]]==FALSE, 1:2])
    
    image(cl1)
    points(dat[dat[["train"]]==TRUE, 1:2], col=pred_train, pch=19, cex=0.3)
    points(dat[dat[["train"]]==FALSE, 1:2], col=pred_test, pch=22, bg="orange")
    

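    [Figure: flexclust plot of the cluster regions, with the training points colored by cluster and the test points drawn as squares]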

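    There are also methods to convert results from clustering functions such as stats::kmeans or cluster::pam to objects of class kcca and vice versa. In the call below, cl is a previously fitted clustering (here a k-means fit) and x is the data it was fitted on: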

    as.kcca(cl, data=x)
    # kcca object of family ‘kmeans’ 
    #
    # call:
    # as.kcca(object = cl, data = x)
    #
    # cluster sizes:
    #
    #  1  2 
    #  50 50
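
Applied to the data frames from the question, the conversion route might look like the sketch below. It assumes flexclust is installed; the seed is arbitrary and only there for reproducibility, and the names kc and clusters2 are illustrative. The model is still fitted with stats::kmeans, and the conversion only exists so that predict can be called on new data.

```r
library(flexclust)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df1 <- data.frame(x = runif(100), y = runif(100))  # training set
df2 <- data.frame(x = runif(100), y = runif(100))  # test set

# Fit k-means with base R, then wrap the fit as a kcca object
km <- kmeans(df1, centers = 3)
kc <- as.kcca(km, data = df1)

# Assign each row of df2 to the nearest training centroid
clusters2 <- predict(kc, newdata = df2)
table(clusters2)
```

The centers are taken from km, so the cluster numbers in clusters2 should correspond to the rows of km$centers. No apply call appears in the code the students see.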