
Simple approach to assigning clusters for new data after k-modes clustering


I have a k-modes model (mymodel) fitted on a data frame mydf1. For each row of a new data frame mydf2, I want to assign the nearest cluster of mymodel. This is similar to this question, just with k-modes instead of k-means. The predict function of the flexclust package only works with numeric data, not categorical.

A short example:

library(klaR)
set.seed(100)
mydf1 <- data.frame(var1 = as.character(sample(1:20, 50, replace = TRUE)),
                    var2 = as.character(sample(1:20, 50, replace = TRUE)),
                    var3 = as.character(sample(1:20, 50, replace = TRUE)))
mydf2 <- data.frame(var1 = as.character(sample(1:20, 50, replace = TRUE)),
                    var2 = as.character(sample(1:20, 50, replace = TRUE)),
                    var3 = as.character(sample(1:20, 50, replace = TRUE)))
mymodel <- klaR::kmodes(mydf1, modes = 5)
# Get mode centers
mycenters <- mymodel$modes
# Now I would want to predict which of the 5 clusters each row 
# of mydf2 would be closest to, e.g.:
# cluster2 <- predict(mycenters, mydf2)

Is there already a function that can predict with a k-modes model? If not, what would be the simplest way to do that? Thanks!


Solution

  • We can use the distance measure that is used in the kmodes algorithm to assign each new row to its nearest cluster.

    ## From klaR::kmodes
    
    distance <- function(mode, obj, weights) {
      if (is.null(weights)) 
        return(sum(mode != obj))
      obj <- as.character(obj)
      mode <- as.character(mode)
      different <- which(mode != obj)
      n_mode <- n_obj <- numeric(length(different))
      for (i in seq(along = different)) {
        weight <- weights[[different[i]]]
        names <- names(weight)
        n_mode[i] <- weight[which(names == mode[different[i]])]
        n_obj[i] <- weight[which(names == obj[different[i]])]
      }
      dist <- sum((n_mode + n_obj)/(n_mode * n_obj))
      return(dist)
    }
    
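    # Assign each row of df to the nearest mode of a fitted kmodes object.
    AssignCluster <- function(df, kmodesObj)
    {
      # Columns of dists correspond to rows of df, rows to cluster modes;
      # weights = NULL selects the simple-matching distance.
      dists <- apply(df, 1, function(obj)
      {
        apply(kmodesObj$modes, 1, distance, obj, NULL)
      })
      # For each row of df, pick the mode with the smallest distance.
      apply(dists, 2, which.min)
    }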
    
    AssignCluster(mydf2,mymodel)
    
    [1] 4 3 4 1 1 1 2 2 1 1 5 1 1 3 2 2 1 3 3 1 1 1 1 1 3 1 1 1 3 1 1 1 1 2 1 5 1 3 5 1 1 4 1 1 2 1 1 1 1 1
    

    Note that many rows will likely be equally far from several modes: with the unweighted distance on three variables, the distance can only be 0, 1, 2, or 3. In such ties, which.min returns the first minimum, i.e. the cluster with the lowest number.
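
To see how often such ties occur, one option is to return, next to the assigned cluster, the number of modes that share the minimum distance. The following is only a sketch in base R: assign_with_ties is a hypothetical helper (not part of klaR), it uses the unweighted simple-matching distance only, and it assumes that the modes are passed as a data frame with the same columns as the new data. The toy modes and rows below are made up for illustration.

```r
# Hypothetical helper (not part of klaR): assigns each row of `newdata` to the
# nearest mode by simple-matching distance and reports how many modes tie.
assign_with_ties <- function(newdata, modes) {
  # Compare everything as character, column by column
  new_chr  <- do.call(cbind, lapply(newdata, as.character))
  mode_chr <- do.call(cbind, lapply(modes, as.character))
  # d[i, k] = number of variables on which row i differs from mode k
  d <- sapply(seq_len(nrow(mode_chr)), function(k) {
    mode_k <- matrix(mode_chr[k, ], nrow(new_chr), ncol(new_chr), byrow = TRUE)
    rowSums(new_chr != mode_k)
  })
  d <- matrix(d, nrow = nrow(new_chr))  # keep matrix shape for a single row
  min_d <- apply(d, 1, min)
  data.frame(
    cluster  = apply(d, 1, which.min),  # first minimum, as in AssignCluster
    distance = min_d,
    n_tied   = rowSums(d == min_d)      # > 1 means the assignment is ambiguous
  )
}

# Toy data, made up for illustration
modes   <- data.frame(var1 = c("a", "b"), var2 = c("x", "y"),
                      stringsAsFactors = FALSE)
newdata <- data.frame(var1 = c("a", "b", "a", "c"),
                      var2 = c("x", "y", "y", "z"),
                      stringsAsFactors = FALSE)
res <- assign_with_ties(newdata, modes)
print(res)
```

With a fitted model, the call would presumably be assign_with_ties(mydf2, mymodel$modes). Rows with n_tied greater than 1 are the ones where which.min picks the lowest-numbered cluster by default.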