I am using a k-modes model (mymodel) which was created from a data frame mydf1. I am looking to assign the nearest cluster of mymodel to each row of a new data frame mydf2.
Similar to this question, just with k-modes instead of k-means. The predict function of the flexclust package only works with numeric data, not categorical.
A short example:
library(klaR)
set.seed(100)
mydf1 <- data.frame(var1 = as.character(sample(1:20, 50, replace = TRUE)),
                    var2 = as.character(sample(1:20, 50, replace = TRUE)),
                    var3 = as.character(sample(1:20, 50, replace = TRUE)))
mydf2 <- data.frame(var1 = as.character(sample(1:20, 50, replace = TRUE)),
                    var2 = as.character(sample(1:20, 50, replace = TRUE)),
                    var3 = as.character(sample(1:20, 50, replace = TRUE)))
mymodel <- klaR::kmodes(mydf1, modes = 5)
# Get mode centers
mycenters <- mymodel$modes
# Now I would want to predict which of the 5 clusters each row
# of mydf2 would be closest to, e.g.:
# cluster2 <- predict(mycenters, mydf2)
Is there already a function which can predict with a k-modes model or what would be the simplest way to do that? Thanks!
We can use the distance measure that is used in the kmodes algorithm to assign each new row to its nearest cluster.
## From klaR::kmodes
distance <- function(mode, obj, weights) {
  if (is.null(weights))
    return(sum(mode != obj))
  obj <- as.character(obj)
  mode <- as.character(mode)
  different <- which(mode != obj)
  n_mode <- n_obj <- numeric(length(different))
  for (i in seq(along = different)) {
    weight <- weights[[different[i]]]
    names <- names(weight)
    n_mode[i] <- weight[which(names == mode[different[i]])]
    n_obj[i] <- weight[which(names == obj[different[i]])]
  }
  dist <- sum((n_mode + n_obj)/(n_mode * n_obj))
  return(dist)
}
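# Assign each row of df to the nearest mode of a fitted kmodes object.
# weights = NULL selects the simple-matching distance (count of mismatches).
AssignCluster <- function(df, kmodesObj) {
  # Distance matrix: one column per row of df, one row per cluster mode
  dists <- apply(df, 1, function(obj) {
    apply(kmodesObj$modes, 1, distance, obj, NULL)
  })
  # Index of the closest mode for each column (first minimum wins on ties)
  apply(dists, 2, which.min)
}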
AssignCluster(mydf2,mymodel)
[1] 4 3 4 1 1 1 2 2 1 1 5 1 1 3 2 2 1 3 3 1 1 1 1 1 3 1 1 1 3 1 1 1 1 2 1 5 1 3 5 1 1 4 1 1 2 1 1 1 1 1
Please note that this will likely produce many rows that are equally far from several clusters. In that case which.min returns the first minimum, so the row is assigned to the tied cluster with the lowest index.
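To see how often that tie caveat matters, something like the following could be used. It is only a sketch in base R: the names simple_matching_ties, new_rows and toy_modes are made up for illustration, and it covers just the unweighted simple-matching distance (the weights = NULL case above), not the weighted variant.

```r
# Hypothetical helper (not part of klaR): for each row of `df`, report the
# nearest mode, the distance to it, and how many modes share that distance.
simple_matching_ties <- function(df, modes) {
  df_mat <- as.matrix(df)
  modes_mat <- as.matrix(modes)

  # d[i, j] = number of variables on which row i of df differs from mode j
  d <- matrix(NA_real_, nrow = nrow(df_mat), ncol = nrow(modes_mat))
  for (i in seq_len(nrow(df_mat))) {
    for (j in seq_len(nrow(modes_mat))) {
      d[i, j] <- sum(df_mat[i, ] != modes_mat[j, ])
    }
  }

  min_d <- apply(d, 1, min)
  data.frame(
    cluster  = apply(d, 1, which.min),  # first minimum wins on ties
    distance = min_d,
    n_tied   = rowSums(d == min_d)      # > 1 means the assignment is ambiguous
  )
}

# Toy data, invented for this example
new_rows  <- data.frame(var1 = c("a", "b", "c"), var2 = c("x", "y", "z"))
toy_modes <- data.frame(var1 = c("a", "b"),      var2 = c("x", "x"))

res <- simple_matching_ties(new_rows, toy_modes)
print(res)
```

In the toy data the third row differs from both modes on both variables, so n_tied is 2 and its assignment to cluster 1 is arbitrary. With the objects from the question, simple_matching_ties(mydf2, mymodel$modes) should give the same clusters as AssignCluster(mydf2, mymodel) plus the tie counts.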