r, cluster-analysis, k-means

kmeans bug when specifying starting cluster centers in R?


I am trying to run kmeans step by step in R. When I set iter.max = 1 and pass the starting cluster centers in place of k, the algorithm seems to run until it converges instead of stopping after the single iteration I specified.

Can anyone confirm whether this is a known bug? If not, what am I missing?

Here is my code for reference:

# Set up data
data <- data.frame(names = c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2"), 
                   x = c(2, 2, 8, 5, 7, 6, 1, 4),
                   y = c(10, 5, 4, 8, 5, 4, 2, 9))

initial_centers <- matrix(c(2, 5, 1, 10, 8, 2), ncol=2)

# Run k means for 1 iteration
model <- kmeans(data[,-1], initial_centers, iter.max=1)
model$centers

# Actual Output:
#          x        y
# 1 3.666667 9.000000
# 2 7.000000 4.333333
# 3 1.500000 3.500000

# Expected Output:
#          x        y
# 1 2.000000 10.00000
# 2 6.000000 6.000000
# 3 1.500000 3.500000

Solution

  • This is not a bug. The default k-means algorithm in R is not the textbook one you learned in class. It's Hartigan and Wong's algorithm, which is more clever, so iter.max = 1 does not correspond to a single assign-then-update pass.

    If you want to assign each point to the nearest predefined center, don't abuse kmeans for this. Instead, just compute the distances and use argmin.
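
    The distance-and-argmin approach can be sketched as follows. This is a minimal illustration that reuses the data and initial_centers from the question; the names pts, sq_dist, assignment and new_centers are made up for this example, and it assumes plain squared Euclidean distance and that no cluster ends up empty.

```r
# Data and starting centers, copied from the question
data <- data.frame(names = c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2"),
                   x = c(2, 2, 8, 5, 7, 6, 1, 4),
                   y = c(10, 5, 4, 8, 5, 4, 2, 9))
initial_centers <- matrix(c(2, 5, 1, 10, 8, 2), ncol = 2)

pts <- as.matrix(data[, -1])

# Squared Euclidean distance from every point (rows) to every center (columns)
sq_dist <- sapply(seq_len(nrow(initial_centers)), function(j) {
  rowSums(sweep(pts, 2, initial_centers[j, ])^2)
})

# Assignment step: index of the nearest center for each point (the argmin)
assignment <- apply(sq_dist, 1, which.min)

# Update step: each center becomes the mean of the points assigned to it
# (assumes every cluster received at least one point)
new_centers <- apply(pts, 2, function(col) tapply(col, assignment, mean))
new_centers
```

    For this data the new centers come out as (2, 10), (6, 6) and (1.5, 3.5), which is the expected output in the question. If you would rather stay inside kmeans, it also accepts algorithm = "Lloyd"; with iter.max = 1 that should, as far as I can tell, perform the same single pass and warn that it did not converge.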