I am running the k-means algorithm in R on the Heart Disease UCI dataset. I expect to get 2 clusters with sizes 138 and 165, matching the target column in the dataset.
Steps:
df <- read.csv(".../heart.csv",fileEncoding = "UTF-8-BOM")
features <- subset(df, select = -target)
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
features <- data.frame(sapply(features, normalize))
set.seed(0)
cluster <- kmeans(features, 2)
cluster$size
Output:
[1] 99 204
Why?
It seems like you're focusing on the size of the clusters rather than the accuracy of your predictions. You may well get two clusters of sizes (138, 165), but they won't necessarily line up with the target column in the data.
A better way of judging performance is the accuracy of the predictions. In your case, the model accuracy is about 72%, which you can check with caret:
library(caret)  # provides confusionMatrix()

df$label <- cluster$cluster - 1
confusionMatrix(table(df$target, df$label))
#Confusion Matrix and Statistics
#
# 0 1
# 0 76 62
# 1 23 142
#
# Accuracy : 0.7195
# ...
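One caveat when scoring this way: `kmeans()` numbers its clusters arbitrarily, so the cluster labelled 1 may correspond to either target value depending on the seed, and `cluster$cluster - 1` can come out flipped. A safer approach is to try both mappings onto the 0/1 target and keep the better one. A minimal self-contained sketch with toy labels (not the heart data):

```r
# kmeans() cluster IDs (1/2) are arbitrary, so try both mappings
# onto a 0/1 target and keep the higher agreement.
accuracy_vs_target <- function(cluster_ids, target) {
  max(mean((cluster_ids - 1) == target),  # cluster 1 -> 0, cluster 2 -> 1
      mean((2 - cluster_ids) == target))  # cluster 1 -> 1, cluster 2 -> 0
}

# Toy example: a perfect 2-cluster assignment whose IDs came out flipped
accuracy_vs_target(c(2, 2, 1, 1), c(0, 0, 1, 1))  # 1
```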
I was able to get better accuracy by standardizing the data rather than normalizing it, possibly because min-max scaling is more sensitive to outliers (its endpoints are set entirely by the extreme values).
I also dummy-coded the categorical-looking variables, which seems to have improved the accuracy further. We now have 85% accuracy, and the cluster sizes are closer to what we expect (143, 160), although, as discussed, the cluster sizes on their own are meaningless.
library(dplyr)
library(fastDummies)
library(caret)
standardize <- function(x) {
  num <- x - mean(x, na.rm = TRUE)
  denom <- sd(x, na.rm = TRUE)
  num / denom
}
# dummy-code and standardize
features <- select(df, -target) %>%
  dummy_cols(select_columns = c('cp', 'thal', 'ca'),
             remove_selected_columns = TRUE, remove_first_dummy = TRUE) %>%
  mutate_all(standardize)
set.seed(0)
cluster <- kmeans(features, centers = 2, nstart = 50)
cluster$size
# 143 160
# check predictions vs actual labels
df$label <- cluster$cluster - 1
confusionMatrix(table(df$target, df$label))
#Confusion Matrix and Statistics
#
# 0 1
# 0 117 21
# 1 26 139
#
# Accuracy : 0.8449
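As an aside, a single `kmeans()` run is sensitive to the random initialization, which is part of why the cluster sizes swing around with the seed. That's what `nstart = 50` above addresses: it runs 50 random starts and keeps the solution with the lowest total within-cluster sum of squares. A toy sketch (simulated data, not the heart dataset):

```r
set.seed(0)
pts <- matrix(rnorm(200), ncol = 2)   # 100 toy 2-D points

fit1  <- kmeans(pts, centers = 2)               # one random start
fit50 <- kmeans(pts, centers = 2, nstart = 50)  # best of 50 random starts

# fit50 keeps the run with the lowest total within-cluster sum of squares,
# so across seeds it is far more stable than a single start.
c(fit1$tot.withinss, fit50$tot.withinss)
```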
Of course, there are other metrics worth considering too, such as out-of-sample accuracy (split your data into training and test sets, and calculate the accuracy of predictions on the test set) and the F1 score.
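For example, the F1 score can be read straight off the confusion table. A sketch using the counts from the standardized model above, treating target = 1 as the positive class:

```r
# Confusion counts from above: rows = actual (0/1), cols = predicted (0/1)
tab <- matrix(c(117,  21,
                 26, 139), nrow = 2, byrow = TRUE)

tp <- tab[2, 2]; fp <- tab[1, 2]; fn <- tab[2, 1]
precision <- tp / (tp + fp)                       # 139 / 160
recall    <- tp / (tp + fn)                       # 139 / 165
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 3)  # 0.855
```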