I am trying to explore a credit card fraud dataset to learn R and k-means clustering, but I ran into an issue while finding the optimal number of clusters. Unfortunately, I could not find much online about this warning, or about how to perform k-means clustering in R. What is the warning about, and why does the result show only 1 cluster? Thanks in advance!
Code:
data = read.csv("creditcard.csv")
scaled_data <- scale(data )
wss <- (nrow(scaled_data)-1)*sum(apply(scaled_data,2,var))
for (i in 2:100) wss[i] <- sum(kmeans(scaled_data, centers=i)$withiness)
plot(1:100, wss, type='b', xlab="Clusters", ylab="WSS")
Warning:
Warning messages:
1: Quick-TRANSfer stage steps exceeded maximum (= 14240350)
2: did not converge in 10 iterations
3: Quick-TRANSfer stage steps exceeded maximum (= 14240350)
4: did not converge in 10 iterations
You have several issues with your code. Let's go through it using an example data set available in R, since you did not provide reproducible data:
data(iris)
scaled_iris <- scale(iris[, -5])
Since the data have been scaled, every column has mean 0 and variance 1, so the sum of the squared values is all you need to compute the total:
wss <- sum(colSums(scaled_iris^2))
wss
# [1] 596
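As a quick check, the shortcut can be compared with the formula from the question. This is only a minimal sketch on the same iris data; the names total_from_var and total_from_ss are made up for illustration.

```r
data(iris)
scaled_iris <- scale(iris[, -5])

# Formula from the question: (n - 1) times the sum of the column variances
total_from_var <- (nrow(scaled_iris) - 1) * sum(apply(scaled_iris, 2, var))

# Shortcut: scaled columns have mean 0, so the sum of squares is the total
total_from_ss <- sum(colSums(scaled_iris^2))

all.equal(total_from_var, total_from_ss)
# [1] TRUE
```

Both give 596 here, which is (150 - 1) * 4 for 150 rows and 4 scaled columns.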
Now the clustering. I'll include the argument that @mhovd mentions, iter.max=, with its default value (there is no argument for convergence). If you get the "did not converge" warning, increase iter.max= to 15 or 20 or more. This does not guarantee that your results for any number of groups are optimal. To improve the chances of that, use the nstart= argument and set it to 5 or more:
for (i in 2:100) wss[i] <- kmeans(scaled_iris, centers=i, iter.max=10)$tot.withinss
head(wss);tail(wss)
# [1] 596.00000 220.87929 138.88836 113.97017 104.98669 81.03783
# [1] 3.188483 2.688470 2.716485 2.535701 2.497792 2.116150
plot(wss, type='b', xlab="Clusters", ylab="WSS")
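Putting the two suggestions together, a version with a higher iter.max= and several random starts might look like the sketch below. The seed, the upper limit max_k, and the specific values 20 and 5 are arbitrary choices for illustration, not recommendations for your data.

```r
data(iris)
scaled_iris <- scale(iris[, -5])

set.seed(42)  # arbitrary seed, only so the random starts are repeatable
max_k <- 15   # hypothetical upper limit; pick what suits your data

wss <- numeric(max_k)
wss[1] <- sum(colSums(scaled_iris^2))  # k = 1: total sum of squares
for (k in 2:max_k) {
  # more iterations per run, and keep the best of 5 random starts
  fit <- kmeans(scaled_iris, centers = k, iter.max = 20, nstart = 5)
  wss[k] <- fit$tot.withinss
}

plot(wss, type = "b", xlab = "Clusters", ylab = "WSS")
```

The Quick-TRANSfer warning comes from the default Hartigan-Wong algorithm. If it persists on a large data set, ?kmeans lists other choices for the algorithm= argument (for example "MacQueen"); whether that helps on your data is something you would have to test.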
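Note that you misspelled withinss, and you did not realize that kmeans returns their sum as tot.withinss. Because no component has the misspelled name, $withiness returns NULL and sum(NULL) is 0, so every value after the first was 0, which is why your plot appeared to show only 1 cluster. It is always good to read the manual page, ?kmeans. Note also that you do not need 1:100, since the plot function automatically supplies consecutive integers if you provide only one vector.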
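To see what the misspelling does, here is a minimal sketch on the iris data (the seed and centers = 3 are arbitrary choices for illustration). A component name that does not exist comes back as NULL, and summing NULL gives 0 without any error, which would explain a plot with one large value followed by zeros:

```r
data(iris)
scaled_iris <- scale(iris[, -5])

set.seed(1)  # arbitrary seed for repeatability
fit <- kmeans(scaled_iris, centers = 3)

is.null(fit$withiness)  # misspelled name: no such component
# [1] TRUE
sum(fit$withiness)      # summing NULL silently gives 0
# [1] 0

# Two equivalent ways to get the total within-cluster sum of squares
all.equal(sum(fit$withinss), fit$tot.withinss)
# [1] TRUE
```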