r, cluster-analysis, k-means, data-mining, credit-card

How to perform k-means clustering in R


I am exploring a credit card fraud dataset to learn R and k-means clustering, but I ran into a problem while trying to find the optimal number of clusters. I could not find much online about this warning or about how to perform k-means clustering in R. What does the warning mean, and why does the result show only one cluster? Thanks in advance!

Code:

data = read.csv("creditcard.csv")
scaled_data <- scale(data )
wss <- (nrow(scaled_data)-1)*sum(apply(scaled_data,2,var))
for (i in 2:100) wss[i] <- sum(kmeans(scaled_data, centers=i)$withiness)
plot(1:100, wss, type='b', xlab="Clusters", ylab="WSS")

Warning:

Warning messages:
1: Quick-TRANSfer stage steps exceeded maximum (= 14240350) 
2: did not converge in 10 iterations 
3: Quick-TRANSfer stage steps exceeded maximum (= 14240350) 
4: did not converge in 10 iterations 

Solution

  • Your code has several issues. Let's go through them using an example data set that ships with R, since you did not provide reproducible data:

    data(iris)
    scaled_iris <- scale(iris[, -5])
    

    Since the data have been scaled, every column has mean 0 and variance 1, so the total sum of squares is simply the sum of the squared values:

    wss <- sum(colSums(scaled_iris^2))
    wss
    # [1] 596
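
    As a quick check of the shortcut above, the sketch below compares it with the formula from the question and with the closed form (n - 1) * p, which should hold because scale() centers each column and divides it by its sample standard deviation. The names n, p, and the three total_* variables are illustrative only.

    ```r
    data(iris)
    scaled_iris <- scale(iris[, -5])   # center and scale the 4 numeric columns

    n <- nrow(scaled_iris)             # number of observations
    p <- ncol(scaled_iris)             # number of variables

    total_from_question <- (n - 1) * sum(apply(scaled_iris, 2, var))
    total_shortcut      <- sum(colSums(scaled_iris^2))
    total_closed_form   <- (n - 1) * p

    # All three should agree up to floating-point rounding
    all.equal(total_from_question, total_shortcut)
    all.equal(total_shortcut, total_closed_form)
    ```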
    

    Now the clustering. I'll include the argument that @mhovd mentions, iter.max=, with its default value of 10 (there is no argument for convergence). If you get the "did not converge" warning, increase iter.max= to 15, 20, or more. This does not guarantee that the result for any given number of groups is optimal. To improve the chances of that, use the nstart= argument and set it to 5 or more:

    for (i in 2:100) wss[i] <- kmeans(scaled_iris, centers=i, iter.max=10)$tot.withinss
    head(wss);tail(wss)
    # [1] 596.00000 220.87929 138.88836 113.97017 104.98669  81.03783
    # [1] 3.188483 2.688470 2.716485 2.535701 2.497792 2.116150
    plot(wss, type='b', xlab="Clusters", ylab="WSS")
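
    For comparison, here is a minimal sketch of the same loop with both iter.max= and nstart= raised, as suggested above. The seed, the upper bound max_k, and the specific values 20 and 5 are arbitrary choices for illustration, not recommendations for the credit card data.

    ```r
    data(iris)
    scaled_iris <- scale(iris[, -5])

    set.seed(42)                       # arbitrary seed, only for reproducibility
    max_k <- 15                        # illustrative upper bound on the number of clusters

    wss <- numeric(max_k)
    wss[1] <- sum(colSums(scaled_iris^2))   # total sum of squares (one cluster)
    for (k in 2:max_k) {
      # nstart = 5 runs kmeans from 5 random starts and keeps the best fit
      fit <- kmeans(scaled_iris, centers = k, iter.max = 20, nstart = 5)
      wss[k] <- fit$tot.withinss
    }

    plot(wss, type = "b", xlab = "Clusters", ylab = "WSS")
    ```

    On a large data set each extra start multiplies the run time, so a much smaller range than 2:100 is probably a sensible place to begin.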
    

    Note that you misspelled withinss, and that kmeans already returns the sum as tot.withinss. It is always good to read the manual page, ?kmeans. Also, you do not need 1:100, since the plot function automatically supplies consecutive integers when you give it a single vector.
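
    The effect of the misspelling can be seen in a small sketch. Asking a kmeans result for a component that does not exist returns NULL, and sum(NULL) is 0, so every wss[i] in the original loop would be 0. That is a likely reason the plot in the question appears to show only one cluster. The seed and the choice of 3 centers are arbitrary.

    ```r
    data(iris)
    scaled_iris <- scale(iris[, -5])

    set.seed(1)                        # arbitrary seed, only for reproducibility
    fit <- kmeans(scaled_iris, centers = 3)

    is.null(fit$withiness)             # TRUE: no component has this misspelled name
    sum(fit$withiness)                 # 0, because sum(NULL) is 0

    fit$withinss                       # one within-cluster sum of squares per cluster
    all.equal(sum(fit$withinss), fit$tot.withinss)   # TRUE
    ```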

    [Plot: WSS versus number of clusters]