Tags: cluster-analysis, data-science, k-means, unsupervised-learning

Selecting the k value for K-means clustering


I am going to build a K-means clustering model for outlier detection. For that, I need to identify the best number of clusters to use.

So far, I have tried to do this using the elbow method. I plotted the sum of squared error vs. the number of clusters (k), but I got a graph like the one below, which makes it hard to identify the elbow point.

[Plot: the sum of squared error vs. the number of clusters]

I need to know why I get a graph like this and how to identify the optimal number of clusters.
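For reference, a minimal sketch of the elbow computation, assuming scikit-learn and synthetic stand-in data (not the asker's actual dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in data: three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Sum of squared errors (scikit-learn calls it inertia_) for k = 1..9.
sse = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)

# On data with clear cluster structure, the curve drops sharply up to
# the true k and then flattens; that bend is the "elbow". On data
# without such structure, the curve decreases smoothly and no clear
# elbow appears -- which matches the graph described above.
```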


Solution

  • K-means is not suitable for outlier detection. This keeps popping up here all the time.

    1. K-means is conceptualized for "pure" data, with no false points. All measurements are supposed to come from the data, and only vary by some Gaussian measurement error. Occasionally this may yield some more extreme values, but even these are real measurements from the real clusters, and should be explained, not removed.
    2. K-means itself is known to not work well on noisy data where points do not belong to any cluster.
    3. It tends to split large real clusters in two, so points right in the middle of a real cluster end up with a large distance to the k-means centers.
    4. It tends to put outliers into their own clusters (because that reduces SSQ), so the actual outliers end up with a small distance, even 0.
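Point 4 is easy to demonstrate with a small sketch (synthetic data, scikit-learn assumed): give k-means one spare cluster and it spends it on the outlier, which then has distance 0 to "its" center.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight "real" clusters plus one extreme outlier (hypothetical data).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
outlier = np.array([[100.0, 100.0]])
X = np.vstack([cluster_a, cluster_b, outlier])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance of each point to its assigned center.
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# The outlier becomes a singleton cluster whose centroid is the point
# itself, so its distance is ~0 -- smaller than that of perfectly
# normal points sitting inside the real clusters.
```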

    Rather, use an actual outlier detection algorithm, such as Local Outlier Factor, kNN, or LoOP, that was conceptualized with noisy data in mind.