python, python-3.x, machine-learning, cluster-analysis, k-means

What would be the best k for this kmeans clustering? (Elbow point plot)


I am using k-means to find the best location to open a coffee shop near a subway station in Seoul.

Included features are:

  1. Total number of passengers alighting at the station per month
  2. Rental fees near the station
  3. Number of existing coffee shops near the station

I decided to use the elbow method to find the best k. I standardized all the features before running k-means.
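
For reference, the procedure described above can be sketched roughly as follows. This is a minimal sketch, not the exact code behind the plot: the helper name `elbow_sse`, the synthetic `stations` array, and the range of k values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def elbow_sse(features, k_values, seed=0):
    """Return {k: SSE} for k-means fitted on standardized features."""
    # Standardize so that no single feature dominates the distances.
    scaled = StandardScaler().fit_transform(features)
    sse = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled)
        # inertia_ is the within-cluster sum of squared distances (SSE).
        sse[k] = model.inertia_
    return sse


# Synthetic stand-in data (an assumption, not the real station data):
# columns are monthly alightings, rental fee, number of coffee shops.
rng = np.random.default_rng(0)
stations = np.column_stack([
    rng.normal(500_000, 150_000, 200),
    rng.normal(3_000, 800, 200),
    rng.poisson(20, 200),
])

sse_by_k = elbow_sse(stations, range(1, 9))
for k, sse in sse_by_k.items():
    print(k, round(sse, 1))
```

On standardized data the SSE at k=1 equals the number of rows times the number of features, so the absolute height of the curve mostly reflects the size of the dataset.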

[Image: elbow plot of SSE versus k]

The elbow point seems to be at k=3 (or maybe k=2), but I think the SSE is still too high there for it to be an elbow point.

Also, with k=3 it was difficult to gain insights from the clusters because there were only three of them.

Using k=5 was the sweet spot to gain insights.

Can using k=5 be justified even if it's not an elbow point?

Or is kmeans not a good option in the first place?


Solution

  • The elbow point is a heuristic, not a definitive rule: it works most of the time but not always, so I treat it as a good rule of thumb for choosing a number of clusters to start from. On top of that, the elbow point cannot always be identified unambiguously, so you shouldn't worry too much about it.

    So if k=5 gives you better results or a better understanding of your data, I would strongly suggest using k=5 rather than k=3!

    As for your other question, other approaches may suit your data better, but that doesn't mean k-means isn't a good way to start. If you want to try other things, the scikit-learn documentation offers good guidance on which clustering algorithm or method to use.
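
One way to back up a choice such as k=5 is to look at what the clusters actually contain for each candidate k. The sketch below is only an illustration under assumptions: `profile_clusters` is a hypothetical helper, the `stations` array is synthetic stand-in data, and the silhouette score is an extra heuristic that is not part of the answer above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def profile_clusters(features, k, seed=0):
    """Fit k-means on standardized features and summarize the clusters."""
    scaler = StandardScaler()
    scaled = scaler.fit_transform(features)
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled)
    # Map the centroids back to the original units so they are readable.
    centers = scaler.inverse_transform(model.cluster_centers_)
    # Number of stations assigned to each cluster.
    sizes = np.bincount(model.labels_, minlength=k)
    # Silhouette score lies in [-1, 1]; higher means better-separated clusters.
    score = silhouette_score(scaled, model.labels_)
    return centers, sizes, score


# Synthetic stand-in data (an assumption, not the real station data):
# columns are monthly alightings, rental fee, number of coffee shops.
rng = np.random.default_rng(0)
stations = np.column_stack([
    rng.normal(500_000, 150_000, 200),
    rng.normal(3_000, 800, 200),
    rng.poisson(20, 200),
])

profiles = {k: profile_clusters(stations, k) for k in (3, 5)}
for k, (centers, sizes, score) in profiles.items():
    print(f"k={k}: sizes={sizes.tolist()}, silhouette={score:.3f}")
    print(np.round(centers, 1))
```

If the k=5 centroids describe station types that are distinct and useful for the decision (for example, high traffic with low rent and few coffee shops), that interpretability is a reasonable justification in itself, whatever the elbow plot suggests.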