Tags: python, python-3.x, cluster-analysis, k-means

How to improve the result of k-means clustering


I have the following small table, and I used the k-means clustering algorithm to cluster the rows.

| Neighborhood     | Cluster | Italian Restaurant | Japanese | Pizza | Sandwich | Fast Food |
|------------------|---------|--------------------|----------|-------|----------|-----------|
| Central Toronto  | 0       | 33                 | 8        | 17    | 10       | 2         |
| Downtown Toronto | 1       | 77                 | 55       | 12    | 17       | 14        |
| East Toronto     | 2       | 7                  | 9        | 2     | 4        | 3         |
| East York        | 2       | 4                  | 3        | 4     | 3        | 1         |
| Etobicoke        | 0       | 18                 | 6        | 20    | 7        | 9         |
| North York       | 2       | 4                  | 9        | 9     | 13       | 14        |
| Scarborough      | 3       | 1                  | 8        | 23    | 15       | 29        |
| West Toronto     | 2       | 7                  | 5        | 7     | 7        | 5         |
| York             | 2       | 8                  | 4        | 7     | 2        | 0         |

To me, Scarborough and North York look very similar, with high numbers for "Sandwich" and "Fast Food" and nearly the same number for "Japanese". However, Scarborough is grouped by itself, while North York is grouped with four other neighborhoods that do not actually look that similar to it at first glance.

I used the following code for clustering:

```python
# run k-means clustering
from sklearn.cluster import KMeans

kmeans = KMeans(init="k-means++", n_clusters=4).fit(df)
```

Can anyone help me understand why this happens, or whether there is any way to fix it?

P.S. When I ran my code yesterday, I believe it clustered those two in one group, but now it clusters them as shown above.
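
As a side note on the P.S.: if I'm not mistaken, `KMeans` starts from randomly chosen initial centroids, so repeated runs can settle in different local optima. Fixing the `random_state` parameter makes the result reproducible. A minimal sketch, assuming `df` holds only the venue-count columns:

```python
# Pin the random seed so repeated runs give the same clustering.
# (df is assumed to hold only the venue-count columns.)
from sklearn.cluster import KMeans

kmeans = KMeans(init="k-means++", n_clusters=4, random_state=0).fit(df)
print(kmeans.labels_)  # the same labels on every run now
```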


Solution

  • Intuitively, similarity along one dimension does not necessarily mean that two points are close to each other. To make this easier to visualize, consider a 2-dimensional example with two points: one at (0, 10) and the other at (0, 0), with other points such as (1, 1), (3, 2), (-1, -3), and so on. Looking at the first two points, you might think they are very similar (identical, in fact) in the first dimension, so they should be grouped together. But if you plot this example, it is clear that (0, 0) is closer to the other points than it is to (0, 10).

    So this may provide some intuition for why similarity in 3 of the 5 dimensions does not imply closeness overall; a numeric check of this toy example appears after this list.

    Furthermore, the fast-food difference between the two is still rather large. If I recall correctly, k-means seeks to minimize within-cluster distances, so "both have high numbers" does not mean anything by itself, whereas "the distance in this dimension is 15" (a large distance in this dataset) does; the second sketch below checks this on the actual table.
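
To make the 2-D example concrete, here is a minimal sketch that computes the Euclidean distances for the points quoted above:

```python
# Distances for the 2-D example: (0, 0) and (0, 10) match in the first
# dimension, but (0, 0) is far closer to the remaining points.
import numpy as np

p1 = np.array([0, 10])
p2 = np.array([0, 0])
others = np.array([[1, 1], [3, 2], [-1, -3]])

print(np.linalg.norm(p1 - p2))              # 10.0 -> the "similar" pair is far apart
print(np.linalg.norm(others - p2, axis=1))  # ~[1.4, 3.6, 3.2] -> all near (0, 0)
print(np.linalg.norm(others - p1, axis=1))  # ~[9.1, 8.5, 13.0] -> all far from (0, 10)
```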
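
And the same check on the question's own data, as a sketch that rebuilds the table by hand (the truncated "ItalianRe" header is assumed to mean "Italian Restaurant"):

```python
# Pairwise Euclidean distances from North York to Scarborough and to the
# cluster-2 neighborhoods, using the venue counts from the question's table.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Italian Restaurant": [33, 77, 7, 4, 18, 4, 1, 7, 8],
        "Japanese":           [8, 55, 9, 3, 6, 9, 8, 5, 4],
        "Pizza":              [17, 12, 2, 4, 20, 9, 23, 7, 7],
        "Sandwich":           [10, 17, 4, 3, 7, 13, 15, 7, 2],
        "Fast Food":          [2, 14, 3, 1, 9, 14, 29, 5, 0],
    },
    index=["Central Toronto", "Downtown Toronto", "East Toronto", "East York",
           "Etobicoke", "North York", "Scarborough", "West Toronto", "York"],
)

ny = df.loc["North York"]
# ~20.9: the Pizza gap (14) and Fast Food gap (15) dominate the distance
print(round(np.linalg.norm(df.loc["Scarborough"] - ny), 1))
# Each cluster-2 member is nearer to North York than Scarborough is (~12-19)
for name in ["East Toronto", "East York", "West Toronto", "York"]:
    print(name, round(np.linalg.norm(df.loc[name] - ny), 1))
```

So even though Scarborough and North York agree on "Japanese" and roughly on "Sandwich", the "Pizza" and "Fast Food" gaps push them far apart in the full 5-dimensional space, which is consistent with the clustering you got.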