javascript, cluster-analysis, data-visualization, k-means

Clustering commercial information using K-means for visual mapping


I'm trying to perform clustering on small datasets shown to end-users:

[
  [1.76, 81, 5, 0],
  [2.99, 72, 5, 0],
  [11.17, 420, 4.8, 0],
  [1.76, 53, 5, 0],
  [16.73, 3403, 5, 0],
  ... // 20 entries per user
]

Columns are 1) retail price, 2) fulfilled orders, 3) rating and 4) shipping respectively.
I want to cluster this data into several groups to visualize it on a JS frontend.

I'm using ecStat for echarts, and it works, but the results change from run to run:

[1, 1, 1, 1, 2, 3, 1, 1, 3, 1, 1, 4, 0, 3, 3, 1, 1, 1, 1, 1]
[3, 3, 3, 3, 4, 2, 3, 3, 2, 3, 3, 1, 0, 2, 2, 3, 3, 3, 3, 3]
[3, 3, 3, 3, 4, 2, 3, 3, 2, 3, 3, 1, 0, 2, 2, 3, 3, 3, 3, 3]
[2, 2, 2, 2, 0, 3, 2, 2, 3, 2, 2, 4, 1, 3, 3, 2, 2, 2, 2, 2]

Thus I can't visualize it properly, since I am using size/color visual mapping based on clusters.
For example, the 3 cheapest items with the highest rating get a green color and the maximum radius, 5 medium-priced items get a yellowish color, 8 items get a red color and the minimum size, and so on.

Is it possible to get 'stable' results within 'set' clusters? Is it even a viable idea to use k-means and similar tools for clustering items by lowest price, highest rating, number of orders, etc.?

How should one approach such tasks in general? Any advice is much appreciated!


Solution

  • K-means begins with a random initialization by default.

    If you don't want that, you can, e.g.,

    1. Use a stable (deterministic) algorithm instead
    2. Choose the previous centers as starting points
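
Option 2 can be sketched as follows. This is a minimal, self-written Lloyd's iteration, not ecStat's API; `kmeansFromCenters`, `nearestCenter` and `squaredDistance` are hypothetical helper names introduced here for illustration. Given the same points and the same starting centers, it returns the same labels every time.

```javascript
// Squared Euclidean distance between two points of equal length.
function squaredDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Index of the center closest to `point` (ties go to the lowest index).
function nearestCenter(point, centers) {
  let best = 0;
  let bestDist = Infinity;
  centers.forEach((center, i) => {
    const dist = squaredDistance(point, center);
    if (dist < bestDist) {
      bestDist = dist;
      best = i;
    }
  });
  return best;
}

// Lloyd's k-means started from caller-supplied centers instead of random ones.
function kmeansFromCenters(points, initialCenters, maxIterations = 100) {
  let centers = initialCenters.map((c) => c.slice());
  let labels = [];
  for (let iter = 0; iter < maxIterations; iter++) {
    const newLabels = points.map((p) => nearestCenter(p, centers));
    const unchanged =
      labels.length === newLabels.length &&
      newLabels.every((label, i) => label === labels[i]);
    labels = newLabels;
    if (unchanged) break; // assignments stopped changing: converged
    // Move each center to the mean of its members; keep it if the cluster is empty.
    centers = centers.map((old, k) => {
      const members = points.filter((_, i) => labels[i] === k);
      if (members.length === 0) return old;
      return old.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return { labels, centers };
}
```

Passing the centers from the previous run as `initialCenters`, or a fixed choice such as the first k rows, should keep cluster indices from being reshuffled between renders as long as the data itself does not change much.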

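    Results that differ between runs by more than a permutation of the cluster labels usually indicate a suboptimal clustering. K-means is also sensitive to scale: in your data, order counts in the hundreds or thousands dominate the distance, while prices and ratings barely contribute. So it probably does not make sense to run it on the raw data you have there. You need to understand what k-means does and how to prepare your data to get useful results.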
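
Both points can be illustrated with two small helpers. This is a sketch under assumptions: `standardizeColumns` and `canonicalLabels` are hypothetical names, z-score scaling is only one of several reasonable ways to put columns on a common scale, and relabeling by first appearance is just one way to make permuted labelings comparable.

```javascript
// Rescale every column to mean 0 and standard deviation 1 (z-score).
// A constant column (like the all-zero shipping column) is mapped to 0.
function standardizeColumns(rows) {
  const n = rows.length;
  const dims = rows[0].length;
  const means = [];
  const stds = [];
  for (let d = 0; d < dims; d++) {
    const mean = rows.reduce((s, r) => s + r[d], 0) / n;
    const variance = rows.reduce((s, r) => s + (r[d] - mean) ** 2, 0) / n;
    means.push(mean);
    stds.push(Math.sqrt(variance));
  }
  return rows.map((r) =>
    r.map((v, d) => (stds[d] === 0 ? 0 : (v - means[d]) / stds[d]))
  );
}

// Renumber labels in order of first appearance, so two labelings that
// differ only by a permutation of cluster ids become identical arrays.
function canonicalLabels(labels) {
  const mapping = new Map();
  return labels.map((label) => {
    if (!mapping.has(label)) mapping.set(label, mapping.size);
    return mapping.get(label);
  });
}

// The first two label arrays from the question.
const run1 = [1, 1, 1, 1, 2, 3, 1, 1, 3, 1, 1, 4, 0, 3, 3, 1, 1, 1, 1, 1];
const run2 = [3, 3, 3, 3, 4, 2, 3, 3, 2, 3, 3, 1, 0, 2, 2, 3, 3, 3, 3, 3];
const samePartition =
  JSON.stringify(canonicalLabels(run1)) === JSON.stringify(canonicalLabels(run2));
```

Applied to the four label arrays in the question, `canonicalLabels` gives the same array each time, so those runs found the same grouping and only the cluster numbers were shuffled. For a size/color mapping, it may be more useful to number the clusters by a property of their centers, such as ascending mean price, so that an index always means the same thing.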