Tags: nlp, cluster-analysis, k-means, centroid

Labels of clustered data and KMeans cluster centers


Following up on an earlier question (Starting question), I have doubts about how the coordinates of the cluster centres are calculated and how the centres are labelled:

kmeans.cluster_centers_

gives

[[ 4.87744023 -0.48344163]
 [ 8.29540909  6.7398487 ]
 [ 1.05638163  3.84314976]]

I'm confused by the order of the centres. The first one is the 'green' cluster (label 2 in the plot), the second is the 'red' cluster (label 0 in the plot), and the last is the 'blue' one, with label 1 in the plot. What is the logic behind it?

Also, what if I already have labelled data as a starting point for clustering, for example the Wine Quality dataset or Twitter sentiment analysis data? I know the labels and would like to preserve them as the cluster labels and, of course, relate them to the cluster centres.


Solution

  • The order of the clusters is usually arbitrary; no significance is attached to it. It probably depends on the order in which the data points are processed or on how the initial centres are chosen, but this makes no real difference, as the numbers are just labels.

    If your data points already have labels, simply take the n data points closest to the centre of each cluster and assign the cluster the most frequent label among them. You are unlikely to get a clustering as clean as in the example, since some data points will commonly end up in a different cluster from the others with the same label, or in between clusters.

    The procedure would basically be:

    1. Set up an (empty) list for each cluster.
    2. For each labelled data point, find the closest centre and add the point's label to that centre's list.
    3. For each cluster, count how many times each label occurs in its list and pick the label with the highest count.
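
    The three steps above can be sketched in plain Python as follows. This is only an illustration under some assumptions: the function name `label_clusters` is made up, the centres are copied from the question, and the sample points and their colour labels are invented for the example. It uses every labelled point, as in the numbered steps, rather than only the n closest to each centre.

```python
from collections import Counter
from math import dist


def label_clusters(points, labels, centres):
    """Give each centre the most frequent label among the points nearest to it."""
    # Step 1: one (empty) tally of labels per cluster.
    votes = [Counter() for _ in centres]

    # Step 2: add each labelled point's label to the tally of its closest centre.
    for point, label in zip(points, labels):
        nearest = min(range(len(centres)), key=lambda i: dist(point, centres[i]))
        votes[nearest][label] += 1

    # Step 3: pick the most frequent label per cluster (None if no point landed there).
    return [tally.most_common(1)[0][0] if tally else None for tally in votes]


# Centres copied from the question; the points and labels are made up for illustration.
centres = [(4.87744023, -0.48344163), (8.29540909, 6.7398487), (1.05638163, 3.84314976)]
points = [(5.0, -0.5), (4.5, 0.0), (8.0, 7.0), (8.5, 6.5), (1.0, 4.0), (1.2, 3.5), (4.0, 1.0)]
labels = ["green", "green", "red", "red", "blue", "blue", "blue"]

cluster_labels = label_clusters(points, labels, centres)
print(cluster_labels)  # ['green', 'red', 'blue']
```

    The last point carries the label "blue" but lies closest to the first centre, so it is outvoted there; this is the kind of imperfect assignment mentioned above. With scikit-learn, row i of `kmeans.cluster_centers_` is the centre of the cluster that `kmeans.labels_` and `kmeans.predict` number i, so position i of the returned list gives a known label for cluster i.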