python scikit-learn cluster-analysis

How to best align data labeling for comparison


I have two different sets of labels for the same data points, for example from clustering the data with sklearn.cluster.KMeans and with sklearn.cluster.AgglomerativeClustering, which give somewhat different clusters.

I want to see the differences between the results of the two approaches, but simply comparing the cluster number each data point was given under each method is not meaningful, as the numbers are assigned arbitrarily.

That is, even if a group of data points lands in the same cluster under both methods, that cluster might be numbered '2' in one and '0' in the other. The numbers are meaningless beyond distinguishing categories.
Comparing these labels directly would (incorrectly) show that the two methods strongly disagree about these points, even though they land in the same cluster.

I could iterate over all possible permutations of one list of labels (that is, swap labels in one list while keeping the other list the same), compare each option's agreement with the other list, and settle on the option with the smallest number of disagreements. But I assume there is a saner option, and likely one that already exists.

Any ideas?

Example clustering labels:

label_a= [1 1 5 2 2 2 3 3 2 2 3 2 2 2 2 3 2 3 2 2 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 4 4 4 5 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 3 4 4 2 4]

label_b=  [3 3 4 0 0 0 1 1 0 0 1 0 0 0 0 1 0 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 3 0 0 0 0 0 0 0 0 5 5 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 0 2]

Solution

  • As answered several times before:

    1. Use measures such as the Adjusted Rand Index (ARI) or Normalized Mutual Information (NMI), which compare partitions rather than labels and so don't need the labels to be "aligned". This is the standard approach.
    2. Use the Hungarian algorithm to find the best alignment. This is less common, and you'll still have to handle the case where the two results don't have the same number of clusters.
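
Both options can be sketched roughly as follows. This is only an illustration under some assumptions: `align_labels` is a hypothetical helper name, the toy labels `a` and `b` are made up rather than taken from the question, and the alignment step uses `scipy.optimize.linear_sum_assignment`, which solves the same assignment problem the Hungarian algorithm addresses (the `maximize=True` argument needs SciPy 1.4 or later). Clusters left unmatched because the two results have different numbers of clusters are simply given fresh ids here, which is one possible choice among several.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def align_labels(reference, labels):
    """Hypothetical helper: relabel `labels` to agree with `reference` as often as possible."""
    reference = np.asarray(reference)
    labels = np.asarray(labels)
    ref_ids, ref_idx = np.unique(reference, return_inverse=True)
    lab_ids, lab_idx = np.unique(labels, return_inverse=True)

    # Contingency table: rows are reference clusters, columns are clusters in `labels`.
    table = np.zeros((ref_ids.size, lab_ids.size), dtype=int)
    np.add.at(table, (ref_idx, lab_idx), 1)

    # One-to-one matching of clusters that maximises the total overlap.
    # Rectangular tables are allowed, so the cluster counts may differ.
    rows, cols = linear_sum_assignment(table, maximize=True)
    mapping = {lab_ids[c]: ref_ids[r] for r, c in zip(rows, cols)}

    # Assumption: clusters in `labels` with no partner get new, unused ids.
    next_id = max(ref_ids.max(), lab_ids.max()) + 1
    for lab in lab_ids:
        if lab not in mapping:
            mapping[lab] = next_id
            next_id += 1

    return np.array([mapping[lab] for lab in labels])


# Toy labels (not the data from the question).
a = [1, 1, 1, 0, 0, 2, 2]
b = [0, 0, 0, 2, 2, 1, 2]

# Option 1: partition-based scores, no alignment needed.
print("ARI:", adjusted_rand_score(a, b))
print("NMI:", normalized_mutual_info_score(a, b))

# Option 2: align b to a, then compare point by point.
aligned = align_labels(a, b)
print("aligned b:", aligned)  # -> [1 1 1 0 0 2 0]
print("agreement:", np.mean(aligned == np.asarray(a)))
```

Since ARI and NMI depend only on the partition, they give the same value before and after relabeling; the alignment is mainly useful when you want to see which individual points the two methods disagree on.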