Tags: python, machine-learning, cluster-analysis, k-means, similarity

How can you compare two cluster groupings in terms of similarity or overlap in Python?


Simplified example of what I'm trying to do:

Let's say I have 3 data points A, B, and C. I run KMeans clustering on this data and get 2 clusters [(A,B),(C)]. Then I run MeanShift clustering on this data and get 2 clusters [(A),(B,C)]. So clearly the two clustering methods have clustered the data in different ways. I want to be able to quantify this difference. In other words, what metric can I use to determine percent similarity/overlap between the two cluster groupings obtained from the two algorithms? Here is a range of scores that might be given:

  • 100% score for [(A,B),(C)] vs. [(A,B),(C)]
  • ~50% score for [(A,B),(C)] vs. [(A),(B,C)]
  • ~20% score for [(A,B),(C)] vs. [(A,B,C)]

These scores are somewhat arbitrary because I'm not sure how to measure similarity between two different cluster groupings. Keep in mind that this is a simplified example; in real applications there can be many data points and more than 2 clusters per grouping. Such a metric would also be useful for comparing a cluster grouping against a labeled grouping of the data, when labels are available.

Edit: One idea that I have is to take every cluster in the first cluster grouping and get its percent overlap with every cluster in the second cluster grouping. This would give you a similarity matrix of clusters in the first cluster grouping against clusters in the second cluster grouping. But then I'm not sure what you would do with this matrix. Maybe take the highest similarity score in each row or column and do something with that?
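
Here is a minimal sketch of that overlap-matrix idea, assuming the two groupings are encoded as flat label arrays (one label per data point, in the same order) and using scikit-learn's contingency_matrix helper; the toy labels and the max-per-row aggregation are illustrative, not a standard metric:

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Illustrative encodings of the toy example: [(A,B),(C)] vs. [(A),(B,C)]
labels_kmeans = np.array([0, 0, 1])     # A and B together, C alone
labels_meanshift = np.array([0, 1, 1])  # A alone, B and C together

# Rows = clusters of the first grouping, columns = clusters of the second;
# each cell counts how many points the two clusters share.
overlap = contingency_matrix(labels_kmeans, labels_meanshift)
print(overlap)
# [[1 1]
#  [0 1]]

# One naive way to reduce the matrix to a single score: match each row to
# its best-overlapping column and divide by the total number of points.
score = overlap.max(axis=1).sum() / overlap.sum()
print(score)  # 0.666...
```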


Solution

  • Use clustering evaluation metrics.

    Many of these metrics are symmetric, so they can compare two clusterings directly rather than only a clustering against ground-truth labels. The adjusted Rand index (ARI) is a common choice.

    A value close to 1 means the two groupings are very similar, a value close to 0 means the agreement is no better than chance, and a value well below 0 means each cluster of one grouping is spread "evenly" over the clusters of the other.
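
    A minimal sketch using scikit-learn's adjusted_rand_score on the toy example from the question; the label arrays are illustrative encodings of the groupings for A, B, and C, and the values differ from the question's rough percentages because ARI is adjusted for chance:

    ```python
    from sklearn.metrics import adjusted_rand_score

    kmeans    = [0, 0, 1]  # [(A,B),(C)]
    meanshift = [0, 1, 1]  # [(A),(B,C)]
    one_blob  = [0, 0, 0]  # [(A,B,C)]

    print(adjusted_rand_score(kmeans, kmeans))     #  1.0  -> identical groupings
    print(adjusted_rand_score(kmeans, meanshift))  # -0.5  -> chance-adjusted, so tiny examples can go negative
    print(adjusted_rand_score(kmeans, one_blob))   #  0.0  -> no agreement beyond chance
    ```

    Because adjusted_rand_score is symmetric, it works the same whether the second argument is another algorithm's output or a set of ground-truth labels.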