Tags: machine-learning, data-mining, cluster-analysis

Finding Accuracy of a Clustering Algorithm


How to find the accuracy of a clustering algorithm given the true clustering and predicted clustering of the algorithm?

I searched the web but couldn't find any useful source. I know how to compute the accuracy of a classification algorithm.


Solution

  • There exist a number of methods, some of which are discussed on the Wikipedia page "Cluster analysis", section "External evaluation".

    Pair-counting based indexes (F-measure, Rand index, etc.) seem to be the most popular. They are quite easy to compute; in fact, easier than some of the set-matching measures: the Hungarian algorithm for finding the optimal 1:1 cluster alignment runs in O(n^3), while all the pair-counting measures can be computed in a single pass over the intersection matrix, i.e. in O(n^2) (where n is the number of clusters).

    You can find a visual comparison technique based on the pair-counting measures (along with an implementation of a dozen external measures) in the paper below. In my experience it is not that useful on real data; it is more suited to understanding the differences between two algorithms on 2D toy data.

    Achtert, Elke, et al. "Evaluation of Clusterings – Metrics and Visual Support." Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE), 2012.

    Note that there is a big issue with comparing a new clustering to "known" clusterings:

    By doing so, you actually punish novel solutions.

    But when using cluster analysis, you want a novel solution: if all you wanted were the labels you already have, you would not need to run a clustering algorithm at all. In fact, a good clustering result may well diverge from the known solution and offer an alternate view on the data.
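
As a concrete illustration of the pair-counting approach mentioned above, here is a minimal sketch of the Rand index in pure Python, computed from the contingency (intersection) matrix of the two clusterings. The function name `rand_index` is my own choice; in practice you would likely use a library implementation instead (e.g. `sklearn.metrics.rand_score` or `adjusted_rand_score`).

```python
from collections import Counter

def rand_index(true_labels, pred_labels):
    """Rand index via pair counting.

    Over all unordered pairs of points, count how often the two
    clusterings agree: the pair is in the same cluster in both,
    or in different clusters in both.
    """
    n = len(true_labels)
    assert n == len(pred_labels)

    # Contingency matrix: count of points per (true cluster, predicted cluster).
    contingency = Counter(zip(true_labels, pred_labels))
    row_sums = Counter(true_labels)   # cluster sizes in the true clustering
    col_sums = Counter(pred_labels)   # cluster sizes in the predicted clustering

    def pairs(k):
        # Number of unordered pairs among k items: k choose 2.
        return k * (k - 1) // 2

    total_pairs = pairs(n)
    same_both = sum(pairs(c) for c in contingency.values())
    same_true = sum(pairs(c) for c in row_sums.values())
    same_pred = sum(pairs(c) for c in col_sums.values())

    # Pairs apart in both clusterings, by inclusion-exclusion.
    apart_both = total_pairs - same_true - same_pred + same_both
    return (same_both + apart_both) / total_pairs
```

Note that, unlike classification accuracy, this is invariant to a permutation of the cluster labels: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0, because the measure only looks at which pairs of points are grouped together, not at the label values themselves.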