Tags: python, cluster-analysis

Calculating cluster accuracy


I would like to write Python code to calculate the cluster accuracy r as follows:

r = (A1 + ... + Ai + ... + Ak) / (the number of data objects)

where Ai is the number of data objects that occur both in the i-th cluster and in its corresponding true cluster.

I need to implement it to compare clustering performance against a research paper that uses this accuracy criterion.
I searched sklearn for an existing method to do this but could not find one, so I tried to write it myself.

Here is the code I wrote:

    import numpy as np

    def cluster_accuracy(true, pred):
        # For each label in the prediction, extract the true labels at the
        # same indices as 'labels'. Then count the instances of each
        # true label in 'labels', and assume the one with the maximum
        # count is the corresponding true label.
        pred_to_true_conversion = {}
        for p in np.unique(pred):
            labels = true[pred == p]
            unique, counts = np.unique(labels, return_counts=True)
            label_count = dict(zip(unique, counts))
            pred_to_true_conversion[p] = max(label_count, key=label_count.get)

        # Count the instances whose true label is the same as the
        # converted predicted label.
        count = 0
        for t, p in zip(true, pred):
            if t == pred_to_true_conversion[p]:
                count += 1

        return count / len(true)

However, I do not think my "label remapping" approach is a clever way to do this, and there should be a better way to calculate r. My method has problems such as:

  1. It relies on the assumption that the corresponding true label is the one that occurs most frequently within a predicted cluster, but that is not always the case.
  2. Different predicted cluster labels can be mapped to the same true cluster label, especially when the number of classes differs between the true and predicted labels.

How can I implement the accuracy r? Or is there a method in any existing clustering libraries to do this?


Solution

  • I believe what you are describing is something I also wanted to do a while back. This is how I solved it:

    from sklearn.metrics.cluster import contingency_matrix
    from sklearn.preprocessing import normalize
    
    normalize(contingency_matrix(labels_pred=pred, labels_true=true), norm='l1', axis=1)
    

    This matrix gives the recall for each cluster/label combination.
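    To make that concrete, here is a small self-contained run on toy labels (the labels below are invented for illustration):

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import normalize

# Toy data: 6 objects, 2 true classes, 2 predicted clusters.
true = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([0, 0, 1, 1, 1, 1])

# Rows correspond to true classes, columns to predicted clusters;
# row-normalizing gives, for each true class, the fraction of its
# objects recovered in each predicted cluster (recall).
cm = contingency_matrix(labels_pred=pred, labels_true=true)
recall = normalize(cm, norm='l1', axis=1)
print(recall)
```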

    edit:

    The problems you state are, I believe, inherent to this method. For some reason, some papers prefer to report accuracy or an F-measure for clustering results even though these measures are not quite suited to the task. This paper uses a different way of calculating an F-measure for clustering results that at least solves the problem of multiple clusters being mapped to a single truth label: they use a task-assignment algorithm to resolve that specific issue.
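    As a side note, the accuracy r from the question can itself be computed with the same assignment idea. Here is a sketch using `scipy.optimize.linear_sum_assignment` (the helper name `cluster_accuracy` and the toy labels are mine, not from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

def cluster_accuracy(true, pred):
    # cm[i, j] = number of objects in true class i and predicted cluster j.
    cm = contingency_matrix(labels_true=true, labels_pred=pred)
    # linear_sum_assignment minimizes cost, so negate to maximize the number
    # of matched objects; it also handles rectangular matrices, i.e. a
    # different number of clusters and classes.
    row_ind, col_ind = linear_sum_assignment(-cm)
    return cm[row_ind, col_ind].sum() / cm.sum()

# Toy data: cluster 1 mostly holds class 0, cluster 0 mostly holds class 1.
true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
acc = cluster_accuracy(true, pred)
print(acc)  # 5/6: class 0 -> cluster 1, class 1 -> cluster 0
```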

    This is my code for the 'Hungarian F1' score:

    import numpy as np
    from munkres import Munkres
    from sklearn.metrics.cluster import contingency_matrix
    from sklearn.preprocessing import normalize

    def f_matrix(labels_pred, labels_true):
        # Calculate the F1 matrix from per-cell precision and recall.
        cont_mat = contingency_matrix(labels_pred=labels_pred, labels_true=labels_true)
        precision = normalize(cont_mat, norm='l1', axis=0)
        recall = normalize(cont_mat, norm='l1', axis=1)
        som = precision + recall
        f1 = np.round(np.divide(2 * recall * precision, som, out=np.zeros_like(som), where=som != 0), 3)
        return f1

    def f1_hungarian(f1):
        # Munkres minimizes total cost, so minimize (1 - F1) to find the
        # assignment that maximizes total F1, then average the matched scores.
        m = Munkres()
        inverse = 1 - f1
        indices = m.compute(inverse.tolist())
        fscore = sum(f1[i] for i in indices) / len(indices)
        return fscore

    f1_hungarian(f_matrix(pred, true))
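    For a quick self-contained check of the same idea, here is a run on toy labels, using scipy's `linear_sum_assignment` in place of munkres (it solves the same assignment problem) and skipping the rounding step; the labels are invented for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import normalize

true = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([1, 1, 0, 0, 0, 0])

# Per-cell precision (column-normalized) and recall (row-normalized).
cm = contingency_matrix(labels_pred=pred, labels_true=true)
precision = normalize(cm, norm='l1', axis=0)
recall = normalize(cm, norm='l1', axis=1)
som = precision + recall
f1 = np.divide(2 * precision * recall, som, out=np.zeros_like(som), where=som != 0)

# Maximize total F1 over a one-to-one class/cluster assignment by
# minimizing (1 - F1), then average the matched scores.
rows, cols = linear_sum_assignment(1 - f1)
score = f1[rows, cols].mean()
print(score)  # ≈ 0.8286
```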