I would like to write Python code to calculate the cluster accuracy r, defined as follows:
r = (A1 + ... + Ai + ... + Ak) / (the number of data objects)

where Ai is the number of data objects occurring both in the i-th cluster and in its corresponding true cluster.
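For a concrete illustration (the labels here are made up), with six data objects and two clusters:

```python
import numpy as np

# Made-up example: 6 data objects, 2 predicted clusters
true = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([0, 0, 1, 1, 1, 1])

# Cluster 0 holds true labels [0, 0]       -> best match is label 0, A1 = 2
# Cluster 1 holds true labels [0, 1, 1, 1] -> best match is label 1, A2 = 3
r = (2 + 3) / len(true)  # = 5/6
```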
I need to implement it in order to compare clustering performance with a research paper that uses this accuracy criterion. I searched for an existing method in sklearn but could not find one, so I tried to write it myself.
Here is the code I wrote:
import numpy as np

def cluster_accuracy(true, pred):
    # For each label in the prediction, extract the true labels at the
    # same indices. Count the occurrences of each true label and assume
    # the one with the maximum count is the corresponding true label.
    pred_to_true_conversion = {}
    for p in np.unique(pred):
        labels = true[pred == p]
        unique, counts = np.unique(labels, return_counts=True)
        label_count = dict(zip(unique, counts))
        pred_to_true_conversion[p] = max(label_count, key=label_count.get)
    # Count the instances whose true label is the same as the converted
    # predicted label.
    count = 0
    for t, p in zip(true, pred):
        if t == pred_to_true_conversion[p]:
            count += 1
    return count / len(true)
However, I do not think my "label remapping" approach is a clever way to do this, and there should be a better way to calculate r. My method also has problems; for example, multiple predicted clusters can end up mapped to the same true label. How can I implement the accuracy r? Or is there a method in an existing clustering library that does this?
I believe what you are describing is something I also wanted to do a while back. This is how I solved it:
from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import normalize
normalize(contingency_matrix(labels_pred=pred, labels_true=true), norm='l1', axis=1)
This matrix gives the recall for each cluster/label combination.
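Starting from the same contingency matrix, the accuracy r from the question can also be computed directly. This is a sketch assuming the same majority-vote mapping as in the question's code (so, like that code, it allows several clusters to map to one true label):

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def accuracy_r(true, pred):
    # Rows are true classes, columns are predicted clusters; the maximum
    # of column i is Ai, the overlap of cluster i with its best-matching
    # true class.
    cm = contingency_matrix(labels_true=true, labels_pred=pred)
    return cm.max(axis=0).sum() / len(true)

# e.g. accuracy_r([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])  -> 5/6
```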
edit:
The problems you state with this method are, I believe, inherent to it. For some reason, some papers prefer to report accuracy or an F-measure for clustering results even though these aren't quite suited to the task. This paper uses a different way of calculating an F-measure for clustering results that at least solves the problem of multiple clusters being mapped to a single truth label: they use a task assignment algorithm to resolve this specific issue.
This is my code for the 'Hungarian F1' score:
import numpy as np
from munkres import Munkres
from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import normalize

def f_matrix(labels_pred, labels_true):
    # Calculate the F1 matrix from per-cell precision and recall
    cont_mat = contingency_matrix(labels_pred=labels_pred, labels_true=labels_true)
    precision = normalize(cont_mat, norm='l1', axis=0)
    recall = normalize(cont_mat, norm='l1', axis=1)
    som = precision + recall
    f1 = np.round(np.divide((2 * recall * precision), som,
                            out=np.zeros_like(som), where=som != 0), 3)
    return f1

def f1_hungarian(f1):
    m = Munkres()
    inverse = 1 - f1  # Munkres minimises cost, so invert the F1 scores
    indices = m.compute(inverse.tolist())
    fscore = sum([f1[i] for i in indices]) / len(indices)
    return fscore
f1_hungarian(f_matrix(pred, true))
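If you would rather avoid the munkres dependency, scipy's linear_sum_assignment solves the same assignment problem. This is a sketch of an equivalent computation (the function name f1_hungarian_scipy is my own; the F1 matrix is built the same way as above, minus the rounding):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import normalize

def f1_hungarian_scipy(labels_pred, labels_true):
    # Pairwise F1 between each predicted cluster and each true class
    cm = contingency_matrix(labels_pred=labels_pred, labels_true=labels_true)
    precision = normalize(cm, norm='l1', axis=0)
    recall = normalize(cm, norm='l1', axis=1)
    som = precision + recall
    f1 = np.divide(2 * recall * precision, som,
                   out=np.zeros_like(som), where=som != 0)
    # linear_sum_assignment can maximise directly, so no cost inversion
    rows, cols = linear_sum_assignment(f1, maximize=True)
    return f1[rows, cols].mean()
```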