cluster-analysis hierarchical-clustering hierarchical

how to compute accuracy of AgglomerativeClustering

hi i use the sample in python of AgglomerativeClustering i try to estimate the performance but it switches the original labels i try to compare the predicted labels y_hc and the original label y return by make blobs

import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt
data,y = make_blobs(n_samples=300, n_features=2, centers=4, cluster_std=2, random_state=50)
plt.figure(2)
# create dendrogram
dendrogram = sch.dendrogram(sch.linkage(data, method='ward'))
plt.title('dendrogram')

# create clusters linkage="average", affinity=metric , linkage = 'ward' affinity = 'euclidean'
hc = AgglomerativeClustering(n_clusters=4, linkage="average", affinity='euclidean')

# save clusters for chart
y_hc = hc.fit_predict(data,y)

plt.figure(3)

# create scatter plot
plt.scatter(data[y==0,0], data[y==0,1], c='red', s=50)
plt.scatter(data[y==1, 0], data[y==1, 1], c='black', s=50)
plt.scatter(data[y==2, 0], data[y==2, 1], c='blue', s=50)
plt.scatter(data[y==3, 0], data[y==3, 1], c='cyan', s=50)

plt.xlim(-15,15)
plt.ylim(-15,15)


plt.scatter(data[y_hc ==0,0], data[y_hc == 0,1], s=10, c='red')
plt.scatter(data[y_hc==1,0], data[y_hc == 1,1], s=10, c='black')
plt.scatter(data[y_hc ==2,0], data[y_hc == 2,1], s=10, c='blue')
plt.scatter(data[y_hc ==3,0], data[y_hc == 3,1], s=10, c='cyan')
for ii in range(4):
        print(ii)
        i0=y_hc==ii
        counts = np.bincount(y[i0])
        valCountAtorgLbl = (np.argmax(counts))
        accuracy0Tp=100*np.max(counts)/y[y==valCountAtorgLbl].shape[0]
        accuracy0Fp = 100 * np.min(counts) / y[y ==valCountAtorgLbl].shape[0]

print([accuracy0Tp,accuracy0Fp])
plt.show()

Solution

The clustering does and cannot reproduce the original labels, only the original partitions.

You seem to assume that cluster 1 corresponds to label 1 (in faftz one could be labeled 'iris setosa', and there obviously is no way an unsupervised algorithm will come up with this cluster name...). It usually won't - there probably isn't the same number of clusters and classes there either, and there could be unlabeled noise piintsl You can use the Hungarian algorithm to compute the optimum mapping (or just a greedy matching) to produce a more intuitive color mapping.