python scikit-learn cluster-analysis k-means multilabel-classification

How to evaluate K-Means clustering when the automatic cluster indexes don't match the true labels?


How do we measure the accuracy of a K-Means clustering algorithm (say, generate a confusion matrix), given that the automatically assigned cluster indexes are probably a permutation of the original labels?


Solution

  • I'm not entirely sure what you mean either. Your original labels are presumably the ground-truth labeling. The clustering result returned by k-means is usually one integer per sample, taking as many distinct values as the k clusters you ask the algorithm to produce.

    I typically use the pandas.crosstab function to cross-tabulate the ground-truth labels against the k-means labels and see how the two labelings line up.

    For better visualization, you may want to use the following:

    import seaborn as sns
    import matplotlib.pyplot as plt
    
    plt.figure(figsize=(30,10))
    
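    # plot the cross-tabulation as a heatmap; crosstab_groundtruth_kmeans is
    # the table returned by pandas.crosstab on the true and k-means labels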
    ax = sns.heatmap(crosstab_groundtruth_kmeans.T, 
                    square=True, annot=True, fmt='.2f')
    
    ax.set_yticklabels(
        ax.get_yticklabels(),
        rotation=0);
    


    Good luck!~
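To make the cross-tabulation idea concrete, here is a minimal sketch of one way to go from the table to a confusion matrix and an accuracy score. It is not part of the answer above: the iris dataset, the variable names, and the use of scipy's linear_sum_assignment (the Hungarian method) to pick the cluster-to-label permutation are all assumptions made for illustration, and it assumes the number of clusters equals the number of true classes.

```python
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix

# Example data (an assumption; any labeled dataset works the same way).
X, y_true = load_iris(return_X_y=True)

# Fit k-means with as many clusters as there are true classes.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Cross-tabulate ground truth (rows) against k-means cluster indexes (columns).
crosstab_groundtruth_kmeans = pd.crosstab(
    pd.Series(y_true, name="ground truth"),
    pd.Series(cluster_ids, name="k-means cluster"),
)
print(crosstab_groundtruth_kmeans)

# Pick the one-to-one pairing of clusters and labels that maximizes the
# number of matched samples (swapped-in technique: Hungarian method).
counts = crosstab_groundtruth_kmeans.to_numpy()
row_idx, col_idx = linear_sum_assignment(counts, maximize=True)
cluster_to_label = {
    int(crosstab_groundtruth_kmeans.columns[c]): int(crosstab_groundtruth_kmeans.index[r])
    for r, c in zip(row_idx, col_idx)
}

# Relabel each sample's cluster index with its matched true label,
# after which the usual supervised metrics can be computed.
y_pred = np.array([cluster_to_label[int(c)] for c in cluster_ids])
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
print(cm)
print(acc)
```

If k-means is run with more clusters than there are classes, a one-to-one pairing no longer covers every cluster, so the mapping step would need a different rule (for example, assigning each cluster to its majority class).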