python scikit-learn cluster-analysis confusion-matrix scikits

Confusion matrix for Clustering in scikit-learn


I have a set of data with known labels. I want to try clustering and see if I can get the same clusters given by known labels. To measure the accuracy, I need to get something like a confusion matrix.

I know I can get a confusion matrix easily for a test set of a classification problem. I already tried that like this.

However, it can't be used for clustering, because it expects both rows and columns to have the same set of labels, which makes sense for a classification problem. For a clustering problem, what I want is something like this:

Rows - Actual labels

Columns - New cluster names (i.e. cluster-1, cluster-2 etc.)

Is there a way to do this?

Edit: Here are more details.

In sklearn.metrics.confusion_matrix, it expects y_test and y_pred to have the same values, and labels to be the labels of those values.

That's why it gives a matrix which has the same labels for both rows and columns.

But in my case (KMeans clustering), the real values are strings and the estimated values are numbers (i.e. cluster numbers).

Therefore, if I call confusion_matrix(y_true, y_pred), it gives the error below.

ValueError: Mix of label input types (string and number)

This is the real problem. For a classification problem, this makes sense. But for a clustering problem, this restriction shouldn't be there, because real label names and new cluster names don't need to be the same.

With this, I understand that I'm trying to use a tool meant for classification problems on a clustering problem. So, my question is: is there a way I can get such a matrix for my clustered data?

Hope the question is now clearer. Please let me know if it isn't.


Solution

  • I wrote the code myself.

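    # Compute a confusion matrix for clustering output:
    # rows are actual labels, columns are cluster ids
    def confusion_matrix(act_labels, pred_labels):
        # Sort so that the row and column order is the same on every run
        unique_labels = sorted(set(act_labels))
        clusters = sorted(set(pred_labels))
        # Map each label / cluster id to its row / column index, so that
        # cluster ids do not have to be the consecutive integers 0..k-1
        row_of = {label: i for i, label in enumerate(unique_labels)}
        col_of = {cluster: j for j, cluster in enumerate(clusters)}
        cm = [[0] * len(clusters) for _ in unique_labels]
        for act_label, pred_label in zip(act_labels, pred_labels):
            cm[row_of[act_label]][col_of[pred_label]] += 1
        return cm

    # Example
    labels = ['a', 'b', 'c',
              'a', 'b', 'c',
              'a', 'b', 'c',
              'a', 'b', 'c']
    pred = [1, 1, 2,
            0, 1, 2,
            1, 1, 1,
            0, 1, 2]
    cnf_matrix = confusion_matrix(labels, pred)
    # One row per actual label, one right-aligned column per cluster
    print('\n'.join([''.join(['{:4}'.format(item) for item in row])
          for row in cnf_matrix]))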
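
Since the goal in the question was to measure accuracy, one possible next step is to map every cluster to the actual label that occurs most often in it and count how many samples agree with that mapping. The following is only a sketch under that assumption. The helper name `majority_label_accuracy` is made up for illustration, and majority voting is just one of several ways to match clusters to labels (two clusters may end up mapped to the same label).

```python
from collections import Counter, defaultdict

def majority_label_accuracy(act_labels, pred_labels):
    """Map each cluster to its most frequent actual label and score the mapping.

    Hypothetical helper for illustration; assumes non-empty inputs of equal length.
    """
    # Count the actual labels separately for every cluster id.
    per_cluster = defaultdict(Counter)
    for act, cluster in zip(act_labels, pred_labels):
        per_cluster[cluster][act] += 1
    # Pick the most common actual label in each cluster.
    mapping = {c: counts.most_common(1)[0][0] for c, counts in per_cluster.items()}
    # Fraction of samples whose cluster maps back to their actual label.
    hits = sum(mapping[c] == act for act, c in zip(act_labels, pred_labels))
    return mapping, hits / len(act_labels)

labels = ['a', 'b', 'c'] * 4
pred = [1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 1, 2]
mapping, accuracy = majority_label_accuracy(labels, pred)
print(mapping)
print(accuracy)
```

For the example data, cluster 0 maps to 'a', cluster 1 to 'b' and cluster 2 to 'c', and 9 of the 12 samples agree with that mapping.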
    

    Edit: (Dayyyuumm) just found that I could do this easily with Pandas Crosstab :-/.

    import pandas as pd

    labels = ['a', 'b', 'c',
              'a', 'b', 'c',
              'a', 'b', 'c',
              'a', 'b', 'c']
    pred = [1, 1, 2,
            0, 1, 2,
            1, 1, 1,
            0, 1, 2]

    # Create a DataFrame with labels and clusters as columns: df
    df = pd.DataFrame({'Labels': labels, 'Clusters': pred})
    
    # Create crosstab: ct
    ct = pd.crosstab(df['Labels'], df['Clusters'])
    
    # Display ct
    print(ct)
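
For completeness, scikit-learn itself has a function for this kind of table: `sklearn.metrics.cluster.contingency_matrix`. It indexes the actual labels and the cluster ids separately, so mixing strings and numbers should not trigger the ValueError from the question. The snippet below is a minimal sketch that reuses the example data from above and assumes scikit-learn is installed.

```python
from sklearn.metrics.cluster import contingency_matrix

labels = ['a', 'b', 'c'] * 4
pred = [1, 1, 2, 0, 1, 2, 1, 1, 1, 0, 1, 2]

# Rows: actual labels in sorted order ('a', 'b', 'c').
# Columns: cluster ids in sorted order (0, 1, 2).
cm = contingency_matrix(labels, pred)
print(cm)
```

Unlike the crosstab, the result is a plain NumPy array without row or column names, so the sorted label and cluster order has to be kept in mind when reading it.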