I am learning about performance metrics. I have a dataframe with 10,100 rows (index 0-10099) and two columns (Y_Actual, Y_Predicted). I would like to create a confusion matrix with pandas.
My first attempt:
y_actual = df5a["y"].rename("Actual")
y_predicted = df5a["labels"].rename("Predicted")
confusion_matrix_5a = pd.crosstab(y_actual, y_predicted)
confusion_matrix_5a
output1:
Predicted      1
Actual
0.0          100
1.0        10000
After checking all my Y_Predicted values, I realized that they were all "1". To get pandas.crosstab() to produce a full 2x2 matrix in this situation, I added an extra row to my dataframe (Y_actual=0, Y_predicted=0).
output2:
Predicted  0      1
Actual
0.0        1    100
1.0        0  10000
The real confusion matrix should be:
Predicted  0      1
Actual
0.0        0    100
1.0        0  10000
The "1" in output2 is there only because I added the extra row. I know this will barely affect my accuracy: with this many rows, the effect of one extra row is negligible.
Do you know any other way to create the full matrix with pandas.crosstab() when one of the columns contains only a single unique value? Any suggestions about how to do it without adding the extra row?
crosstab only picks up values that are actually present in the columns, so you need to populate the missing column manually. A simple way to do that is reindex.
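Let's say conf_mat is your confusion matrix with only one column. Then you can do conf_mat.reindex([0, 1], axis='columns', fill_value=0) to force the dataframe to hold columns named 0 and 1, with the missing column filled with zeros. Note that reindex returns a new dataframe, so assign the result to a variable.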
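As a minimal sketch of the reindex approach, the snippet below uses a small made-up dataframe (the name df and its values are assumptions for illustration, not the asker's df5a) in which every prediction is 1. It reindexes both rows and columns so the result is always 2x2, even if a class is also missing from the actual labels.

```python
import pandas as pd

# Hypothetical toy data: 3 actual negatives, 5 actual positives,
# and a model that predicts 1 for every row.
df = pd.DataFrame({
    "Actual": [0, 0, 0, 1, 1, 1, 1, 1],
    "Predicted": [1, 1, 1, 1, 1, 1, 1, 1],
})

# crosstab only creates columns for values that occur, so this has a single column.
conf_mat = pd.crosstab(df["Actual"], df["Predicted"])

# Force both labels onto rows and columns; absent cells are filled with 0.
labels = [0, 1]
full_conf_mat = conf_mat.reindex(index=labels, columns=labels, fill_value=0)

print(full_conf_mat)
```

Reindexing the rows as well is optional here, but it guards against the mirror case where one class never appears among the actual values.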
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html