python, pandas, scikit-learn, sklearn-pandas

Sklearn: how to get mean squared error on classifying training data


I'm trying to solve some classification problems with sklearn for the first time in Python, and I was wondering what the best way is to calculate the error of my classifier (e.g. an SVM) solely on the training data.

My sample code for calculating accuracy and RMSE is as follows:

    from math import sqrt
    import numpy as np
    from sklearn import svm
    from sklearn.metrics import mean_squared_error

    svc = svm.SVC(kernel='rbf', C=C, decision_function_shape='ovr').fit(X_train, y_train.ravel())
    prediction = svc.predict(X_test)
    svm_in_accuracy.append(svc.score(X_train, y_train))                            # in-sample accuracy
    svm_out_rmse.append(sqrt(mean_squared_error(np.asarray(y_test), prediction)))  # out-of-sample RMSE
    svm_out_accuracy.append(np.mean(np.asarray(y_test) == prediction))             # out-of-sample accuracy

I know that `from sklearn.metrics import mean_squared_error` pretty much gives me the MSE for an out-of-sample comparison. What can I do in sklearn to get an error metric for how well or badly my model classified the training data? I ask because I know my data is not perfectly linearly separable (which means the classifier will misclassify some items), and I want to know the best way to get an error metric for how far off it was. Any help would be appreciated!


Solution

  • To evaluate your classifier you can use the following metrics:

    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import classification_report
    from sklearn.metrics import roc_curve
    from sklearn.metrics import roc_auc_score
    

    The confusion matrix has the predicted labels as column headings and the true labels as row labels. The main diagonal of the confusion matrix shows the number of correctly assigned labels; the off-diagonal elements contain the numbers of incorrectly assigned labels. From the confusion matrix you can also calculate accuracy, precision and recall. Both the classification report and the confusion matrix are straightforward to use - you pass the true and predicted labels to the functions:

    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    
    [[1047    5]
     [   0  448]]
    
                precision    recall  f1-score   support
    
            0.0       1.00      1.00      1.00      1052
            1.0       0.99      1.00      0.99       448
    
    avg / total       1.00      1.00      1.00      1500
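
    Since the question is specifically about error on the training data, the same functions can be pointed at predictions made for the training set. A minimal sketch, reusing the `svc`, `X_train`, and `y_train` names from the question's snippet (so they are assumptions here):

    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

    # Predict on the *training* data to measure in-sample misclassification
    y_train_pred = svc.predict(X_train)

    # Misclassification rate on the training set (1 - accuracy)
    print("training error:", 1.0 - accuracy_score(y_train, y_train_pred))

    # Same confusion matrix / report, computed on the training set
    print(confusion_matrix(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))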
    

    The other metric functions compute the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) of the ROC, which you can then plot. You can read about ROC here:

    http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html

    http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
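
    As a rough illustration of how those two functions fit together (again assuming the `svc` model and the `X_test`/`y_test` split from the question, binary labels, and that matplotlib is available for plotting), a sketch could look like this:

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    # ROC needs continuous scores, not hard labels; SVC provides them
    # through decision_function (or predict_proba with probability=True)
    scores = svc.decision_function(X_test)

    fpr, tpr, thresholds = roc_curve(y_test, scores)   # false/true positive rates
    print("AUC:", roc_auc_score(y_test, scores))

    plt.plot(fpr, tpr)
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.show()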