Tags: machine-learning, scikit-learn, roc, auc, precision-recall

How to interpret this triangular-shaped ROC curve?


I have 10+ features and about a dozen thousand cases for training a logistic regression to classify people's race. The first example is French vs. non-French, and the second is English vs. non-English. The results are as follows:

//////////////////////////////////////////////////////

1= fr
0= non-fr
Class count:
0    69109
1    30891
dtype: int64
Accuracy: 0.95126
Classification report:
             precision    recall  f1-score   support

          0       0.97      0.96      0.96     34547
          1       0.92      0.93      0.92     15453

avg / total       0.95      0.95      0.95     50000

Confusion matrix:
[[33229  1318]
 [ 1119 14334]]
AUC= 0.944717975754

//////////////////////////////////////////////////////

1= en
0= non-en
Class count:
0    76125
1    23875
dtype: int64
Accuracy: 0.7675
Classification report:
             precision    recall  f1-score   support

          0       0.91      0.78      0.84     38245
          1       0.50      0.74      0.60     11755

avg / total       0.81      0.77      0.78     50000

Confusion matrix:
[[29677  8568]
 [ 3057  8698]]
AUC= 0.757955582999

//////////////////////////////////////////////////////

However, I am getting some very strange-looking ROC curves with triangular shapes instead of jagged, rounded curves. Is there any explanation for why I am getting this shape? Have I made a mistake somewhere?


Code:

    all_dict = []
    for i in range(0, len(my_dict)):
        temp_dict = dict(my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
            + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items()
            + my_dict9[i].items() + my_dict10[i].items() + my_dict11[i].items() + my_dict12[i].items()
            + my_dict13[i].items() + my_dict14[i].items() + my_dict15[i].items() + my_dict16[i].items()
            )
        all_dict.append(temp_dict)

    newX = dv.fit_transform(all_dict)

    # Separate the training and testing data sets
    half_cut = int(len(df)/2.0)*-1
    X_train = newX[:half_cut]
    X_test = newX[half_cut:]
    y_train = y[:half_cut]
    y_test = y[half_cut:]

    # Fitting X and y into model, using training data
    #$$
    lr.fit(X_train, y_train)

    # Making predictions using trained data
    #$$
    y_train_predictions = lr.predict(X_train)
    #$$
    y_test_predictions = lr.predict(X_test)

    #print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
    print 'Accuracy:',(y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])

    print 'Classification report:'
    print classification_report(y_test, y_test_predictions)
    #print sk_confusion_matrix(y_train, y_train_predictions)
    print 'Confusion matrix:'
    print sk_confusion_matrix(y_test, y_test_predictions)

    #print y_test[1:20]
    #print y_test_predictions[1:20]

    #print y_test[1:10]
    #print np.bincount(y_test)
    #print np.bincount(y_test_predictions)

    # Find and plot AUC
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_predictions)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print 'AUC=',roc_auc

    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.xlim([-0.1,1.2])
    plt.ylim([-0.1,1.2])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

Solution

  • The problem is the second argument you pass to roc_curve. According to the documentation:

    y_score : array, shape = [n_samples]
    
        Target scores, can either be probability estimates of the positive class or confidence values.
    

    Thus at this line:

    roc_curve(y_test, y_test_predictions)
    

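    You should pass the output of decision_function (or the positive-class column of predict_proba, i.e. lr.predict_proba(X_test)[:, 1]) to roc_curve instead of the hard 0/1 predictions. With hard predictions there is only one operating point, so the curve consists of just (0, 0), (FPR, TPR) and (1, 1), which is why it looks like a triangle.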

    Look at these examples http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#example-model-selection-plot-roc-py

    http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py
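
To make the difference concrete, here is a minimal sketch rather than the asker's actual pipeline. It assumes a recent scikit-learn and Python 3, and it uses a synthetic dataset from make_classification as a stand-in for the real features. The helper triangle_auc is a hypothetical name introduced only for illustration. It computes the area under the three-point curve you get from a single confusion matrix, and when applied to the two confusion matrices in the question it should reproduce the reported AUC values (approximately 0.9447 and 0.7580). That suggests those numbers come from the triangle, not from a ROC curve built on scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's data (an assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hard 0/1 labels: only one operating point, so the "curve" is a triangle.
y_pred = lr.predict(X_test)
fpr_hard, tpr_hard, _ = roc_curve(y_test, y_pred)
auc_hard = auc(fpr_hard, tpr_hard)

# Continuous scores: roc_curve can sweep many thresholds.
y_score = lr.predict_proba(X_test)[:, 1]  # or lr.decision_function(X_test)
fpr_soft, tpr_soft, _ = roc_curve(y_test, y_score)
auc_soft = auc(fpr_soft, tpr_soft)

print("ROC points with hard labels:", len(fpr_hard))
print("ROC points with scores:", len(fpr_soft))
print("AUC (hard labels): %.4f" % auc_hard)
print("AUC (scores): %.4f" % auc_soft)


def triangle_auc(tn, fp, fn, tp):
    """Area under the three-point ROC built from one confusion matrix."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Trapezoid rule over (0, 0) -> (fpr, tpr) -> (1, 1).
    return (1 + tpr - fpr) / 2


# Confusion matrices copied from the question, laid out as [[tn, fp], [fn, tp]].
print("fr vs non-fr:", triangle_auc(33229, 1318, 1119, 14334))
print("en vs non-en:", triangle_auc(29677, 8568, 3057, 8698))
```

In the question's code, the corresponding change would be to compute something like lr.predict_proba(X_test)[:, 1] and pass that to roc_curve in place of y_test_predictions. The accuracy, classification report and confusion matrix should still use the hard predictions.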