Tags: python, machine-learning, scikit-learn, cross-validation, text-classification

Cross Validation classification error


I am using the following code to get the classification results:

    import numpy as np
    from sklearn import linear_model, metrics
    from sklearn.cross_validation import KFold
    from sklearn.metrics import accuracy_score, confusion_matrix, auc

    folds = 5  # number of folds for the CV

    # Logistic Regression
    clf = linear_model.LogisticRegression(penalty='l1')
    kf = KFold(len(clas), n_folds=folds)
    fold = 1
    cms = np.array([[0, 0], [0, 0]])
    accs = []
    aucs = []
    for train_index, test_index in kf:
        X_train, X_test = docs[train_index], docs[test_index]
        y_train, y_test = clas2[train_index], clas2[test_index]
        clf.fit(X_train, y_train)
        prediction = clf.predict(X_test)
        acc = accuracy_score(y_test, prediction)
        cm = confusion_matrix(y_test, prediction)
        pred_probas = clf.predict_proba(X_test)[:, 1]
        fpr, tpr, thresholds = metrics.roc_curve(y_test, pred_probas)
        print('Test Accuracy for fold {}: {}\n{}'.format(fold, round(acc * 100, 2), cm))
        roc_auc = auc(fpr, tpr)
        print('AUC for fold {} : {}'.format(fold, round(roc_auc * 100, 2)))
        fold += 1
        cms += cm
        accs.append(acc)
        aucs.append(roc_auc)
    print('CV test accuracy: {}\n{}'.format(round(np.mean(accs) * 100, 2), cms))
    print('\nCV AUC: {}'.format(round(np.mean(aucs) * 100, 2)))
    print('\nCV accuracy: %.3f +/- %.3f' % (round(np.mean(accs) * 100, 2), round(np.std(accs) * 100, 2)))
    print('\nCV ROC AUC: %.3f +/- %.3f' % (round(np.mean(aucs) * 100, 2), round(np.std(aucs) * 100, 2)))
    print('\nPeak accuracy: ' + str(round(np.amax(accs) * 100, 2)))
    print('\nPeak ROC AUC: ' + str(round(np.amax(aucs) * 100, 2)))

I am not sure if I am doing something wrong. I have 2 classes, Yes = 406 and No = 139, and the code is giving me the following result:

Test Accuracy for fold 1: 87.16
[[94  9]
 [ 5  1]]
AUC for fold 1 : 66.1
Test Accuracy for fold 2: 92.66
[[100   6]
 [  2   1]]
AUC for fold 2 : 62.42
Test Accuracy for fold 3: 90.83
[[99  7]
 [ 3  0]]
AUC for fold 3 : 43.08
Test Accuracy for fold 4: 88.07
[[83  8]
 [ 5 13]]
AUC for fold 4 : 85.5
Test Accuracy for fold 5: 53.21
[[ 0  0]
 [51 58]]
AUC for fold 5 : nan
CV test accuracy: 82.39
[[376  30]
 [ 66  73]]

CV AUC: nan

CV accuracy: 82.390 +/- 14.720

CV ROC AUC: nan +/- nan

Peak accuracy: 92.66

Peak ROC AUC: nan
C:\Users\kkothari\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py:530: UndefinedMetricWarning: No negative samples in y_true, false positive value should be meaningless
  UndefinedMetricWarning)
C:\Users\kkothari\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py:95: RuntimeWarning: invalid value encountered in less
  if np.any(dx < 0):

Initially I had just 17 "No" docs and it was working fine. Can someone point out the mistake or explain what is going on?


Solution

  • Basically you have one very small class, and in one of the splits the test fold did not get any samples of one class, which leads to the undefined-metric errors (ROC AUC is undefined when y_true contains only one class, hence the nan). You can use StratifiedKFold instead, which guarantees that each split contains roughly the same proportion of samples from each class.
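A minimal sketch of the stratified approach, using the modern sklearn.model_selection API (the question's KFold(n, n_folds=...) call is from the older sklearn.cross_validation module) and synthetic imbalanced data standing in for the asker's docs/clas2 arrays:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the asker's data: ~406/139 class imbalance.
X, y = make_classification(n_samples=545, weights=[0.745], random_state=0)

skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each test fold now contains both classes in roughly the original ratio,
    # so roc_curve/auc never see a single-class y_true.
    counts = np.bincount(y[test_idx])
    print('fold {}: class counts {}'.format(fold, counts))
```

Note that StratifiedKFold.split takes y as an argument, since it needs the labels to balance the folds; a plain KFold only slices on index position, which is what allowed fold 5 above to end up with a single class.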