Tags: python, machine-learning, scikit-learn, cross-validation, text-classification

Cross Validation classification error


I am using the following code to get the classification results:

    import numpy as np
    from sklearn import linear_model, metrics
    from sklearn.cross_validation import KFold
    from sklearn.metrics import accuracy_score, confusion_matrix, auc

    folds = 5  # number of folds for the CV

    # Logistic Regression
    clf = linear_model.LogisticRegression(penalty='l1')
    kf = KFold(len(clas), n_folds=folds)
    fold = 1
    cms = np.array([[0, 0], [0, 0]])
    accs = []
    aucs = []
    for train_index, test_index in kf:
        X_train, X_test = docs[train_index], docs[test_index]
        y_train, y_test = clas2[train_index], clas2[test_index]
        clf.fit(X_train, y_train)
        prediction = clf.predict(X_test)
        acc = accuracy_score(y_test, prediction)
        cm = confusion_matrix(y_test, prediction)
        pred_probas = clf.predict_proba(X_test)[:, 1]
        fpr, tpr, thresholds = metrics.roc_curve(y_test, pred_probas)
        print('Test Accuracy for fold {}: {}\n{}'.format(fold, round(acc * 100, 2), cm))
        roc_auc = auc(fpr, tpr)
        print('AUC for fold {} : {}'.format(fold, round(roc_auc * 100, 2)))
        fold += 1
        cms += cm
        accs.append(acc)
        aucs.append(roc_auc)
    print('CV test accuracy: {}\n{}'.format(round(np.mean(accs) * 100, 2), cms))
    print('\nCV AUC: {}'.format(round(np.mean(aucs) * 100, 2)))
    print('\nCV accuracy: %.3f +/- %.3f' % (round(np.mean(accs) * 100, 2), round(np.std(accs) * 100, 2)))
    print('\nCV ROC AUC: %.3f +/- %.3f' % (round(np.mean(aucs) * 100, 2), round(np.std(aucs) * 100, 2)))
    print('\nPeak accuracy: ' + str(round(np.amax(accs) * 100, 2)))
    print('\nPeak ROC AUC: ' + str(round(np.amax(aucs) * 100, 2)))

I am not sure if I am doing something wrong. I have 2 classes, Yes = 406 and No = 139, and the code is giving me the following result:

Test Accuracy for fold 1: 87.16
[[94  9]
 [ 5  1]]
AUC for fold 1 : 66.1
Test Accuracy for fold 2: 92.66
[[100   6]
 [  2   1]]
AUC for fold 2 : 62.42
Test Accuracy for fold 3: 90.83
[[99  7]
 [ 3  0]]
AUC for fold 3 : 43.08
Test Accuracy for fold 4: 88.07
[[83  8]
 [ 5 13]]
AUC for fold 4 : 85.5
Test Accuracy for fold 5: 53.21
[[ 0  0]
 [51 58]]
AUC for fold 5 : nan
CV test accuracy: 82.39
[[376  30]
 [ 66  73]]

CV AUC: nan

CV accuracy: 82.390 +/- 14.720

CV ROC AUC: nan +/- nan

Peak accuracy: 92.66

Peak ROC AUC: nan
C:\Users\kkothari\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py:530: UndefinedMetricWarning: No negative samples in y_true, false positive value should be meaningless
  UndefinedMetricWarning)
C:\Users\kkothari\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py:95: RuntimeWarning: invalid value encountered in less
  if np.any(dx < 0):

Initially I had just 17 "No" docs and it was working fine. Can someone point out the mistake or explain what is going on?


Solution

  • Basically you have one very small class, and in one of the splits the test fold did not get any samples of one class, which leads to the undefined-metric errors (ROC AUC is undefined when y_true contains only one class, hence the nan). You can use StratifiedKFold instead, which guarantees that each split contains roughly the same proportion of samples from each class.
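A minimal sketch of the stratified approach, using the modern sklearn.model_selection API (the question's KFold(n, n_folds=...) call is from the older sklearn.cross_validation module) and synthetic imbalanced data standing in for the asker's docs/clas2 arrays:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the asker's data: ~406/139 class imbalance.
X, y = make_classification(n_samples=545, weights=[0.745], random_state=0)

skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each test fold now contains both classes in roughly the original ratio,
    # so roc_curve/auc never see a single-class y_true.
    counts = np.bincount(y[test_idx])
    print('fold {}: class counts {}'.format(fold, counts))
```

Note that StratifiedKFold.split takes y as an argument, since it needs the labels to balance the folds; a plain KFold only slices on index position, which is what allowed fold 5 above to end up with a single class.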