I am using the following code to get the classification results:
import numpy as np
from sklearn import linear_model, metrics
from sklearn.cross_validation import KFold  # pre-0.18 API: KFold(n, n_folds=...)
from sklearn.metrics import accuracy_score, auc, confusion_matrix

folds = 5  # number of folds for the CV
# Logistic Regression
clf = linear_model.LogisticRegression(penalty='l1')
kf = KFold(len(clas), n_folds=folds)
fold = 1
cms = np.array([[0, 0], [0, 0]])  # running sum of the per-fold confusion matrices
accs = []
aucs = []
for train_index, test_index in kf:
    X_train, X_test = docs[train_index], docs[test_index]
    y_train, y_test = clas2[train_index], clas2[test_index]
    clf.fit(X_train, y_train)
    prediction = clf.predict(X_test)
    acc = accuracy_score(y_test, prediction)
    cm = confusion_matrix(y_test, prediction)
    # probability of the positive class, used for the ROC curve
    pred_probas = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = metrics.roc_curve(y_test, pred_probas)
    print('Test Accuracy for fold {}: {}\n{}'.format(fold, round(acc * 100, 2), cm))
    roc_auc = auc(fpr, tpr)
    print('AUC for fold {} : {}'.format(fold, round(roc_auc * 100, 2)))
    fold += 1
    cms += cm
    accs.append(acc)
    aucs.append(roc_auc)
print('CV test accuracy: {}\n{}'.format(round(np.mean(accs) * 100, 2), cms))
print('\nCV AUC: {}'.format(round(np.mean(aucs) * 100, 2)))
print('\nCV accuracy: %.3f +/- %.3f' % (round(np.mean(accs) * 100, 2), round(np.std(accs) * 100, 2)))
print('\nCV ROC AUC: %.3f +/- %.3f' % (round(np.mean(aucs) * 100, 2), round(np.std(aucs) * 100, 2)))
print('\nPeak accuracy: ' + str(round(np.amax(accs) * 100, 2)))
print('\nPeak ROC AUC: ' + str(round(np.amax(aucs) * 100, 2)))
I am not sure if I am doing something wrong, but I have 2 classes (Yes = 406, No = 139), and the code gives me the following result:
Test Accuracy for fold 1: 87.16
[[94 9]
[ 5 1]]
AUC for fold 1 : 66.1
Test Accuracy for fold 2: 92.66
[[100 6]
[ 2 1]]
AUC for fold 2 : 62.42
Test Accuracy for fold 3: 90.83
[[99 7]
[ 3 0]]
AUC for fold 3 : 43.08
Test Accuracy for fold 4: 88.07
[[83 8]
[ 5 13]]
AUC for fold 4 : 85.5
Test Accuracy for fold 5: 53.21
[[ 0 0]
[51 58]]
AUC for fold 5 : nan
CV test accuracy: 82.39
[[376 30]
[ 66 73]]
CV AUC: nan
CV accuracy: 82.390 +/- 14.720
CV ROC AUC: nan +/- nan
Peak accuracy: 92.66
Peak ROC AUC: nan
C:\Users\kkothari\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py:530: UndefinedMetricWarning: No negative samples in y_true, false positive value should be meaningless
UndefinedMetricWarning)
C:\Users\kkothari\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\metrics\ranking.py:95: RuntimeWarning: invalid value encountered in less
if np.any(dx < 0):
Initially I had just 17 No docs and it was working fine. Can someone point out my mistake or explain what is going on?
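Basically, in one of the splits the test set contains samples from only one class: fold 5 has no samples of the other class at all, which is what the "No negative samples in y_true" warning is telling you. With a single class in y_test the ROC curve is undefined, so the AUC for that fold is nan, and that nan propagates into the mean, standard deviation and peak AUC. KFold does not shuffle by default, so this is easy to hit when one class is much smaller than the other or when the data are ordered by class. You can use StratifiedKFold instead, which keeps the class proportions approximately the same in every fold, so each test set contains samples from both classes.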
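A minimal sketch of the StratifiedKFold suggestion, under some assumptions: it uses the newer sklearn.model_selection API (scikit-learn 0.18 and later) rather than the older KFold call shown in the question, the data are synthetic stand-ins with the same class sizes (406 and 139) sorted by class, and a plain LogisticRegression replaces the L1-penalized model to keep the example short. The names X, y and fold_counts are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (an assumption, not the question's docs/clas2):
# 406 samples of class 0 followed by 139 of class 1, i.e. ordered by class.
rng = np.random.RandomState(0)
y = np.array([0] * 406 + [1] * 139)
X = rng.normal(size=(len(y), 5)) + y[:, None]  # shift class 1 so it is learnable

clf = LogisticRegression(max_iter=1000)

# Stratified folds keep roughly the same class ratio in every test set.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

aucs = []
fold_counts = []  # per-fold count of each class in the test set
for fold, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    pred_probas = clf.predict_proba(X_test)[:, 1]
    # Both classes are present in y_test, so the AUC is defined.
    roc_auc = roc_auc_score(y_test, pred_probas)
    aucs.append(roc_auc)
    fold_counts.append(np.bincount(y_test, minlength=2))
    print('Fold {}: class counts {}, AUC {:.2f}'.format(
        fold, fold_counts[-1], roc_auc * 100))

print('CV ROC AUC: {:.2f} +/- {:.2f}'.format(
    np.mean(aucs) * 100, np.std(aucs) * 100))
```

The only change that matters relative to the loop in the question is the splitter: it is given the labels through skf.split(X, y), which is how it can balance the classes across folds.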