Tags: python, machine-learning, scikit-learn, cross-validation, sampling

CV and undersampling on a test fold


I am a bit lost on building an ML classifier with imbalanced data (80:20). The dataset has 30 columns; the target is Label. I want to predict the majority class. I am trying to reproduce the following steps:

  • Split the data into train/test sets
  • Perform CV on the training set
  • Apply undersampling only on a test fold
  • After the model has been chosen with the help of CV, undersample the training set and train the classifier
  • Estimate the performance on the untouched test set (recall)
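
To make the undersampling bullets concrete, this is the kind of undersampling I have in mind (a hypothetical helper written in plain NumPy, not from any library):

```python
import numpy as np

def undersample(X, y, random_state=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Assumes a binary target; the class labels are taken from the data itself.
    """
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # keep n_min randomly chosen rows of each class
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

# toy data: 6 majority (-1) rows vs 2 minority (1) rows
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([-1, -1, -1, -1, -1, -1, 1, 1])
X_b, y_b = undersample(X_demo, y_demo)
print((y_b == -1).sum(), (y_b == 1).sum())  # 2 2
```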

What I have done is shown below:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    y = df['Label']
    X = df.drop('Label', axis=1)
    X.shape, y.shape

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
    X_train.shape, X_test.shape

    tree = DecisionTreeClassifier(max_depth=5)
    tree.fit(X_train, y_train)

    y_test_tree = tree.predict(X_test)
    y_train_tree = tree.predict(X_train)

    acc_train_tree = accuracy_score(y_train, y_train_tree)
    acc_test_tree = accuracy_score(y_test, y_test_tree)

I am unsure how to perform CV on the training set, apply undersampling on a test fold, and then undersample the training set and train the classifier. Are you familiar with these steps? If so, I would appreciate your help.

If I add CV on top of the same split and tree as above:

    # CV (same X_train, y_train and tree as in the snippet above)
    from sklearn.model_selection import cross_val_score, cross_val_predict
    from sklearn.metrics import classification_report, confusion_matrix

    scores = cross_val_score(tree, X_train, y_train, cv=3, scoring="accuracy")
    ypred = cross_val_predict(tree, X_train, y_train, cv=3)

    print(classification_report(y_train, ypred))
    accuracy_score(y_train, ypred)
    confusion_matrix(y_train, ypred)

I get this output:

                  precision    recall  f1-score   support

              -1       0.73      0.99      0.84       291
               1       0.00      0.00      0.00       105

        accuracy                           0.73       396
       macro avg       0.37      0.50      0.42       396
    weighted avg       0.54      0.73      0.62       396

I guess I have missed something in the code above or am doing something wrong.

Sample of data:

Have_0 Have_1 Have_2 Have_letters Label
1        0      1         1         1
0        0      0         1        -1 
1        1      1         1        -1
0        1      0         0         1
1        1      0         0         1
1        0      0         1        -1
1        0      0         0         1
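
For reproducibility, here is the sample above as a DataFrame:

```python
import pandas as pd

# the sample rows from the table above
df = pd.DataFrame({
    "Have_0":       [1, 0, 1, 0, 1, 1, 1],
    "Have_1":       [0, 0, 1, 1, 1, 0, 0],
    "Have_2":       [1, 0, 1, 0, 0, 0, 0],
    "Have_letters": [1, 1, 1, 0, 0, 1, 0],
    "Label":        [1, -1, -1, 1, 1, -1, 1],
})
print(df["Label"].value_counts())
```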

Solution

  • Generally, the best way to create a cross-validation set is to simulate your test data. In your case, if we are going to divide the data into three sets (train, cross-validation, test), the best approach is to create sets with the same proportion of true/false labels. That is what the following function does.

    import math
    import numpy as np

    X = DF[["Have_0", "Have_1", "Have_2", "Have_letters"]]
    y = DF["Label"]

    def create_cv(X, y):
        """Split X, y into train/test, preserving the true/false label proportion."""
        if type(X) != np.ndarray:
            X = X.values
            y = y.values

        test_size = 1 / 5
        proportion_of_true = y[y == 1].shape[0] / y.shape[0]
        num_test_samples = math.ceil(y.shape[0] * test_size)
        num_test_true_labels = math.floor(num_test_samples * proportion_of_true)
        num_test_false_labels = num_test_samples - num_test_true_labels

        # note: the labels here are 1 / -1, so the "false" mask is y == -1
        y_test = np.concatenate([y[y == -1][:num_test_false_labels], y[y == 1][:num_test_true_labels]])
        y_train = np.concatenate([y[y == -1][num_test_false_labels:], y[y == 1][num_test_true_labels:]])
        X_test = np.concatenate([X[y == -1][:num_test_false_labels], X[y == 1][:num_test_true_labels]], axis=0)
        X_train = np.concatenate([X[y == -1][num_test_false_labels:], X[y == 1][num_test_true_labels:]], axis=0)
        return X_train, X_test, y_train, y_test

    X_train, X_test, y_train, y_test = create_cv(X, y)
    X_train, X_crossv, y_train, y_crossv = create_cv(X_train, y_train)
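
As a side note, the same proportion-preserving split can be obtained with scikit-learn's built-in `train_test_split` and its `stratify` parameter (a toy sketch with made-up labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy labels with a 3:1 imbalance
y = np.array([-1] * 12 + [1] * 4)
X = np.arange(32).reshape(16, 2)

# stratify=y preserves the -1/1 proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12)
print((y_train == 1).mean(), (y_test == 1).mean())  # 0.25 0.25
```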

    Splitting this way gives us three sets (train, cross-validation, test) that all have the same proportion of true/false labels.
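
For the cross-validation-with-undersampling part of the question, one common approach is to undersample only the training portion of each fold and evaluate on the untouched fold. A sketch on synthetic stand-in data (`make_classification` replaces the real df, and the `undersample` helper is illustrative, not from any library):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# synthetic stand-in for the real data: roughly 80:20 imbalance
X, y = make_classification(n_samples=500, weights=[0.8], random_state=12)

def undersample(X, y, seed=0):
    # randomly drop majority rows so both classes are equally represented
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=12)
recalls = []
for train_idx, val_idx in skf.split(X, y):
    # undersample only the training fold; leave the validation fold untouched
    X_tr, y_tr = undersample(X[train_idx], y[train_idx])
    tree = DecisionTreeClassifier(max_depth=5, random_state=12)
    tree.fit(X_tr, y_tr)
    recalls.append(recall_score(y[val_idx], tree.predict(X[val_idx])))
print(round(float(np.mean(recalls)), 3))
```

After choosing the model this way, the final step is to undersample the full training set once, refit, and report recall on the held-out test set.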