Tags: python, machine-learning, scikit-learn, cross-validation, sampling

CV and undersampling on a test fold


I am a bit lost on building an ML classifier with imbalanced data (80:20). The dataset has 30 columns; the target is Label. I want to predict the majority class. I am trying to reproduce the following steps:

  • Split the data into train/test sets
  • Perform CV on the training set
  • Apply undersampling only on a test fold
  • After the model has been chosen with the help of CV, undersample the training set and train the classifier
  • Estimate the performance on the untouched test set (recall)
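
To make the undersampling bullets concrete, this is the kind of undersampling I have in mind (a hypothetical helper written in plain NumPy, not from any library):

```python
import numpy as np

def undersample(X, y, random_state=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Assumes a binary target; the class labels are taken from the data itself.
    """
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # keep n_min randomly chosen rows of each class
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

# toy data: 6 majority (-1) rows vs 2 minority (1) rows
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([-1, -1, -1, -1, -1, -1, 1, 1])
X_b, y_b = undersample(X_demo, y_demo)
print((y_b == -1).sum(), (y_b == 1).sum())  # 2 2
```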

What I have done is shown below:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    y = df['Label']
    X = df.drop('Label', axis=1)
    X.shape, y.shape

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
    X_train.shape, X_test.shape

    tree = DecisionTreeClassifier(max_depth=5)
    tree.fit(X_train, y_train)

    y_test_tree = tree.predict(X_test)
    y_train_tree = tree.predict(X_train)

    acc_train_tree = accuracy_score(y_train, y_train_tree)
    acc_test_tree = accuracy_score(y_test, y_test_tree)

I am unsure how to perform CV on the training set, apply undersampling on a test fold, and then undersample the training set and train the classifier. Are you familiar with these steps? If so, I would appreciate your help.

If I add CV on top of the same split and tree as above:

    # CV (same X_train, y_train and tree as in the snippet above)
    from sklearn.model_selection import cross_val_score, cross_val_predict
    from sklearn.metrics import classification_report, confusion_matrix

    scores = cross_val_score(tree, X_train, y_train, cv=3, scoring="accuracy")
    ypred = cross_val_predict(tree, X_train, y_train, cv=3)

    print(classification_report(y_train, ypred))
    accuracy_score(y_train, ypred)
    confusion_matrix(y_train, ypred)

I get this output:

                  precision    recall  f1-score   support

              -1       0.73      0.99      0.84       291
               1       0.00      0.00      0.00       105

        accuracy                           0.73       396
       macro avg       0.37      0.50      0.42       396
    weighted avg       0.54      0.73      0.62       396

I guess I have missed something in the code above or am doing something wrong.

Sample of data:

Have_0 Have_1 Have_2 Have_letters Label
1        0      1         1         1
0        0      0         1        -1 
1        1      1         1        -1
0        1      0         0         1
1        1      0         0         1
1        0      0         1        -1
1        0      0         0         1
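
For reproducibility, here is the sample above as a DataFrame:

```python
import pandas as pd

# the sample rows from the table above
df = pd.DataFrame({
    "Have_0":       [1, 0, 1, 0, 1, 1, 1],
    "Have_1":       [0, 0, 1, 1, 1, 0, 0],
    "Have_2":       [1, 0, 1, 0, 0, 0, 0],
    "Have_letters": [1, 1, 1, 0, 0, 1, 0],
    "Label":        [1, -1, -1, 1, 1, -1, 1],
})
print(df["Label"].value_counts())
```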

Solution

  • Generally, the best way to create a cross-validation set is to simulate your test data. In your case, if we are going to divide the data into three sets (train, cross-validation, test), the best approach is to create sets with the same proportion of true/false labels. That is what the following function does.

    import math
    import numpy as np

    X = DF[["Have_0", "Have_1", "Have_2", "Have_letters"]]
    y = DF["Label"]

    def create_cv(X, y):
        """Split X, y into train/test, preserving the true/false label proportion."""
        if type(X) != np.ndarray:
            X = X.values
            y = y.values

        test_size = 1 / 5
        proportion_of_true = y[y == 1].shape[0] / y.shape[0]
        num_test_samples = math.ceil(y.shape[0] * test_size)
        num_test_true_labels = math.floor(num_test_samples * proportion_of_true)
        num_test_false_labels = num_test_samples - num_test_true_labels

        # note: the labels here are 1 / -1, so the "false" mask is y == -1
        y_test = np.concatenate([y[y == -1][:num_test_false_labels], y[y == 1][:num_test_true_labels]])
        y_train = np.concatenate([y[y == -1][num_test_false_labels:], y[y == 1][num_test_true_labels:]])
        X_test = np.concatenate([X[y == -1][:num_test_false_labels], X[y == 1][:num_test_true_labels]], axis=0)
        X_train = np.concatenate([X[y == -1][num_test_false_labels:], X[y == 1][num_test_true_labels:]], axis=0)
        return X_train, X_test, y_train, y_test

    X_train, X_test, y_train, y_test = create_cv(X, y)
    X_train, X_crossv, y_train, y_crossv = create_cv(X_train, y_train)
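
As a side note, the same proportion-preserving split can be obtained with scikit-learn's built-in `train_test_split` and its `stratify` parameter (a toy sketch with made-up labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy labels with a 3:1 imbalance
y = np.array([-1] * 12 + [1] * 4)
X = np.arange(32).reshape(16, 2)

# stratify=y preserves the -1/1 proportion in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12)
print((y_train == 1).mean(), (y_test == 1).mean())  # 0.25 0.25
```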

    Splitting this way gives us three sets (train, cross-validation, test) that all have the same proportion of true/false labels.
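
For the cross-validation-with-undersampling part of the question, one common approach is to undersample only the training portion of each fold and evaluate on the untouched fold. A sketch on synthetic stand-in data (`make_classification` replaces the real df, and the `undersample` helper is illustrative, not from any library):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# synthetic stand-in for the real data: roughly 80:20 imbalance
X, y = make_classification(n_samples=500, weights=[0.8], random_state=12)

def undersample(X, y, seed=0):
    # randomly drop majority rows so both classes are equally represented
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=12)
recalls = []
for train_idx, val_idx in skf.split(X, y):
    # undersample only the training fold; leave the validation fold untouched
    X_tr, y_tr = undersample(X[train_idx], y[train_idx])
    tree = DecisionTreeClassifier(max_depth=5, random_state=12)
    tree.fit(X_tr, y_tr)
    recalls.append(recall_score(y[val_idx], tree.predict(X[val_idx])))
print(round(float(np.mean(recalls)), 3))
```

After choosing the model this way, the final step is to undersample the full training set once, refit, and report recall on the held-out test set.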