I am a bit lost on building an ML classifier with imbalanced data (80:20). The dataset has 30 columns and the target is Label; I want to predict the majority class. What I have done so far is shown below:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

y = df['Label']
X = df.drop('Label', axis=1)
X.shape, y.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
X_train.shape, X_test.shape
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
y_test_tree = tree.predict(X_test)
y_train_tree = tree.predict(X_train)
acc_train_tree = accuracy_score(y_train,y_train_tree)
acc_test_tree = accuracy_score(y_test,y_test_tree)
I have some doubts about how to perform CV on the training set, apply undersampling within each fold, and train the classifier on the undersampled training data. Are you familiar with these steps? If so, I would appreciate your help.
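To make concrete what I am trying to achieve, here is a sketch of the loop I have in mind: stratified CV where only the training folds are undersampled and each validation fold is left untouched. The 1:1 undersampling ratio, the `undersample` helper, and the random toy data below are placeholders for my real DataFrame:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

rng = np.random.RandomState(12)
X = rng.rand(400, 30)                      # stand-in for my 30 feature columns
y = np.where(rng.rand(400) < 0.8, -1, 1)   # roughly 80:20 imbalance, labels -1/1

def undersample(X, y, rng):
    """Drop majority-class (-1) rows until both classes are the same size."""
    idx_maj = np.flatnonzero(y == -1)
    idx_min = np.flatnonzero(y == 1)
    keep_maj = rng.choice(idx_maj, size=len(idx_min), replace=False)
    keep = np.concatenate([keep_maj, idx_min])
    return X[keep], y[keep]

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=12)
for train_idx, val_idx in skf.split(X, y):
    # undersample ONLY the training fold; the validation fold keeps its
    # original class distribution
    X_tr, y_tr = undersample(X[train_idx], y[train_idx], rng)
    tree = DecisionTreeClassifier(max_depth=5).fit(X_tr, y_tr)
    print(classification_report(y[val_idx], tree.predict(X[val_idx])))
```

Is this the right general shape, or should the undersampling be handled differently?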
If I do as follows:
y = df['Label']
X = df.drop('Label',axis=1)
X.shape, y.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 12)
X_train.shape, X_test.shape
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
y_test_tree = tree.predict(X_test)
y_train_tree = tree.predict(X_train)
acc_train_tree = accuracy_score(y_train,y_train_tree)
acc_test_tree = accuracy_score(y_test,y_test_tree)
# CV
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix

scores = cross_val_score(tree, X_train, y_train, cv=3, scoring="accuracy")
ypred = cross_val_predict(tree, X_train, y_train, cv=3)
print(classification_report(y_train,ypred))
accuracy_score(y_train,ypred)
confusion_matrix(y_train,ypred)
I get this output:

              precision    recall  f1-score   support

          -1       0.73      0.99      0.84       291
           1       0.00      0.00      0.00       105

    accuracy                           0.73       396
   macro avg       0.37      0.50      0.42       396
weighted avg       0.54      0.73      0.62       396
I guess I have missed something in the code above or am doing something wrong.
Sample of data:

Have_0  Have_1  Have_2  Have_letters  Label
1       0       1       1              1
0       0       0       1             -1
1       1       1       1             -1
0       1       0       0              1
1       1       0       0              1
1       0       0       1             -1
1       0       0       0              1
Generally, the best way to create a cross-validation set is to simulate your test data. In your case, if we are going to divide your data into three sets (train, cross-validation, test), the best way to do it is to create sets with the same proportion of positive/negative labels. That's what I did in the following function.
import numpy as np
import math

X = df[["Have_0", "Have_1", "Have_2", "Have_letters"]]
y = df["Label"]

def create_cv(X, y):
    # work on plain NumPy arrays
    if type(X) != np.ndarray:
        X = X.values
        y = y.values
    test_size = 1 / 5
    # your labels are -1/1, so count the positive class
    proportion_of_true = y[y == 1].shape[0] / y.shape[0]
    num_test_samples = math.ceil(y.shape[0] * test_size)
    num_test_true_labels = math.floor(num_test_samples * proportion_of_true)
    num_test_false_labels = num_test_samples - num_test_true_labels
    # take the first slice of each class for the test set, the rest for training
    y_test = np.concatenate([y[y == -1][:num_test_false_labels], y[y == 1][:num_test_true_labels]])
    y_train = np.concatenate([y[y == -1][num_test_false_labels:], y[y == 1][num_test_true_labels:]])
    X_test = np.concatenate([X[y == -1][:num_test_false_labels], X[y == 1][:num_test_true_labels]], axis=0)
    X_train = np.concatenate([X[y == -1][num_test_false_labels:], X[y == 1][num_test_true_labels:]], axis=0)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = create_cv(X, y)
X_train, X_crossv, y_train, y_crossv = create_cv(X_train, y_train)
By doing so, we get sets with the following shapes (all with the same proportion of positive/negative labels):
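For reference, scikit-learn can produce splits with the same class proportions out of the box via the `stratify` argument of `train_test_split` (or `StratifiedKFold` for cross-validation), which is a useful sanity check on the hand-rolled function. The random toy data below is only illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = np.where(rng.rand(500) < 0.8, -1, 1)   # imbalanced labels, as in the question

# stratify=y keeps the -1/1 ratio (nearly) identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print((y_train == 1).mean(), (y_test == 1).mean())  # very close proportions
```

Unlike the function above, this also shuffles the rows, which avoids any ordering effects in the original data.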