I'm using StratifiedShuffleSplit to repeat the procedure of splitting the dataset, fitting, predicting, and computing a metric. Could you please explain why it gives the same result for every split?
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report

clf = RandomForestClassifier(max_depth=5)
df = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/BigData/main/cll_dataset.csv")
X, y = df.iloc[:, 1:], df.iloc[:, 0]

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0).split(X, y)
for train_ind, test_ind in sss:
    # the splitter yields positional indices, so index with iloc
    X_train, X_test = X.iloc[train_ind], X.iloc[test_ind]
    y_train, y_test = y.iloc[train_ind], y.iloc[test_ind]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    report = classification_report(y_test, y_pred, zero_division=0, output_dict=True)
    report = pd.DataFrame(report).T
    report = report.iloc[:2]  # keep only the per-class rows
    print(report)
The result is:
precision recall f1-score support
0 0.75 1.0 0.857143 6.0
1 0.00 0.0 0.000000 2.0
precision recall f1-score support
0 0.75 1.0 0.857143 6.0
1 0.00 0.0 0.000000 2.0
precision recall f1-score support
0 0.75 1.0 0.857143 6.0
1 0.00 0.0 0.000000 2.0
precision recall f1-score support
0 0.75 1.0 0.857143 6.0
1 0.00 0.0 0.000000 2.0
precision recall f1-score support
0 0.75 1.0 0.857143 6.0
1 0.00 0.0 0.000000 2.0
Every model you build predicts class 0 for every sample, and because the split is stratified (each test set has the same proportion of class 0 and class 1 as the whole dataset, here 6 and 2 samples), every split produces exactly the same metrics: with all-zero predictions, class-0 precision is 6/8 = 0.75 and recall is 6/6 = 1.0 on every fold.
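You can check this directly. A minimal sketch, reusing y_pred and the train/test frames left over from the last loop iteration; the DummyClassifier comparison is my addition, not part of your code:

import numpy as np
from sklearn.dummy import DummyClassifier

# Only class 0 ever appears among the predicted labels:
print(np.unique(y_pred))  # expected: [0]

# The forest behaves exactly like a majority-class baseline:
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print((dummy.predict(X_test) == y_pred).all())  # True if it never predicts class 1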
The model obtains better accuracy by always predicting class 0 than by actually learning a pattern or rule. This is a serious class-imbalance problem. One way to address it is to oversample the minority class with SMOTE, available in the imbalanced-learn library, as sketched below.
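A minimal sketch, assuming the imbalanced-learn package is installed (imported as imblearn). k_neighbors is lowered from its default of 5 because the minority class has only about 6 training samples here, and SMOTE needs k_neighbors to be smaller than that count:

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=0, k_neighbors=3)  # default k_neighbors=5 barely fits 6 minority samples
for train_ind, test_ind in StratifiedShuffleSplit(
        n_splits=5, test_size=0.25, random_state=0).split(X, y):
    X_train, X_test = X.iloc[train_ind], X.iloc[test_ind]
    y_train, y_test = y.iloc[train_ind], y.iloc[test_ind]
    # Oversample the minority class in the training fold only;
    # the test fold must keep its real class distribution.
    X_res, y_res = sm.fit_resample(X_train, y_train)
    clf.fit(X_res, y_res)
    print(classification_report(y_test, clf.predict(X_test), zero_division=0))

Note that resampling must happen after the split: applying SMOTE to the full dataset first would leak synthetic copies of test samples into the training fold and inflate the scores.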