
Why does "StratifiedShuffleSplit" give the same result for every split of the dataset?


I'm using StratifiedShuffleSplit to repeatedly split the dataset, fit, predict, and compute metrics. Could you please explain why it gives the same result for every split?

import csv
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import classification_report

clf = RandomForestClassifier(max_depth = 5)
df = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/BigData/main/cll_dataset.csv")
X, y = df.iloc[:, 1:], df.iloc[:, 0]
sss = StratifiedShuffleSplit(n_splits = 5, test_size = 0.25, random_state = 0).split(X, y)

for train_ind, test_ind in sss:
    X_train, X_test = X.loc[train_ind], X.loc[test_ind]
    y_train, y_test = y.loc[train_ind], y.loc[test_ind]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    report = classification_report(y_test, y_pred, zero_division = 0, output_dict = True)
    report = pd.DataFrame(report).T
    report = report[:2]
    print(report)

The result is

   precision  recall  f1-score  support
0       0.75     1.0  0.857143      6.0
1       0.00     0.0  0.000000      2.0
   precision  recall  f1-score  support
0       0.75     1.0  0.857143      6.0
1       0.00     0.0  0.000000      2.0
   precision  recall  f1-score  support
0       0.75     1.0  0.857143      6.0
1       0.00     0.0  0.000000      2.0
   precision  recall  f1-score  support
0       0.75     1.0  0.857143      6.0
1       0.00     0.0  0.000000      2.0
   precision  recall  f1-score  support
0       0.75     1.0  0.857143      6.0
1       0.00     0.0  0.000000      2.0

Solution

  • Every model you build predicts class 0 for every sample, and because the split is stratified (each test fold has the same proportion of class 0 and class 1 as the full dataset), every fold produces exactly the same classification report.

    With so little data, the model achieves higher accuracy by always predicting class 0 than by "learning" any real pattern. This is a serious problem. Some options to address it:

    • Tune some hyperparameters of the Random Forest algorithm (e.g. max_depth, min_samples_leaf).
    • Collect more data to obtain a bigger dataset: each test fold here contains only 8 samples, which is far too few (though new data may be hard to obtain).
    • Your data is imbalanced (more samples of class 0 than class 1); consider balancing it, e.g. by oversampling the minority class with SMOTE from the imbalanced-learn package.
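To see why the report is identical on every split: a stratified split fixes the class counts in each test fold, so a model that always predicts the majority class produces the same confusion matrix (and hence the same report) every time. A minimal sketch with synthetic labels, using the same 6:2 test-fold ratio as in the question:

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels with a 24:8 imbalance, mirroring the question's 6:2 test folds
y = np.array([0] * 24 + [1] * 8)
X = np.zeros((len(y), 3))  # features are irrelevant to the split itself

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for _, test_ind in sss.split(X, y):
    # Every test fold has exactly the same class counts: 6 zeros, 2 ones
    print(Counter(y[test_ind]))
```

The split indices differ between folds, but the class counts never do; combined with an all-zeros predictor, the metrics cannot change.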
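Before reaching for SMOTE (which lives in the separate imbalanced-learn package), scikit-learn's built-in `class_weight="balanced"` option is a cheaper first step: it reweights samples inversely to class frequency, penalising the "always predict class 0" shortcut. A sketch on synthetic data (`make_classification` stands in for the question's CSV, which is not assumed here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: ~80% class 0, ~20% class 1
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" upweights the minority class during training
clf = RandomForestClassifier(max_depth=5, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("minority-class predictions:", int((y_pred == 1).sum()))
```

On a dataset as tiny as the one in the question, though, no reweighting or resampling trick substitutes for collecting more data.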