python pandas scikit-learn classification xgboost

stratify data without train_test_split shuffle


I'm trying to do binary classification on stock market data.

Since it is time-series data, I don't want to shuffle it.

I would like to stratify the data without shuffling it.

sklearn's train_test_split supports stratify only when shuffle=True.

[See documentation: If shuffle=False then stratify must be None.]
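To make the restriction concrete, here is a minimal sketch that reproduces it on toy data. The arrays X and y are made up for illustration and stand in for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 ordered samples with alternating labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Asking for stratification while disabling shuffling is rejected.
try:
    train_test_split(X, y, test_size=0.2, shuffle=False, stratify=y)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```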

Is there any alternative?

Note: My model uses the XGBoost algorithm.

Also note: I don't want to use the train_test_split function. I already split the data manually, like this:

import math

split = math.floor(9 * len(df) / 10)  # first 90% of rows go to training
train_df = df.iloc[:split]
test_df = df.iloc[split:]
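A quick way to see whether this chronological split actually needs stratifying is to compare the class proportions of the two parts. The sketch below assumes a label column named "target" and uses a small made-up DataFrame. Both are placeholders, not the real data.

```python
import math
import pandas as pd

# Hypothetical stand-in for the real data; "target" is an assumed column name.
df = pd.DataFrame({"close": range(20), "target": [0, 1, 1, 0, 1] * 4})

# Same 90/10 chronological split as above, with no shuffling.
split = math.floor(9 * len(df) / 10)
train_df = df.iloc[:split]
test_df = df.iloc[split:]

# Share of positive labels in each part; a large gap suggests imbalance.
train_ratio = train_df["target"].mean()
test_ratio = test_df["target"].mean()
print(f"train positives: {train_ratio:.2f}, test positives: {test_ratio:.2f}")
```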

Solution

  • Have you tried using StratifiedKFold? You can set the parameter shuffle=False. It will generate train and test indices for the number of folds you specify.

    Here is the documentation link

    https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html?highlight=stratified#sklearn.model_selection.StratifiedKFold

    This may help
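As a rough sketch of the suggestion above, the following runs StratifiedKFold with shuffle=False on toy data. The arrays X and y are invented for illustration. One thing worth checking for time-series data: within a fold, the training indices can include rows that come after the test rows, so this preserves class proportions and row order within each index array, but not a strict past-to-future split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 20 ordered samples, 15 of class 0 and 5 of class 1.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 0, 0, 1] * 5)

skf = StratifiedKFold(n_splits=5, shuffle=False)

folds = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    folds.append((train_idx, test_idx))
    # Each test fold keeps roughly the overall class ratio.
    print(fold, test_idx, "positives in test:", int(y[test_idx].sum()))
```

With a pandas DataFrame, the same indices could be applied with df.iloc[train_idx] and df.iloc[test_idx].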