Tags: python, pandas, scikit-learn, classification

Scikit train_test_split by an index


I have a pandas DataFrame indexed by date. Let's assume it runs from Jan-1 to Jan-30. I want to split this dataset into X_train, X_test, y_train, and y_test, but I don't want to mix the dates: the train and test samples should be divided at a certain date (or index). I'm trying:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

But when I check the values, I see the dates are mixed. I want to split my data as:

Jan-1 to Jan-24 to train and Jan-25 to Jan-30 to test (as test_size is 0.2, that makes 24 to train and 6 to test)

How can I do this?


Solution

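  • You should use

    X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, test_size=0.2, stratify=None)
    

    Setting random_state does not help here: it only seeds the shuffle, and with random_state=None the global numpy.random state is used, so the rows are still shuffled.

    The scikit-learn documentation for train_test_split notes that shuffle=False must be used together with stratify=None.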
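
As a rough illustration of the answer above, here is a minimal sketch on made-up data. The year 2021, the column name `feature`, and the target values are assumptions for the example only, not taken from the question. The last two lines show an alternative that is not part of the original answer: slicing the date index directly at a chosen cutoff, which may be closer to "split by an index" when the boundary is a specific date rather than a fraction.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 30 daily rows, Jan-1 to Jan-30 (the year is arbitrary).
dates = pd.date_range("2021-01-01", periods=30, freq="D")
X = pd.DataFrame({"feature": np.arange(30)}, index=dates)
y = pd.Series(np.arange(30) % 2, index=dates, name="target")

# shuffle=False keeps the row order, so the first 80% of rows become
# the training set and the last 20% become the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(X_train.index.min().date(), X_train.index.max().date())  # 2021-01-01 2021-01-24
print(X_test.index.min().date(), X_test.index.max().date())    # 2021-01-25 2021-01-30

# Alternative (not from the original answer): split at an explicit cutoff date.
# Label slicing on a DatetimeIndex includes both endpoints.
cutoff = pd.Timestamp("2021-01-24")
X_train_by_date = X.loc[:cutoff]
X_test_by_date = X.loc[cutoff + pd.Timedelta(days=1):]
```

Both approaches assume the DataFrame is already sorted by date; if it is not, calling `X.sort_index()` (and the same for `y`) first would be needed.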