pandas, scikit-learn, cross-validation, sklearn-pandas

Scikit-learn: train/test split for time series


I have data that includes dates in sorted order.

I would like to split the data into a train set and a test set. However, I must split it in such a way that every row in the test set is newer than every row in the train set.

Please look at the following example:

Let's assume that we have data by dates:

1, 2, 3, ..., n.

The numbers from 1 to n represent the days.

I would like to split it so that 20% of the data is the train set and 80% of the data is the test set.

Good results:

1) train set = 1, 2, 3, ..., 20
   test set = 21, ..., 100

2) train set = 101, 102, ..., 120
   test set = 121, ..., 200

My code:

train_size = 0.2
train_dataframe, test_dataframe = cross_validation.train_test_split(features_dataframe, train_size=train_size)                          

train_dataframe = train_dataframe.sort(["date"])
test_dataframe = test_dataframe.sort(["date"])

This does not work for me!

Any suggestions?


Solution

  • If you insist that all testing data be newer than all training data, then there is only one way to accomplish the desired 20/80 split: sort by date, then cut at the 20% mark.

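    # Sort chronologically first so that row positions follow the dates.
    features_dataframe = features_dataframe.sort_values('date')

    # Position of the first test row: the oldest 20% of rows go to training.
    train_size = 0.2
    split_index = int(len(features_dataframe) * train_size)

    train_dataframe = features_dataframe.iloc[:split_index]
    test_dataframe = features_dataframe.iloc[split_index:]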
    

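    And there is nothing random about it, so a splitter that shuffles the rows, as train_test_split does by default, is the wrong tool here. Sorting the two halves afterwards does not help, because the shuffle has already mixed old and new rows into both sets.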
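
As a self-contained check of the approach above, here is a minimal sketch on toy data. The helper name chronological_split, the "value" column and the generated dates are assumptions made for illustration; only the "date" column and the 20/80 ratio come from the question.

```python
import pandas as pd


def chronological_split(frame, train_size=0.2, date_column="date"):
    """Split so that the oldest rows form the train set (hypothetical helper)."""
    # Sort by date so that positional slicing respects time order.
    ordered = frame.sort_values(date_column).reset_index(drop=True)
    # Position of the first test row.
    split_index = int(len(ordered) * train_size)
    return ordered.iloc[:split_index], ordered.iloc[split_index:]


# Toy data: 100 consecutive days, deliberately shuffled before splitting.
features_dataframe = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=100, freq="D"),
    "value": range(100),
}).sample(frac=1, random_state=0)

train_dataframe, test_dataframe = chronological_split(features_dataframe)

# Report the sizes and whether the newest train row predates the oldest test row.
print(len(train_dataframe), len(test_dataframe))
print(train_dataframe["date"].max() < test_dataframe["date"].min())
```

If you would rather stay within scikit-learn, recent versions of sklearn.model_selection.train_test_split accept shuffle=False, which keeps the rows in their existing order; the data still has to be sorted by date first.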