I have a data which include dates in sorted order.
I would like to split the given data to train and test set. However, I must to split the data in a way that the test have to be newer than the train set.
Please look at the given example:
Let's assume that we have data by dates:
1, 2, 3, ..., n.
The numbers from 1 to n represents the days.
I would like to split it to 20% from the data to be train set and 80% of the data to be test set.
Good results:
1) train set = 1, 2, 3, ..., 20
test set = 21, ..., 100
2) train set = 101, 102, ... 120
test set = 121, ... 200
My code:
train_size = 0.2
train_dataframe, test_dataframe = cross_validation.train_test_split(features_dataframe, train_size=train_size)
train_dataframe = train_dataframe.sort(["date"])
test_dataframe = test_dataframe.sort(["date"])
Does not work for me!
Any suggestions?
If you insist that all testing data be newer than all training data, then there is only one way to accomplish the desired 20/80 split.
n = features_dataframe.shape[0]
train_size = 0.2
features_dataframe = features_dataframe.sort_values('date')
train_dataframe = features_dataframe.iloc[:int(n * train_size)]
test_dataframe = features_dataframe.iloc[int(n * train_size):]
And there is nothing random about it.