python pandas scikit-learn time-series cross-validation

How to create a train/test split of time-series data by year?

I want to cross-validate my time-series data and split by the year of the timestamp.

Here is the following data in a pandas dataframe:

mock_data

timestamp             counts
'2015-01-01 03:45:14' 4
     .
     .
     .
'2016-01-01 13:02:14' 12
     .
     .
     .
'2017-01-01 09:56:54' 6
     .
     .
     .
'2018-01-01 13:02:14' 8
     .
     .
     .
'2019-01-01 11:39:40' 24
     .
     .
     .
'2020-01-01 04:02:03' 30

mock_data.dtypes
timestamp object
counts    int64

Looking into the TimeSeriesSplit() function of scikit-learn, it does not appear that you can specify the n_split part by year. Is there another way that one can create successive training sets that result in the following train-test split?

tscv = newTimeSeriesSplit(n_splits=5, by='year')
>>> print(tscv)  
newTimeSeriesSplit(max_train_size=None, n_splits=5, by='year')
>>> for train_index, test_index in tscv.split(mock_data):
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
TRAIN: [2015] TEST: [2016]
TRAIN: [2015 2016] TEST: [2017]
TRAIN: [2015 2016 2017] TEST: [2018]
TRAIN: [2015 2016 2017 2018] TEST: [2019]
TRAIN: [2015 2016 2017 2018 2019] TEST: [2020]

Thanks for viewing!

Solution

Updated Response

Generic approach for data with arbitrary number of points in each year.

First, some data with a few years of data with differing numbers of points in each, per the example. This is similar in approach to the original answer.

import numpy as np
import pandas as pd

ts_2015 = pd.date_range('2015-01-01', '2015-12-31', periods=4).to_series()
ts_2016 = pd.date_range('2016-01-01', '2016-12-31', periods=12).to_series()
ts_2017 = pd.date_range('2017-01-01', '2017-12-31', periods=6).to_series()
ts_2018 = pd.date_range('2018-01-01', '2018-12-31', periods=8).to_series()
ts_2019 = pd.date_range('2019-01-01', '2019-12-31', periods=24).to_series()
ts_2020 = pd.date_range('2020-01-01', '2020-12-31', periods=30).to_series()
ts_all = pd.concat([ts_2015, ts_2016, ts_2017, ts_2018, ts_2019, ts_2020])

df = pd.DataFrame({'X': np.random.randint(0, 100, size=ts_all.shape), 
                   'Y': np.random.randint(100, 200, size=ts_all.shape)},
                 index=ts_all)
df['year'] = df.index.year
df = df.reset_index()

Now we create a list of the unique years to iterate over and a dict to store the various split dataframes.

year_list = df['year'].unique().tolist()
splits = {'train': [], 'test': []}

for idx, yr in enumerate(year_list[:-1]):
    train_yr = year_list[:idx+1]
    test_yr = [year_list[idx+1]]
    print('TRAIN: ', train_yr, 'TEST: ',test_yr)
    
    splits['train'].append(df.loc[df.year.isin(train_yr), :])
    splits['test'].append(df.loc[df.year.isin(test_yr), :])

Result:

TRAIN:  [2015] TEST:  [2016]
TRAIN:  [2015, 2016] TEST:  [2017]
TRAIN:  [2015, 2016, 2017] TEST:  [2018]
TRAIN:  [2015, 2016, 2017, 2018] TEST:  [2019]
TRAIN:  [2015, 2016, 2017, 2018, 2019] TEST:  [2020]

The split dataframes would look something like the following:

>>> splits['train'][0]

                index   X    Y  year
0 2015-01-01 00:00:00  20  127  2015
1 2015-05-02 08:00:00  25  197  2015
2 2015-08-31 16:00:00  61  185  2015
3 2015-12-31 00:00:00  75  144  2015

Original Response

It was pointed out to me that this approach would not work because it assumes that each year contains the same number of records.

Your intent is a little unclear, but I believe you want to do is to pass a dataframe with a timestamp index into a new version of the TimeSeriesSplit class that will yield n_split = n_years - 1 based on the number of years in your data. The TimeSeriesSplit class gives you the flexibility to do this, but you need to extract the year from your timestamp index first. The result doesn't quite look like what you've proposed, but the outcome is, I believe, what you want.

First some dummy data:

import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
    
ts_index = pd.date_range('2015-01-01','2020-12-31',freq='M')
df = pd.DataFrame({'X': np.random.randint(0, 100, size=ts_index.shape), 
                   'Y': np.random.randint(100, 200, size=ts_index.shape)},
                 index=ts_index)

Now a year for the TimeSeriesSplit to work on. Because we have to index into this thing by row number and pd.ix is deprecated, I reset the index from timestamp to numerical:

df['year'] = df.index.year
df = df.reset_index()

And then a TimeSeriesSplit instance with the correct number of splits (n_years - 1):

tscv = TimeSeriesSplit(n_splits=len(df['year'].unique()) - 1)

Now we can generate the indices. Instead of printing the indices, print the year column that corresponds and only print the unique years:

for train_idx, test_idx in tscv.split(df['year']):
    print('TRAIN: ', df.loc[df.index.isin(train_idx), 'year'].unique(), 
          'TEST: ', df.loc[df.index.isin(test_idx), 'year'].unique())

TRAIN:  [2015] TEST:  [2016]
TRAIN:  [2015 2016] TEST:  [2017]
TRAIN:  [2015 2016 2017] TEST:  [2018]
TRAIN:  [2015 2016 2017 2018] TEST:  [2019]
TRAIN:  [2015 2016 2017 2018 2019] TEST:  [2020]

You would of course access your training/test sets in a similar manner. If you really wanted to button this up nicely, you could extend the TimeSeriesSplit class and either customize the initialization or add some new methods.