r, machine-learning, time-series, partitioning, r-caret

Train and test splits by unique dates, not observations


I am trying to train a random forest model in R. I have a time series containing information on multiple stocks per date, and have created a very simplified version of it:

library(dplyr)  # provides %>% and arrange()

Date <- rep(seq(as.Date("2009/01/01"), by = "day", length.out = 100), 10)
Name <- c(rep("Stock A", 100), rep("Stock B",100), rep("Stock C", 100), rep("Stock D", 100), rep("Stock E",100), rep("Stock F",100), rep("Stock G",100), rep("Stock H",100), rep("Stock I", 100), rep("Stock J", 100))
Class <- sample(1:10, 1000, replace=TRUE)

DF <- data.frame(Date, Name, Class)
DF <- DF %>% arrange(Date, Name)

It looks something like this:

        Date   Name    Class
1  2009-01-01  Stock A     5
2  2009-01-01  Stock B     2
3  2009-01-01  Stock C     4
4  2009-01-01  Stock D    10
5  2009-01-01  Stock E     7
6  2009-01-01  Stock F     3
...
11 2009-01-02  Stock A    10
12 2009-01-02  Stock B     8 
13 2009-01-02  Stock C     9

When using trainControl to split the data into training and testing periods, the split is made per observation, but I would like it to be made per unique day. This is what I have so far:

timecontrol <- DF %>% group_by(Date) %>% trainControl(
  method            = 'timeslice',
  initialWindow     = 10,
  horizon           = 5,
  skip              = 4,
  fixedWindow       = TRUE,
  returnData        = TRUE, 
  classProbs        = TRUE
)

fitRF <- train(Class ~ ., 
               data = DF,
               method = "ranger",
               tuneGrid = tunegrid,
               na.action = na.omit,
               trControl = timecontrol)

This gives me a training set of 10 observations, followed by 5 testing observations. I would, however, like each training set (and each testing set) to contain all observations from 10 unique days, so that one training set would be 10 days times the number of observations per day, with a skip between periods so that each testing period is on entirely new data (hence skip = 4).

The first training/test split should use the first 10 unique days of the data set for training and the following 5 unique days for testing. The second split should then be placed so that its test set covers the 5 days directly after the first test set.

Unlike the data set shown above, my real data set contains a different number of observations per day. It has 417497 observations but only 2482 unique dates, so being able to make the training/testing splits on the grouped dates makes a big difference.

Is there some way I can use trainControl to get the split that I need, or will I have to split all my data manually?


Solution

  • If I understand correctly, your goal is to create blocked time series cross-validation with dates as blocks.

    One approach is to use createTimeSlices on the unique dates (in order) and then map that back to your data set:

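    library(caret)  # createTimeSlices(), trainControl()
    library(dplyr)  # %>%, mutate(), filter(), pull()

    dates <- unique(DF$Date)  # already in order, because DF is sorted by Date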
    
    
    slices <- createTimeSlices(dates,
                               initialWindow = 10,
                               horizon = 5,
                               skip = 4,
                               fixedWindow = TRUE)
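
To make the effect of these arguments concrete, here is a minimal sketch that inspects the date-level slices before they are mapped back to rows. It assumes 100 consecutive days, as in the example data; the `dates` vector here is built directly rather than taken from `DF`.

```r
library(caret)

# 100 consecutive days, standing in for unique(DF$Date) from the example data
dates <- seq(as.Date("2009-01-01"), by = "day", length.out = 100)

slices <- createTimeSlices(dates,
                           initialWindow = 10,
                           horizon = 5,
                           skip = 4,
                           fixedWindow = TRUE)

# Positions in `dates` used by the first two resamples
slices$train[[1]]  # days 1 to 10
slices$test[[1]]   # days 11 to 15
slices$train[[2]]  # days 6 to 15
slices$test[[2]]   # days 16 to 20

# Number of resamples produced for 100 days
length(slices$train)  # 18
```

With skip = 4 the window moves forward 5 days at a time, so each test block starts directly after the previous one, which is the layout asked for in the question.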
    

    Map these slices back to the row indexes of your original data:

    slices <- lapply(slices, function(x){
      lapply(x, function(k){
        DF %>%
          mutate(n = 1:n()) %>%
          filter(Date %in% dates[k]) %>%
          pull(n)
      })
    })
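
Because the real data has a varying number of observations per day, it may be worth checking the mapping on unbalanced data. The following is a sketch under assumptions: `DF2`, `rows_per_day`, `date_slices` and `row_slices` are hypothetical names, the data is made up, and base R `which()` is swapped in for the dplyr pipeline above (it also returns integer row positions).

```r
library(caret)

set.seed(1)
# Hypothetical unbalanced data: between 1 and 5 rows per day over 30 days
all_days <- seq(as.Date("2009-01-01"), by = "day", length.out = 30)
rows_per_day <- sample(1:5, length(all_days), replace = TRUE)
DF2 <- data.frame(Date = rep(all_days, times = rows_per_day))
DF2$Class <- sample(1:10, nrow(DF2), replace = TRUE)

dates <- sort(unique(DF2$Date))
date_slices <- createTimeSlices(dates,
                                initialWindow = 10,
                                horizon = 5,
                                skip = 4,
                                fixedWindow = TRUE)

# Translate positions in `dates` into integer row numbers of DF2
row_slices <- lapply(date_slices, function(x) {
  lapply(x, function(k) which(DF2$Date %in% dates[k]))
})

# The first training set should hold every row of the first 10 days
length(unique(DF2$Date[row_slices$train[[1]]]))
length(row_slices$train[[1]]) == sum(rows_per_day[1:10])

# TRUE for a resample if any date appears in both its train and test rows
overlap <- mapply(function(tr, te) any(DF2$Date[tr] %in% DF2$Date[te]),
                  row_slices$train, row_slices$test)
any(overlap)
```

If the last line returns FALSE, no resample shares a date between its training and test rows, regardless of how many observations each day has.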
    

    So the first training data frame will be:

    DF[slices$train[[1]],]
    

    and the corresponding test data will be:

    DF[slices$test[[1]],]
    

    Now, when defining trainControl, use the obtained train and test indexes:

    tr <- trainControl(returnData = TRUE, 
                       classProbs = TRUE,
                       index = slices$train,
                       indexOut = slices$test)
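
For completeness, here is a hedged end-to-end sketch of passing such a control object to `train()`. It is not the asker's model: `method = "lm"` and the formula `Class ~ Name` are stand-ins chosen only so the example runs without extra packages, and `classProbs` is left out because `Class` is numeric here. For the ranger model from the question, the same `trControl` argument would be used.

```r
library(caret)

set.seed(42)
# Same layout as the example data: 10 stocks observed on 100 days
Date  <- rep(seq(as.Date("2009-01-01"), by = "day", length.out = 100), 10)
Name  <- factor(rep(paste("Stock", LETTERS[1:10]), each = 100))
Class <- sample(1:10, 1000, replace = TRUE)
DF <- data.frame(Date, Name, Class)
DF <- DF[order(DF$Date, DF$Name), ]
rownames(DF) <- NULL

# Slice the unique dates, then map each slice to integer row positions
dates  <- unique(DF$Date)
slices <- createTimeSlices(dates,
                           initialWindow = 10,
                           horizon = 5,
                           skip = 4,
                           fixedWindow = TRUE)
slices <- lapply(slices, function(x) {
  lapply(x, function(k) which(DF$Date %in% dates[k]))
})

tr <- trainControl(index = slices$train,
                   indexOut = slices$test)

# "lm" is an assumption made for illustration, not the model from the question
fit <- train(Class ~ Name, data = DF, method = "lm", trControl = tr)

# One row of performance metrics per date-based resample
nrow(fit$resample)
```

The `index` and `indexOut` lists must contain integer row positions, which both `which()` and the `mutate(n = 1:n())` approach above produce.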
    

    Data:

    Date <- rep(seq(as.Date("2009/01/01"), by = "day", length.out = 100), 10)
    Name <- c(rep("Stock A", 100), rep("Stock B",100), rep("Stock C", 100), rep("Stock D", 100), rep("Stock E",100), rep("Stock F",100), rep("Stock G",100), rep("Stock H",100), rep("Stock I", 100), rep("Stock J", 100))
    Class <- sample(1:10, 1000, replace=TRUE)
    
    DF <- data.frame(Date, Name, Class)
    DF <- DF %>% arrange(Date, Name)