Tags: r, for-loop, predict, sliding-window

Generating a sliding window to subset data for a prediction task


I want to write a sliding-window function that uses a model trained on years t, t+1, and t+2 to predict the outcome in year t+3. For 10 years of data, the function should therefore create 7 train-test splits and make 7 predictions (for years t+3, t+4, t+5, t+6, t+7, t+8, and t+9).

I came up with the following code, but the result is not what I expected. Not only do the object lengths differ, but when I work through the prediction task manually, the predict function generates 3 predicted values for a single year's outcome, which doesn't make sense. I would be grateful if someone could point out the source of the errors.

# generate the data
set.seed(123)
df <- data.frame(year = 2000:2009,  # T = 10
           y = c(1, 1, 1, 1, 0, 0, 1, 0, 0, 0), 
           var1 = runif(10, min=0, max=1), 
           var2 = runif(10, min=1, max=2))

# store predicted values in a list
pred <- list()

# loop from the 1st year to the T-3 year

for(i in 2000:2007){
 df_sub1 <- subset(df, year == c(i, i+1, i+2)) 
 mod <- glm(y~var1+var2, data=df_sub1, family=binomial())
 df_sub2 <- subset(df, year == i+3)
 pred[[i]] <- predict(mod, data=df_sub2, type = "response")
}


# error message
Error in family$linkfun(mustart) : 
  Argument mu must be a nonempty numeric vector
In addition: Warning messages:
1: In year == c(i, i + 1, i + 2) :
  longer object length is not a multiple of shorter object length
2: In year == c(i, i + 1, i + 2) :
  longer object length is not a multiple of shorter object length


Solution

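  • The error and warnings come from using == when the right-hand side has length > 1: the shorter vector is recycled and compared element by element, so some windows match no rows and glm() is fitted on empty data. Use %in% instead. Also pass the test row to predict() as newdata; predict() has no data argument, so data= is silently ignored and the fitted values for the 3 training rows are returned, which explains the 3 predicted values. For the final window (2007) there is no year t+3 in the data, so its entry in the output is NA.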

    pred <- vector('list', 8)
    names(pred) <- 2000:2007
    for(i in 2000:2007){
     df_sub1 <- subset(df, year %in% c(i, i+1, i+2)) 
     mod <- glm(y~var1+var2, data=df_sub1, family=binomial())
     df_sub2 <- subset(df, year == (i+3))
     pred[[as.character(i)]] <- tryCatch(predict(mod,
         newdata=df_sub2, type = "response"), error = function(e) NA_real_)
    }
    

    -output

    > pred
    $`2000`
    4 
    1 
    
    $`2001`
    5 
    1 
    
    $`2002`
    6 
    1 
    
    $`2003`
               7 
    2.220446e-16 
    
    $`2004`
            8 
    0.1467543 
    
    $`2005`
              9 
    0.001408577 
    
    $`2006`
              10 
    2.220446e-16 
    
    $`2007`
    [1] NA
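
As a rough sketch of the same idea, the snippet below first reproduces the == recycling problem for the 2001 window and then wraps the corrected loop in a function. The names sliding_predict and width are hypothetical (they do not appear in the question or answer), and the glm() warnings are suppressed only because 3-row windows are often perfectly separated. It restricts the start years to those whose target year exists, which gives the 7 predictions the question asks for.

```r
# Same data as in the question
set.seed(123)
df <- data.frame(year = 2000:2009,
                 y = c(1, 1, 1, 1, 0, 0, 1, 0, 0, 0),
                 var1 = runif(10, min = 0, max = 1),
                 var2 = runif(10, min = 1, max = 2))

# Why `==` fails: c(2001, 2002, 2003) is recycled along `year` and compared
# position by position, so no row matches; %in% tests set membership instead.
eq_rows <- suppressWarnings(which(df$year == c(2001, 2002, 2003)))
in_rows <- which(df$year %in% c(2001, 2002, 2003))

# Hypothetical helper: fit on `width` consecutive years, predict the next one
sliding_predict <- function(data, width = 3) {
  # keep only start years whose target year (start + width) exists in the data
  starts <- data$year[(data$year + width) %in% data$year]
  preds <- vapply(starts, function(i) {
    train <- data[data$year %in% (i + seq_len(width) - 1), ]
    test  <- data[data$year == i + width, ]
    # tiny, often perfectly separated samples make glm() warn; silenced here
    mod <- suppressWarnings(
      glm(y ~ var1 + var2, data = train, family = binomial())
    )
    # `newdata` (not `data`) is what makes predict() score the held-out row
    unname(suppressWarnings(predict(mod, newdata = test, type = "response")))
  }, numeric(1))
  names(preds) <- starts + width   # label each prediction by its target year
  preds
}

pred <- sliding_predict(df)
```

Here eq_rows is empty while in_rows holds rows 2 to 4, and pred is a named numeric vector with one predicted probability per target year from 2003 to 2009. Labelling by target year rather than start year is a design choice of this sketch; the answer above labels each entry by the first training year.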