Tags: r, machine-learning, r-caret, knn

How do I avoid time leakage in my KNN model?


I am building a KNN model to predict housing prices. I'll go through my data and my model and then my problem.

Data -

# A tibble: 81,334 x 4
   latitude longitude close_date          close_price
      <dbl>     <dbl> <dttm>                    <dbl>
 1     36.4     -98.7 2014-08-05 06:34:00     147504.
 2     36.6     -97.9 2014-08-12 23:48:00     137401.
 3     36.6     -97.9 2014-08-09 04:00:40     239105.

Model -

library(caret)
library(dplyr)  # provides the %>% pipe used below
training.samples <- data$close_price %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]

model <- train(
  close_price~ ., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 10
)

My problem is time leakage. I am predicting the price of a house using other houses that closed after it, and in the real world I wouldn't have access to that information.

I want to apply a rule to the model that says, for each value y, only use houses that closed before the house for that y. I know I could split my test data and my train data on a certain date, but that doesn't quite do it.

Is it possible to prevent this time leakage, either in caret or other libraries for knn (like class and kknn)?


Solution

  • In caret, createTimeSlices implements a variation of cross-validation adapted to time series (it avoids time leakage by rolling the forecasting origin forward). See the caret documentation for createTimeSlices for details.

    In your case, depending on your precise needs, you could use something like this for a proper cross-validation:

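    library(caret)
    library(dplyr)

    # Sort chronologically so that earlier rows are earlier sales
    your_data <- your_data %>% arrange(close_date)

    # Rolling-origin resampling indices: a list with $train and $test.
    # initialWindow = 10 is illustrative; with ~81k rows a much larger
    # window (and the skip argument) keeps the number of resamples manageable.
    tr_ctrl <- createTimeSlices(
      your_data$close_price,
      initialWindow = 10,
      horizon = 1,
      fixedWindow = FALSE)

    model <- train(
      close_price ~ ., data = your_data, method = "knn",
      # createTimeSlices returns index lists, not a trainControl object,
      # so pass them through index / indexOut
      trControl = trainControl(index = tr_ctrl$train, indexOut = tr_ctrl$test),
      preProcess = c("center", "scale"),
      tuneLength = 10
    )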
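
To make the rolling origin concrete, here is a minimal sketch on a toy vector of six observations (the toy input and the name `slices` are illustrative, not from the question). Each training slice grows by one observation, and each test slice is the single observation that follows it:

```r
library(caret)

# Six time-ordered observations, expanding window, one-step-ahead test
slices <- createTimeSlices(1:6, initialWindow = 3, horizon = 1,
                           fixedWindow = FALSE)

# Training indices: 1:3, then 1:4, then 1:5
train_idx <- unname(slices$train)
# Test indices: 4, then 5, then 6
test_idx <- unlist(slices$test, use.names = FALSE)

# Every test index lies strictly after all of its training indices
no_leak <- all(mapply(function(tr, te) max(tr) < min(te),
                      slices$train, slices$test))
```

Because the rows are sorted by close_date first, "strictly after" in index terms means "closed no earlier" in time, which is why ties on the same date need the extra filtering step.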
    

    EDIT: if you have ties in the dates and want to avoid having deals closed on the same day in both the test and train sets, you can fix tr_ctrl before using it in train:

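    library(lubridate)  # for as_date()

    # Keep only training rows that closed strictly before the earliest test date
    filter_train <- function(i_tr, i_te) {
      d_tr <- as_date(your_data$close_date[i_tr])
      d_te <- as_date(your_data$close_date[i_te])
      tr_is_ok <- d_tr < min(d_te)

      i_tr[tr_is_ok]
    }

    # SIMPLIFY = FALSE keeps the result a list even if all slices end up the same length
    tr_ctrl$train <- mapply(filter_train, tr_ctrl$train, tr_ctrl$test, SIMPLIFY = FALSE)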
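
The rule stated in the question (for each house, only use houses that closed before it) can also be sketched directly in base R, without caret. This is a rough illustration under some assumptions: `knn_past_only` and the `toy` data are hypothetical, distances are plain Euclidean on the raw latitude/longitude (no centering or scaling, since computing those on the full data would itself use future rows), and the loop is quadratic in the number of rows, so it would be slow on 81k houses as written.

```r
# Hypothetical helper: predict each row's price as the mean price of its
# k nearest neighbours among rows that closed strictly earlier.
knn_past_only <- function(df, k = 3) {
  xy <- as.matrix(df[, c("latitude", "longitude")])
  pred <- rep(NA_real_, nrow(df))
  for (i in seq_len(nrow(df))) {
    # Candidate neighbours: only houses that closed before house i
    past <- which(df$close_date < df$close_date[i])
    if (length(past) < k) next  # not enough history: leave NA
    # Euclidean distance from house i to each earlier house
    diffs <- sweep(xy[past, , drop = FALSE], 2, xy[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- past[order(d)[seq_len(k)]]
    pred[i] <- mean(df$close_price[nearest])
  }
  pred
}

# Toy data (made up for illustration)
toy <- data.frame(
  latitude    = c(36.0, 36.1, 36.2, 36.15, 36.0),
  longitude   = c(-98.0, -98.1, -98.2, -98.15, -98.0),
  close_date  = as.POSIXct(c("2014-01-01", "2014-02-01", "2014-03-01",
                             "2014-04-01", "2014-05-01"), tz = "UTC"),
  close_price = c(100, 200, 300, 400, 500)
)

pred <- knn_past_only(toy, k = 2)
pred
# NA NA 150 250 150
```

The first two houses have fewer than k earlier sales, so they get no prediction. Every later house is predicted only from sales that closed before it, which is the constraint a single date-based train/test split does not enforce row by row.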