Does the predict function in the caret package use future information when preprocessing?


My question is pretty simple, but I can't find a clear-cut answer in the caret package documentation. If I use the preprocessing options center and scale in my train function, the documentation states that the same preprocessing will be applied to a new data set when making predictions.

So when I use the predict function, are the mean and scale estimated from the training set applied to the new data? Or are a new centering and scaling computed from the new data set, which could use points from the future if the data are a time series (which is problematic)?

Thank you


Solution

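  • caret::predict.train applies the preprocessing parameters stored in the model you built (the centering and scaling values estimated from the training data) when predicting on the test set. It does not re-estimate them from the new data.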

    Here is a snippet from the source code showing that the preProc argument is taken from the fitted object's preProcess element:

    out <- predictionFunction(method = object$modelInfo, 
                modelFit = object$finalModel, newdata = newdata, 
                preProc = object$preProcess)
    

    You can see these parameters for yourself after creating your model by accessing object$preProcess. Here is a complete example:

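    rm(list=ls())   # start from a clean workspace
    library(caret)
    set.seed(4444)  # make the random partition reproducible
    
    # Split mtcars into training (about 75%) and testing sets
    data(mtcars)
    inTrain <- createDataPartition(y = mtcars$mpg, p = 0.75, list = FALSE)
    training <- mtcars[inTrain, ]
    testing <- mtcars[-inTrain, ]
    
    # Fit a linear model; centering and scaling are estimated from training only
    lmFit <- train(mpg ~ ., data = training, method = "lm", preProc = c("center", "scale"))
    
    # Inspect the preprocessing parameters stored with the fitted model
    lmFit$preProcess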
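
To make the distinction from the question concrete, here is a minimal sketch in base R of what "applying the training parameters to new data" means, compared with re-estimating them on the new data. The row-order split and the names train_x, test_x, train_mean, and train_sd are illustrative assumptions, not part of the example above. The caret comparison at the end runs only if caret is installed.

```r
# Sketch only: split mtcars by row order to mimic "past" and "future" data.
data(mtcars)
predictors <- setdiff(names(mtcars), "mpg")
train_x <- mtcars[1:24, predictors]
test_x  <- mtcars[25:32, predictors]

# Centering and scaling parameters estimated from the training rows only.
train_mean <- colMeans(train_x)
train_sd   <- apply(train_x, 2, sd)

# Leak-free transform: the new data is shifted and scaled with training values.
test_scaled <- scale(test_x, center = train_mean, scale = train_sd)

# For contrast: re-estimating on the new data uses its own mean and sd,
# which is the behaviour the question is worried about.
test_rescaled <- scale(test_x)

# The two transforms generally differ; print the largest discrepancy.
print(max(abs(test_scaled - test_rescaled)))

# Optional comparison with caret's standalone preProcess object, whose
# predict method applies the parameters stored at fitting time.
if (requireNamespace("caret", quietly = TRUE)) {
  pp <- caret::preProcess(train_x, method = c("center", "scale"))
  caret_scaled <- predict(pp, test_x)
  print(all.equal(as.matrix(caret_scaled), test_scaled,
                  check.attributes = FALSE))
}
```

For time series, the point is that test_scaled depends only on statistics computed from earlier rows, so no information from the prediction period enters the transformation.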