My question is pretty simple, but I can't find a clear-cut answer in the caret package documentation. If I use the preprocessing options center and scale in my train function, the documentation states that the same preprocessing will be applied to a new data set when making predictions.
So when I use the predict function, does that mean the mean and standard deviation of the training set are applied to the new data? Or are a new centering and scaling computed from the new data set, thus potentially using points in the future if the data are a time series (which is problematic)?
Thank you
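caret::predict.train uses the parameters stored in the model you built to predict on the test set, so the centering and scaling values estimated on the training set are reused for the new data.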
Here is a snippet from the source code that shows the preProc data comes from the object's preProcess parameters:
out <- predictionFunction(method = object$modelInfo,
                          modelFit = object$finalModel,
                          newdata = newdata,
                          preProc = object$preProcess)
You can see these parameters for yourself after creating your model by accessing object$preProcess.
Here is a complete example:
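rm(list = ls())
library(caret)
set.seed(4444)
data(mtcars)

# Split mtcars into 75% training and 25% testing, stratified on mpg
inTrain <- createDataPartition(y = mtcars$mpg, p = 0.75, list = FALSE)
training <- mtcars[inTrain, ]
testing <- mtcars[-inTrain, ]

# Fit a linear model; centering and scaling are estimated on the training set
lmFit <- train(mpg ~ ., data = training, method = "lm",
               preProc = c("center", "scale"))

# Print the preprocessing object stored with the model
lmFit$preProcess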
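To make the difference between the two options in the question concrete, here is a minimal base-R sketch (no caret involved) of what reusing the training-set statistics amounts to, next to re-estimating them on the new data. The split via sample(), the choice of four continuous predictors, and all variable names are assumptions made for illustration; this is not caret's internal code.

```r
# Base-R sketch: scale new data with training-set statistics
# versus re-estimating the statistics on the new data.
data(mtcars)
set.seed(4444)

# Hypothetical split and predictor subset, chosen only for this illustration
predictors <- c("disp", "hp", "wt", "qsec")
train_idx  <- sample(nrow(mtcars), size = 24)
training   <- mtcars[train_idx, predictors]
testing    <- mtcars[-train_idx, predictors]

# Statistics estimated on the training set only
train_mean <- colMeans(training)
train_sd   <- apply(training, 2, sd)

# Option 1: reuse the training statistics on the new data
test_scaled <- scale(testing, center = train_mean, scale = train_sd)

# Option 2 (the one the question worries about): re-estimate on the new data
test_rescaled <- scale(testing)

# Re-estimated columns have mean 0 by construction;
# columns scaled with training statistics generally do not.
round(colMeans(test_scaled), 3)
round(colMeans(test_rescaled), 3)
```

With a time series, only the first option avoids using information from the new observations, since nothing is estimated from them.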