I have built a glm model with the R package "caret" and I'd like to assess its performance using RMSE. The two RMSEs computed in the code below are different, and I wonder which one is the real RMSE.
Also, how can I extract the training data, test data, and predicted data (based on the optimal tuned parameter) for each fold (5*5=25 in total) from the model?
library(caret)

# Drop columns 8 and 9 (vs, am) from mtcars
mydata <- mtcars[, -c(8, 9)]

model_glm <- train(
  hp ~ .,
  data = mydata,
  method = "glm",
  metric = "RMSE",
  preProcess = c("center", "scale"),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE
  )
)

# RMSE of the final model on the full data set
GLM.pred <- predict(model_glm, subset(mydata, select = -hp))
RMSE(pred = GLM.pred, obs = mydata$hp) # 21.89

# RMSE reported by caret from the repeated cross-validation
model_glm$results$RMSE # 32.16
With the following code, I get:
sqrt(mean((mydata$hp - predict(model_glm)) ^ 2))
[1] 21.89127
This matches "RMSE(pred = GLM.pred, obs = mydata$hp)", which suggests that 21.89 is the real RMSE of the final model on the data it was fitted to.
Also, the resampling results stored in the model (model_glm$resample$RMSE) give
[1] 28.30254 34.69966 25.55273 25.29981 40.78493 31.91056 25.05311 41.83223 26.68105 23.64629 27.98388 25.98827 45.26982 37.28214
[15] 38.13617 31.14513 23.35353 42.05274 34.04761 35.17733 28.28838 35.89639 21.42580 45.17860 29.13998
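which is the RMSE for each of the 25 CV resamples (5 folds, 5 repeats). Averaging these 25 values gives about 32.165.
So, the 32.16 is the average of the RMSEs over the 25 CV resamples, each computed on the held-out part of the data. The 21.89 is the RMSE on the original dataset, which the final model was fitted to.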
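As for extracting the folds, here is a minimal sketch rather than a definitive recipe. It assumes you refit the model with savePredictions = "final" in trainControl, so that caret keeps the held-out predictions. The seed value and the names model_cv, train_sets, test_sets, preds_by_fold and rmse_by_fold are my own choices for illustration. Because the folds are drawn at random, the numbers will not reproduce the 32.16 above exactly.

```r
library(caret)

set.seed(42)  # arbitrary seed, only so the folds are reproducible
mydata <- mtcars[, -c(8, 9)]

model_cv <- train(
  hp ~ .,
  data = mydata,
  method = "glm",
  metric = "RMSE",
  preProcess = c("center", "scale"),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    savePredictions = "final"  # keep held-out predictions for the final model
  )
)

# Row indices used for fitting in each of the 25 resamples (Fold1.Rep1, ...)
train_idx <- model_cv$control$index
# Held-out rows are simply the rows not used for fitting
test_idx <- lapply(train_idx, function(i) setdiff(seq_len(nrow(mydata)), i))

# Training and test data of each resample, as named lists of data frames
train_sets <- lapply(train_idx, function(i) mydata[i, ])
test_sets  <- lapply(test_idx,  function(i) mydata[i, ])

# Held-out predictions, split by resample (columns include pred, obs, rowIndex)
preds_by_fold <- split(model_cv$pred, model_cv$pred$Resample)

# Per-resample RMSE recomputed by hand from the held-out predictions
rmse_by_fold <- sapply(preds_by_fold, function(d) sqrt(mean((d$obs - d$pred)^2)))

# The mean of the per-resample RMSEs should agree with model_cv$results$RMSE
mean(rmse_by_fold)
model_cv$results$RMSE
```

For example, train_sets[["Fold1.Rep1"]] and test_sets[["Fold1.Rep1"]] are the data used to fit and to evaluate the first resample, and preds_by_fold[["Fold1.Rep1"]] holds the corresponding predictions next to the observed values.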