Tags: r, regression, linear-regression, cross-validation, manual-testing

Calculating cross validation manually gives different result


Let's take data:

set.seed(42)
y <- rnorm(125)
x <- data.frame(runif(125), rexp(125))

I want to perform 2-fold cross-validation on it, so:

library(caret)
model <- train(y ~ .,
  data = cbind(y, x), method = "lm",
  trControl = trainControl(method = "cv", number = 2)
)
model 

Linear Regression 

125 samples
  2 predictor

No pre-processing
Resampling: Cross-Validated (2 fold) 
Summary of sample sizes: 63, 62 
Resampling results:

  RMSE      Rsquared     MAE      
  1.091108  0.002550859  0.8472947

Tuning parameter 'intercept' was held constant at a value of TRUE

I want to reproduce the RMSE value above manually, to be sure I understand cross-validation correctly.

My work so far

As I can see above, my sample was divided into two parts: 63 (first fold) and 62 (second fold).

# Train the first model on the first fold
model_1 <- lm(y[1:63] ~ ., data = x[1:63, ])
# Calculate RMSE for the first model on the held-out second fold
RMSE_1 <- RMSE(y[64:125], predict(model_1, newdata = x[64:125, ]))
# Train the second model on the second fold
model_2 <- lm(y[64:125] ~ ., data = x[64:125, ])
# Calculate RMSE for the second model on the held-out first fold
RMSE_2 <- RMSE(y[1:63], predict(model_1, newdata = x[1:63, ]))
mean(c(RMSE_1, RMSE_2))
[1] 1.023411

And my question is: why did I get a different RMSE? The difference is too big to be just estimation error; caret must be calculating it in another way. Do you have any idea what I'm doing differently?


Solution

  • The logic you are using is right, but there are two changes you need to make:

    1. Caret creates its own 2 folds of data for training. They won't be 1:63 and 64:125; caret generates them from the random seed (see the sketch just after this list).
    2. There's a typo in RMSE_2: it uses model_1 where it should be model_2.
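
    For reference, caret builds these folds internally with its createFolds() helper. The snippet below is only a minimal sketch of that idea; it is not guaranteed to reproduce the exact folds train() used, because train() may consume the random number stream differently.

    # illustrative only: creating 2 training folds with caret's createFolds()
    # (may not match the folds train() generated internally)
    set.seed(42)
    folds <- createFolds(y, k = 2, list = TRUE, returnTrain = TRUE)
    str(folds)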

    Here is the updated code:

    # the fold indexes are kept in this part of the model object (trial and error to find it haha)
    model$control$index
    f1 <- model$control$index[[1]]
    f2 <- model$control$index[[2]]
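    # note: model$control$index holds the *training* rows for each resample;
    # the held-out rows live in model$control$indexOut (see ?trainControl).
    # with 2-fold CV the training rows of one fold are exactly the held-out
    # rows of the other, so f1 and f2 are all we need here.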
    
    # re-do your calculations using the fold indexes (and fix the RMSE_2 typo)
    model_1 <- lm(y[f1] ~ ., data = x[f1, ])
    # Calculate RMSE for the first model on its held-out fold
    RMSE_1 <- RMSE(y[f2], predict(model_1, newdata = x[f2, ]))
    # Train the second model on the second fold
    model_2 <- lm(y[f2] ~ ., data = x[f2, ])
    # Calculate RMSE for the second model on its held-out fold
    RMSE_2 <- RMSE(y[f1], predict(model_2, newdata = x[f1, ]))
    
    # matches now
    mean(c(RMSE_1, RMSE_2))
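
    As a final cross-check, the train object also stores the per-resample metrics it computed, so you can compare them with the manual values directly. model$resample is a standard part of caret's train output; the columns shown depend on which metrics were computed (here RMSE, Rsquared, MAE).

    # per-fold metrics as computed by caret
    model$resample
    # averaging the per-fold RMSE reproduces the value printed by model
    mean(model$resample$RMSE)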