I would like to calculate the in-sample and out-of-sample predictive accuracy for certain metrics, all while using caret's k-fold cross validation.
So far I have got:
library(MASS)
library(leaps)
library(caret)
library(tidyverse)
full_df <- surgical
set.seed(123)
Performance_Summary <- function(data,
                                lev = NULL,
                                model = NULL) {
  c(RMSE = sqrt(mean((data$obs - data$pred)^2)),
    MAE = mean(data$obs - data$pred))
}
train.Control <- trainControl(method = "cv", number = 10, summaryFunction = Performance_Summary)
cv_linear_model <- train(y~., data = full_df, method = "lm", trControl = train.Control)
cv_linear_model
This should give me the average RMSE and MAE over the 10 out-of-sample (test) folds.
What I would like to do now is calculate the average RMSE and MAE over the 10 in-sample (training) folds.
Is this possible using the caret package? Or would I need to implement k-fold cross validation by hand to obtain the in-sample metrics?
Thanks for your help!
If you don't mind fitting the model twice, you can set the testing and training folds first. Using the example dataset BostonHousing, where medv is the dependent variable:
library(mlbench)
data(BostonHousing)
full_df = BostonHousing[1:400,]
# create folds: createFolds returns the test indices; the training indices are their complement
set.seed(111)
testFolds = createFolds(full_df$medv, k = 10)
trFolds = lapply(testFolds, function(i) setdiff(1:nrow(full_df), i))
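As a purely optional sanity check, you can confirm that the test folds partition the rows and that each training fold is disjoint from its corresponding test fold (this is just a quick sketch, nothing in caret requires it):
# optional check: the test folds should cover every row exactly once,
# and no training fold should overlap its test fold
stopifnot(identical(sort(unlist(testFolds, use.names = FALSE)), 1:nrow(full_df)))
stopifnot(all(mapply(function(tr, te) length(intersect(tr, te)) == 0, trFolds, testFolds)))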
There's an error with the MAE: it should be the mean of the absolute errors:
Performance_Summary <- function(data,
                                lev = NULL,
                                model = NULL) {
  c(RMSE = sqrt(mean((data$obs - data$pred)^2)),
    MAE = mean(abs(data$obs - data$pred)))
}
Run on the test data, as you normally would in caret:
test.Control <- trainControl(method = "cv", summaryFunction = Performance_Summary,
                             index = trFolds, indexOut = testFolds)
results_test <- train(medv ~ ., data = full_df, method = "lm", trControl = test.Control)
head(results_test$resample)
RMSE MAE Resample
1 4.07 3.02 Fold01
2 4.10 3.04 Fold02
3 5.76 4.48 Fold03
4 4.16 2.97 Fold04
5 4.10 3.01 Fold05
6 6.14 4.25 Fold06
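If you want the averages rather than the per-fold values, one way is to take the column means of the resample table; caret also stores the same fold-averaged metrics in results_test$results:
# average out-of-sample RMSE and MAE across the 10 folds
colMeans(results_test$resample[, c("RMSE", "MAE")])
# the same averages are also reported in results_test$results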
Run with the same training folds, but this time evaluate on those same indices (indexOut = trFolds), which gives the in-sample metrics:
train.Control <- trainControl(method = "cv", summaryFunction = Performance_Summary,
                              index = trFolds, indexOut = trFolds)
results_train <- train(medv ~ ., data = full_df, method = "lm", trControl = train.Control)
head(results_train$resample)
RMSE MAE Resample
1 4.80 3.35 Fold01
2 4.80 3.32 Fold02
3 4.63 3.19 Fold03
4 4.79 3.29 Fold04
5 4.80 3.31 Fold05
6 4.57 3.18 Fold06
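To see the in-sample and out-of-sample numbers side by side per fold, one option is to merge the two resample tables on the fold label (the comparison name and the suffixes are just illustrative labels):
# per-fold comparison of in-sample vs. out-of-sample metrics
comparison <- merge(results_train$resample, results_test$resample,
                    by = "Resample", suffixes = c("_inSample", "_outSample"))
head(comparison)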
Below is a simple implementation by hand, and you can see we get the same results. First, we alter the metric function slightly:
mets <- function(obs, pred) {
  c(
    RMSE = sqrt(mean((obs - pred)^2)),
    MAE = mean(abs(obs - pred))
  )
}
Then:
# fit on each training fold, then compute the metrics on both the training and test folds
results = lapply(1:length(testFolds), function(i) {
  trData = full_df[trFolds[[i]], ]
  testData = full_df[testFolds[[i]], ]
  fit = lm(medv ~ ., data = trData)
  inSample = mets(trData$medv, fit$fitted.values)
  outSample = mets(testData$medv, predict(fit, testData))
  data.frame(
    folds = i,
    inSample_RMSE = inSample[1],
    inSample_MAE = inSample[2],
    outSample_RMSE = outSample[1],
    outSample_MAE = outSample[2]
  )
})
results = do.call(rbind, results)
folds inSample_RMSE inSample_MAE outSample_RMSE outSample_MAE
RMSE 1 4.80 3.35 4.07 3.02
RMSE1 2 4.80 3.32 4.10 3.04
RMSE2 3 4.63 3.19 5.76 4.48
RMSE3 4 4.79 3.29 4.16 2.97
RMSE4 5 4.80 3.31 4.10 3.01
RMSE5 6 4.57 3.18 6.14 4.25
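And, as with the caret output, averaging across folds is just a column mean over the metric columns:
# average in-sample and out-of-sample metrics over the 10 folds
colMeans(results[, -1])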