Tags: r, cross-validation, r-caret

Calculating in-sample predictive accuracy using caret's cross-validation


I would like to calculate the in-sample and out-of-sample predictive accuracy for certain metrics, all while using caret's k-fold cross-validation.

So far I have:

library(MASS)
library(leaps)
library(caret)
library(tidyverse)

full_df <- surgical

set.seed(123)
Performance_Summary <- function(data,
                                lev = NULL, 
                                model = NULL) {
                                c(RMSE = sqrt(mean((data$obs-data$pred)^2)),
                                  MAE = mean(data$obs - data$pred))
}

train.Control <- trainControl(method = "cv", number = 10, summaryFunction = Performance_Summary)
cv_linear_model <- train(y~., data = full_df, method = "lm", trControl = train.Control)

cv_linear_model

This should give me the average RMSE and MAE over the 10 out-of-sample (test) sets.
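
As far as I can tell, the averaged metrics and the per-fold test metrics are available on the fitted object:

cv_linear_model$results    # metrics averaged over the 10 folds
cv_linear_model$resample   # per-fold test-set metrics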

What I would like to do now is calculate the average RMSE and MAE for each of the 10 in-sample (training) sets.

Is this possible using the caret package? Or would I need to implement k-fold cross-validation by hand to obtain the in-sample metrics?

Thanks for your help!


Solution

  • If you don't mind fitting the model twice, first set the testing and training folds yourself, using the example dataset BostonHousing, where medv is the dependent variable:

    library(mlbench)
    data(BostonHousing)
    full_df = BostonHousing[1:400,]
    
    # create the test folds, then take each training fold as the complement
    set.seed(111)
    testFolds = createFolds(full_df$medv, k = 10)
    
    trFolds = lapply(testFolds, function(i) setdiff(1:nrow(full_df), i))
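
    To double-check the folds, each training set should be the exact complement of its held-out set:

    # sanity check: every train/test pair partitions the rows of full_df
    stopifnot(all(sapply(seq_along(testFolds), function(i)
      length(intersect(testFolds[[i]], trFolds[[i]])) == 0 &&
      length(union(testFolds[[i]], trFolds[[i]])) == nrow(full_df))))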
    

    There's an error in your MAE: it should be the mean of the absolute differences:

    Performance_Summary <- function(data,
                                    lev = NULL, 
                                    model = NULL) {
                                    c(RMSE = sqrt(mean((data$obs-data$pred)^2)),
                                      MAE = mean(abs(data$obs - data$pred)))
    }
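
    As a quick check of the corrected function, here is a toy prediction set (toy is just a made-up data frame with the obs and pred columns that caret passes in):

    # errors are -0.5, 0.5, -0.5, so RMSE and MAE both come out to 0.5
    toy <- data.frame(obs = c(1, 2, 3), pred = c(1.5, 1.5, 3.5))
    Performance_Summary(toy)
    # RMSE  MAE 
    #  0.5  0.5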
    

    Run on the test folds, as you normally would in caret:

    test.Control <- trainControl(method = "cv", summaryFunction = Performance_Summary,
                                 index = trFolds, indexOut = testFolds)
    
    results_test <- train(medv~., data = full_df, method = "lm", trControl = test.Control)
    
    head(results_test$resample)
          RMSE  MAE Resample
    1 4.07 3.02   Fold01
    2 4.10 3.04   Fold02
    3 5.76 4.48   Fold03
    4 4.16 2.97   Fold04
    5 4.10 3.01   Fold05
    6 6.14 4.25   Fold06
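
    Averaging over the folds then gives the usual out-of-sample summary:

    # mean out-of-sample RMSE and MAE across the 10 folds
    colMeans(results_test$resample[, c("RMSE", "MAE")])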
    

    Run with the same training folds, but also evaluate on those same indices, which gives the in-sample metrics:

    train.Control <- trainControl(method = "cv", summaryFunction = Performance_Summary,
                                  index = trFolds, indexOut = trFolds)
    
    results_train <- train(medv~., data = full_df, method = "lm", trControl = train.Control)
    
    head(results_train$resample)    
    
      RMSE  MAE Resample
    1 4.80 3.35   Fold01
    2 4.80 3.32   Fold02
    3 4.63 3.19   Fold03
    4 4.79 3.29   Fold04
    5 4.80 3.31   Fold05
    6 4.57 3.18   Fold06
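
    And the in-sample averages the same way:

    # mean in-sample RMSE and MAE across the 10 folds
    colMeans(results_train$resample[, c("RMSE", "MAE")])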
    

    Below is a simple manual implementation, and you can see we get the same results. First, alter the metric function slightly so it takes the observed and predicted vectors directly:

    mets <- function(obs, pred){
      c(
        RMSE = sqrt(mean((obs - pred)^2)),
        MAE = mean(abs(obs - pred))
      )
    }
    

    Then:

    results = lapply(seq_along(testFolds), function(i){
    
      trData = full_df[trFolds[[i]], ]
      testData = full_df[testFolds[[i]], ]
      fit = lm(medv ~ ., data = trData)
    
      # in-sample: evaluate on the training fold's own fitted values
      inSample = mets(trData$medv, fit$fitted.values)
      # out-of-sample: evaluate predictions on the held-out fold
      outSample = mets(testData$medv, predict(fit, testData))
    
      data.frame(
        folds = i,
        inSample_RMSE = inSample[1],
        inSample_MAE = inSample[2],
        outSample_RMSE = outSample[1],
        outSample_MAE = outSample[2]
      )
    })
    
    results = do.call(rbind, results)
    
              folds inSample_RMSE inSample_MAE outSample_RMSE outSample_MAE
    RMSE      1          4.80         3.35           4.07          3.02
    RMSE1     2          4.80         3.32           4.10          3.04
    RMSE2     3          4.63         3.19           5.76          4.48
    RMSE3     4          4.79         3.29           4.16          2.97
    RMSE4     5          4.80         3.31           4.10          3.01
    RMSE5     6          4.57         3.18           6.14          4.25
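
    Column means over the folds then reproduce the averages caret reports:

    # fold-averaged metrics (dropping the fold index column)
    colMeans(results[, -1])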