I would like to calculate the in-sample and out-of-sample predictive accuracy for certain metrics, all while using caret's k-fold cross validation.
So far I have got:
library(MASS)
library(leaps)
library(caret)
library(tidyverse)
full_df <- surgical
set.seed(123)
Performance_Summary <- function(data,
                                lev = NULL,
                                model = NULL) {
  c(RMSE = sqrt(mean((data$obs - data$pred)^2)),
    MAE = mean(data$obs - data$pred))
}
train.Control <- trainControl(method = "cv", number = 10, summaryFunction = Performance_Summary)
cv_linear_model <- train(y~., data = full_df, method = "lm", trControl = train.Control)
cv_linear_model
This should give me the average RMSE and MAE over the 10 out-of-sample (test) folds.
What I would like to do now is calculate the average RMSE and MAE over the 10 in-sample (training) folds.
Is this possible using the caret package? Or would I need to implement k-fold cross validation by hand to obtain the in-sample metrics?
Thanks for your help!
If you don't mind fitting the model twice, you can set the testing and training folds first. Using the example dataset BostonHousing, where medv is the dependent variable:
library(mlbench)
data(BostonHousing)
full_df = BostonHousing[1:400,]
# create folds: createFolds returns the test indices; the training indices are their complement
set.seed(111)
testFolds = createFolds(full_df$medv, k = 10)
trFolds = lapply(testFolds, function(i) setdiff(1:nrow(full_df), i))
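As a purely optional sanity check, you can confirm that the test folds partition the rows and that each training fold is disjoint from its corresponding test fold (this is just a quick sketch, nothing in caret requires it):
# optional check: the test folds should cover every row exactly once,
# and no training fold should overlap its test fold
stopifnot(identical(sort(unlist(testFolds, use.names = FALSE)), 1:nrow(full_df)))
stopifnot(all(mapply(function(tr, te) length(intersect(tr, te)) == 0, trFolds, testFolds)))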
There's an error with the MAE: it should be the mean of the absolute errors:
Performance_Summary <- function(data,
                                lev = NULL,
                                model = NULL) {
  c(RMSE = sqrt(mean((data$obs - data$pred)^2)),
    MAE = mean(abs(data$obs - data$pred)))
}
Run on the test data, as you normally would in caret:
test.Control <- trainControl(method = "cv", summaryFunction = Performance_Summary,
                             index = trFolds, indexOut = testFolds)
results_test <- train(medv ~ ., data = full_df, method = "lm", trControl = test.Control)
head(results_test$resample)
RMSE MAE Resample
1 4.07 3.02 Fold01
2 4.10 3.04 Fold02
3 5.76 4.48 Fold03
4 4.16 2.97 Fold04
5 4.10 3.01 Fold05
6 6.14 4.25 Fold06
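If you want the averages rather than the per-fold values, one way is to take the column means of the resample table; caret also stores the same fold-averaged metrics in results_test$results:
# average out-of-sample RMSE and MAE across the 10 folds
colMeans(results_test$resample[, c("RMSE", "MAE")])
# the same averages are also reported in results_test$results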
Run with the same training folds, but this time evaluate on those same indices (indexOut = trFolds), which gives the in-sample metrics:
train.Control <- trainControl(method = "cv", summaryFunction = Performance_Summary,
                              index = trFolds, indexOut = trFolds)
results_train <- train(medv ~ ., data = full_df, method = "lm", trControl = train.Control)
head(results_train$resample)
RMSE MAE Resample
1 4.80 3.35 Fold01
2 4.80 3.32 Fold02
3 4.63 3.19 Fold03
4 4.79 3.29 Fold04
5 4.80 3.31 Fold05
6 4.57 3.18 Fold06
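To see the in-sample and out-of-sample numbers side by side per fold, one option is to merge the two resample tables on the fold label (the comparison name and the suffixes are just illustrative labels):
# per-fold comparison of in-sample vs. out-of-sample metrics
comparison <- merge(results_train$resample, results_test$resample,
                    by = "Resample", suffixes = c("_inSample", "_outSample"))
head(comparison)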
Below is a simple implementation by hand, and you can see we get the same results. First, we alter the metric function slightly:
mets <- function(obs, pred) {
  c(
    RMSE = sqrt(mean((obs - pred)^2)),
    MAE = mean(abs(obs - pred))
  )
}
Then:
# fit on each training fold, then compute the metrics on both the training and test folds
results = lapply(1:length(testFolds), function(i) {
  trData = full_df[trFolds[[i]], ]
  testData = full_df[testFolds[[i]], ]
  fit = lm(medv ~ ., data = trData)
  inSample = mets(trData$medv, fit$fitted.values)
  outSample = mets(testData$medv, predict(fit, testData))
  data.frame(
    folds = i,
    inSample_RMSE = inSample[1],
    inSample_MAE = inSample[2],
    outSample_RMSE = outSample[1],
    outSample_MAE = outSample[2]
  )
})
results = do.call(rbind, results)
folds inSample_RMSE inSample_MAE outSample_RMSE outSample_MAE
RMSE 1 4.80 3.35 4.07 3.02
RMSE1 2 4.80 3.32 4.10 3.04
RMSE2 3 4.63 3.19 5.76 4.48
RMSE3 4 4.79 3.29 4.16 2.97
RMSE4 5 4.80 3.31 4.10 3.01
RMSE5 6 4.57 3.18 6.14 4.25
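And, as with the caret output, averaging across folds is just a column mean over the metric columns:
# average in-sample and out-of-sample metrics over the 10 folds
colMeans(results[, -1])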