R caret package: data partition into training / test sets before trainControl?


I see a lot of R code in which the full dataset is first split into a training set and a test set:

library(caret)
library(klaR)

# load the iris dataset
data(iris)

# define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(iris$Species, p=0.8, list=FALSE)
data_train <- iris[trainIndex,]
data_test <- iris[-trainIndex,]

Next, a resampling method is defined, such as repeated k-fold cross-validation:

train_control <- trainControl(method="repeatedcv", number=10, repeats=3)

Then a model is trained using the training set:

my_model <- train(Species~., data=data_train, trControl=train_control, method="nb")

Finally, predictions are made on the test set:

pred_results <- predict(my_model, newdata=data_test)

With (repeated) k-fold cross-validation specifically, it seems to me that the training set (k-1 folds) and the test set (1 fold) are already inherently defined.

In that case, why add an extra layer of partitioning by first splitting the full dataset into 80% training and 20% test sets? Is it necessary?


Solution

  • From section 2.2 of An Introduction to Statistical Learning, available here:

    In general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.

    Read the whole chapter, including the part on the bias-variance trade-off.

    TL;DR: you need to test your trained algorithm on unseen data to see how well it performs. If you include your test data in the training (with 10-fold CV or not), the algorithm has already seen those cases, and you will be overconfident in its predictions.
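
To make this concrete, here is a minimal sketch that reuses the setup from the question and reports both the cross-validated accuracy (computed inside the training set only) and the accuracy on the held-out 20%. The seed value and the names cv_accuracy, test_cm and test_accuracy are my own additions for illustration, not part of the original code, and the exact numbers will depend on the split.

```r
library(caret)
library(klaR)  # provides the naive Bayes implementation behind method = "nb"

set.seed(42)   # arbitrary seed (assumption), only to make the split reproducible

data(iris)

# Hold out 20% of the rows; they take no part in training or resampling.
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
data_train <- iris[trainIndex, ]
data_test  <- iris[-trainIndex, ]

# Repeated 10-fold CV is run on the training rows only.
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
my_model <- train(Species ~ ., data = data_train,
                  trControl = train_control, method = "nb")

# Best resampled (cross-validated) accuracy over the tuning grid.
cv_accuracy <- max(my_model$results$Accuracy, na.rm = TRUE)

# Accuracy on the held-out rows, which the model has never seen.
pred_results  <- predict(my_model, newdata = data_test)
test_cm       <- confusionMatrix(pred_results, data_test$Species)
test_accuracy <- unname(test_cm$overall["Accuracy"])

cat("CV accuracy (training set only):", round(cv_accuracy, 3), "\n")
cat("Held-out test accuracy:         ", round(test_accuracy, 3), "\n")
```

The cross-validated figure is what train() uses to choose tuning parameters, so it is not a fully independent estimate; the held-out figure is the one that reflects performance on unseen data. On a small, easy dataset like iris the two may turn out very close.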