I need to split my data for cross-validation sequentially, for example:
fold 1 contains the observations with indices 1 to 10, fold 2 those with indices 11 to 20, and so on.
Does any of the methods in trainControl() from caret do this sequentially? I assumed the "cv" method splits the data this way, but nothing in caret's documentation clearly guarantees it.
You can provide the folds yourself using the indexOut= argument of trainControl(); see its help page. Below I use iris as an example. I cannot split it sequentially as-is because the data is ordered by Species, so I shuffle the rows first:
library(caret)
dat = iris[sample(nrow(iris)),]
Then I create the folds. This is for 10-fold cross-validation, so each fold holds 1/10 of the total number of rows:
idx = (1:nrow(dat) - 1) %/% (nrow(dat) / 10)
Folds = split(1:nrow(dat),idx)
We can look at the assignment of the indices:
Folds[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Folds[[2]]
[1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
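The question asks for folds of a fixed size (10 rows each) rather than a fixed number of folds. A minimal sketch of that variant in base R, assuming a fold size of 10 (fold_size and folds_by_size are illustrative names, not part of caret):

```r
# Sequential folds of a fixed size: rows 1-10, 11-20, ...
n <- nrow(iris)          # 150 rows in the example data
fold_size <- 10          # assumed fold size, taken from the question

# ceiling(i / fold_size) maps rows 1-10 to group 1, rows 11-20 to group 2, ...
folds_by_size <- split(seq_len(n), ceiling(seq_len(n) / fold_size))

length(folds_by_size)    # number of folds
folds_by_size[[2]]       # rows held out in the second fold
```

If the number of rows is not a multiple of fold_size, the last fold is simply shorter.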
Then run train() with this:
model = train(Species ~.,method="rf",data=dat,
trControl=trainControl(method="cv",indexOut=Folds))
model
Random Forest
150 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 1.0000000 1.0000000
3 1.0000000 1.0000000
4 0.9933333 0.9895833
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
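One caveat worth checking: the trainControl() help page describes indexOut as a list "the same length as index", and as far as I can tell train() still draws the training rows (index) at random when only indexOut is given. In that case the held-out rows can overlap with the training rows, which would inflate the accuracy. A more cautious sketch passes both arguments, so each resample trains on exactly the rows that are not held out (folds_out, folds_in and ctrl are illustrative names; the set.seed() call is only there for reproducibility):

```r
library(caret)

set.seed(1)
dat <- iris[sample(nrow(iris)), ]
n <- nrow(dat)

# Sequential hold-out folds: rows 1-15, 16-30, ...
fold_id <- (seq_len(n) - 1) %/% (n / 10)
folds_out <- split(seq_len(n), fold_id)
names(folds_out) <- sprintf("Fold%02d", seq_along(folds_out))

# Training rows for each resample: every row that is not held out
folds_in <- lapply(folds_out, function(i) setdiff(seq_len(n), i))

ctrl <- trainControl(method = "cv", number = 10,
                     index = folds_in, indexOut = folds_out)

# Requires the randomForest package for method = "rf"
model <- train(Species ~ ., data = dat, method = "rf", trControl = ctrl)
model
```

With both lists supplied, no row appears in the training and hold-out set of the same resample, so the reported accuracy reflects rows the model has not seen.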