
How does caret split the data in trainControl?


I need cross-validation folds that follow the row order of the data, for example:

fold 1 contains the observations with indices 1 to 10, fold 2 those with indices 11 to 20, and so on.

Do any of the methods in trainControl() from caret split the data sequentially? I assumed the "cv" method does, but the caret documentation does not state this clearly.


Solution

  • You can supply the folds yourself through the indexOut= argument of trainControl(); see its help page (?trainControl). I use iris as an example below. Because iris is ordered by Species, sequential folds on the raw data would each be dominated by one species, so I shuffle the rows first:

    library(caret)
    dat = iris[sample(nrow(iris)),]
    

    Next I create the folds. This is for 10-fold cross-validation, so each fold holds 1/10 of the rows, taken in order:

    # Label each row 0..9 by integer division, then group the row numbers by label
    idx = (1:nrow(dat) - 1) %/% (nrow(dat) / 10)
    Folds = split(1:nrow(dat), idx)
    

    We can look at the assignment of the indices:

    Folds[[1]]
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
    
    Folds[[2]]
     [1] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
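
    The question asks for folds of a fixed size (rows 1 to 10, 11 to 20, ...) rather than a fixed number of folds. A minimal sketch of that variant follows; the names n, fold_size and folds_by_size are illustrative assumptions, not part of the answer above. If n is not a multiple of fold_size, the last fold is simply shorter.

    ```r
    # Sketch: sequential folds of a fixed size, as described in the question.
    n <- 150          # assumed number of rows, e.g. nrow(dat)
    fold_size <- 10   # assumed number of rows per fold

    # Rows 1-10 get label 1, rows 11-20 get label 2, and so on.
    fold_id <- ceiling(seq_len(n) / fold_size)
    folds_by_size <- split(seq_len(n), fold_id)

    length(folds_by_size)   # number of folds
    folds_by_size[[2]]      # row numbers in the second fold
    ```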
    

    Then run train() with this:

    model = train(Species ~ ., method = "rf", data = dat,
                  trControl = trainControl(method = "cv", indexOut = Folds))
    
    
    model
    Random Forest 
    
    150 samples
      4 predictor
      3 classes: 'setosa', 'versicolor', 'virginica' 
    
    No pre-processing
    Resampling: Cross-Validated (10 fold) 
    Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
    Resampling results across tuning parameters:
    
      mtry  Accuracy   Kappa    
      2     1.0000000  1.0000000
      3     1.0000000  1.0000000
      4     0.9933333  0.9895833
    
    Accuracy was used to select the optimal model using the largest value.
    The final value used for the model was mtry = 2.
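
    One point worth checking: according to ?trainControl, index lists the rows used for training in each resample and indexOut lists the rows held out. As far as I can tell, when only indexOut is given, caret still generates the training rows itself (randomly for method = "cv"), so they can overlap the held-out rows, and the perfect accuracy above may partly reflect that. The sketch below assumes this behaviour and passes the complement of each fold as index; the seed and the names TrainRows and ctrl are my own additions, and the results will differ from the output shown above.

    ```r
    library(caret)

    set.seed(1)                           # assumed seed, for a reproducible shuffle
    dat <- iris[sample(nrow(iris)), ]

    # Sequential held-out folds, built the same way as above.
    idx <- (seq_len(nrow(dat)) - 1) %/% (nrow(dat) / 10)
    Folds <- split(seq_len(nrow(dat)), idx)
    names(Folds) <- paste0("Fold", seq_along(Folds))

    # Training rows for each resample: every row outside the held-out fold.
    TrainRows <- lapply(Folds, function(held_out) setdiff(seq_len(nrow(dat)), held_out))

    # Pass both lists so training and held-out rows never overlap.
    ctrl <- trainControl(method = "cv", index = TrainRows, indexOut = Folds)
    model <- train(Species ~ ., data = dat, method = "rf", trControl = ctrl)
    ```

    With both arguments supplied, each model is fitted on 135 rows and evaluated on the 15 sequential rows it has not seen.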