I have a dataset of 550k items, split into 500k for training and 50k for testing. During the training stage it is necessary to establish the 'best' combination of each algorithm's parameter values. Rather than use the entire 500k for this I'd be happy to use a subset, BUT when it comes to training the final model with the 'best' combination, I'd like to use the full 500k. In pseudocode the task looks like:
    subset the 500k training data to 50k
    for each combination of model parameters (3, 6, or 9):
        for each repeat (3):
            for each fold (10):
                fit the model on the 50k subset using 9 folds
                evaluate performance on the remaining fold
    establish the best combination of parameters
    fit to all 500k using the best combination of parameters
To do this I need to tell caret that prior to optimisation it should subset the data, but for the final fit use all the data.
I can do this by: (1) subsetting the data; (2) running the usual train stages; (3) stopping the final fit (not needed); (4) reading the 'best' combination from the train output; (5) running train on the full 500k with no parameter optimisation.
This is a bit untidy, and I don't know how to stop caret from training the final model, which I will never use.
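For reference, the manual workaround described in steps (1)-(5) can be sketched as follows. The data frame `train_df` and outcome column `y` are hypothetical placeholders; `trainControl(method = "none")` is caret's way of skipping resampling so the second `train` call only fits once with the supplied `tuneGrid`:

```r
library(caret)

## (1) subset the 500k training data (train_df and y are hypothetical names)
sub_df <- train_df[sample(nrow(train_df), 50000), ]

## (2) and (4): tune on the subset, then read off the best parameters
tuned <- train(y ~ ., data = sub_df, method = "rf",
               trControl = trainControl(method = "repeatedcv",
                                        number = 10, repeats = 3))
best <- tuned$bestTune

## (5) refit on the full 500k with no optimisation:
## method = "none" skips resampling; tuneGrid pins the parameters
final <- train(y ~ ., data = train_df, method = "rf",
               trControl = trainControl(method = "none"),
               tuneGrid = best)
```

Note that step (3) cannot be expressed here: the first `train` call still fits its own (unused) final model on the subset, which is exactly the untidiness the question complains about.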
This is possible by specifying the index, indexOut and indexFinal arguments to trainControl.
Here is an example using the Sonar data set from the mlbench package:
library(caret)
library(mlbench)
data(Sonar)
Let's say we want to draw half of the Sonar data set each time for training, and repeat that 10 times:
train_inds <- replicate(10, sample(1:nrow(Sonar), size = nrow(Sonar)/2), simplify = FALSE)
If you are interested in a different sampling approach please post the details. This is for illustration only.
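For instance, to reproduce the repeated 10-fold CV from the question (10 folds, 3 repeats), caret's `createMultiFolds` builds the training-index list directly; each element holds the rows kept for training in one resample, and the result can be passed to `index` as below:

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(1)
## 10 folds repeated 3 times -> a list of 30 training-index vectors,
## stratified on the class variable
cv_inds <- createMultiFolds(Sonar$Class, k = 10, times = 3)
length(cv_inds)  # 30 resamples
```

When `index` is supplied this way, `indexOut` can be omitted: caret defaults it to the rows not in each training set.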
For testing we will use 10 random rows not in train_inds:
test_inds <- lapply(train_inds, function(x){
  inds <- setdiff(1:nrow(Sonar), x)
  sample(inds, size = 10)
})
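As a quick sanity check (base R only, shown here as a self-contained mock of the same construction), every held-out set should be disjoint from its matching training draw:

```r
## mock the construction above with n = 208 rows (the size of Sonar)
n <- 208
train_inds <- replicate(10, sample(1:n, size = n/2), simplify = FALSE)
test_inds  <- lapply(train_inds, function(x) sample(setdiff(1:n, x), size = 10))

## each test set has 10 rows and shares none with its training set
stopifnot(all(lengths(test_inds) == 10))
stopifnot(all(mapply(function(tr, te) length(intersect(tr, te)) == 0,
                     train_inds, test_inds)))
```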
Now just specify test_inds and train_inds in trainControl:
ctrl <- trainControl(
  method = "boot",
  number = 10,
  classProbs = TRUE,
  savePredictions = "final",
  index = train_inds,
  indexOut = test_inds,
  indexFinal = 1:nrow(Sonar),
  summaryFunction = twoClassSummary
)
Note that indexFinal defaults to all rows, so it could be omitted here; specify a subset instead if you do not wish to fit the final model on all rows.
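For example, to fit the final model on a random 150-row subset instead (the size 150 is an arbitrary choice for illustration):

```r
ctrl_sub <- trainControl(
  method = "boot",
  number = 10,
  classProbs = TRUE,
  savePredictions = "final",
  index = train_inds,
  indexOut = test_inds,
  indexFinal = sample(1:nrow(Sonar), 150),  # final fit uses only these 150 rows
  summaryFunction = twoClassSummary
)
```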
and fit:
model <- train(
  Class ~ .,
  data = Sonar,
  method = "rf",
  trControl = ctrl,
  metric = "ROC"
)
model
#output
Random Forest

208 samples, 208 used for final model
 60 predictor
  2 classes: 'M', 'R'

No pre-processing
Resampling: Bootstrapped (10 reps)
Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
Resampling results across tuning parameters:

  mtry  ROC        Sens    Spec
   2    0.9104167  0.7750  0.8250000
  31    0.9125000  0.7875  0.7916667
  60    0.9083333  0.7875  0.8166667
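Once trained, the winning parameters and the final model (fit on the indexFinal rows) are available as usual; a sketch, assuming the model object above:

```r
model$bestTune    # the selected mtry value
model$finalModel  # the randomForest fit on all 208 rows

## predictions from the final model, e.g. class probabilities
head(predict(model, newdata = Sonar, type = "prob"))
```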