Tags: r, machine-learning, mlr

Can we use a pre-defined column for CV (resampling) in mlr?


To run cross-validation (resampling) in the mlr R package, we normally call the makeResampleDesc function to specify the resampling method and the number of folds.

My questions are:

  1. Is it possible to use a pre-defined column as the fold column? Or,
  2. Does makeResampleDesc in mlr guarantee that the folds it creates are consistent (across different learners under the same seed, of course), and can those folds be exported for further manipulation?

Solution

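  • The resample description is independent of any learner; you can use one with several learners and get the same folds, provided the seed is the same or you instantiate the description once with makeResampleInstance and reuse that instance. You can also extract the fold number of each observation from the resample result if you want to link it back to the original data.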

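    You can use a column in the data as the fold column by passing it as the blocking argument to makeClassifTask. Blocking only guarantees that observations with the same level stay together, so set the number of CV iterations equal to the number of levels if each level should form its own fold. From the help: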

    blocking: [‘factor’]

          An optional factor of the same length as the number of
          observations. Observations with the same blocking level
          “belong together”. Specifically, they are either put all in
          the training or the test set during a resampling iteration.
          Default is ‘NULL’ which means no blocking.
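
Both points above can be sketched together as follows. This is a minimal illustration, not a definitive recipe: the `fold` column is hypothetical and built on `iris`, and `blocking.cv = TRUE` assumes a recent mlr release that has this argument (older releases applied blocking automatically, so there the argument would have to be dropped).

```r
library(mlr)

set.seed(1)

# Hypothetical pre-defined fold column: three folds of 50 rows each.
df <- iris
df$fold <- factor(sample(rep(1:3, length.out = nrow(df))))

# Pass the fold column as the blocking factor and keep it out of the features.
task <- makeClassifTask(
  data = df[, names(df) != "fold"],
  target = "Species",
  blocking = df$fold
)

# One CV iteration per block level, so each level ends up as one test fold.
# Assumption: the installed mlr version accepts blocking.cv.
rdesc <- makeResampleDesc("CV", iters = nlevels(df$fold), blocking.cv = TRUE)

# Instantiate once so every learner is evaluated on identical splits.
rin <- makeResampleInstance(rdesc, task)

res_rpart <- resample(makeLearner("classif.rpart"), task, rin, show.info = FALSE)
res_lda   <- resample(makeLearner("classif.lda"), task, rin, show.info = FALSE)

# Link each observation back to the iteration in which it was a test case.
pred <- res_rpart$pred$data
df$cv_iter <- pred$iter[order(pred$id)]

# Each pre-defined fold should map to exactly one CV iteration.
print(table(fold = df$fold, cv_iter = df$cv_iter))
```

The iteration numbers assigned by mlr need not match the labels in `fold`; what matters is that every level of `fold` falls entirely into a single test set. Reusing `rin` rather than `rdesc` is what makes the splits identical for both learners without relying on the seed.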