r tidymodels

Use multiple strata for initial_split?


I am working with biological data (genes) that have multiple characteristics, and I want these to be properly represented in both my training and test data.

However, the initial_split function only accepts a single strata variable. Is there a good way to create an initial split of my data using multiple strata, preferably with tidymodels / tidyverse?

Thank you!


Solution

  • You would have to make a composite column to stratify on. We've confined the strata to one column on purpose: when several variables are combined, the resulting per-stratum sample sizes can get very small and you may not be able to stratify.

    Another approach (which I will eventually add a PR for) is twinning, which has a corresponding R package.

    If you still want an initial_split object, you can make one with rsample::make_splits() from the twinning results.
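
A minimal sketch of the two ideas in the answer, not a definitive implementation. The gene table and its column names (biotype, chrom_arm, expr) are made up for illustration, and base R's interaction() is just one way to build the composite column. The second part does not call the twinning package at all: random indices stand in for whatever test-set row indices an external method would return, to show how rsample::make_splits() accepts them.

```r
library(rsample)
library(dplyr)

set.seed(123)

# Hypothetical gene table; all column names are assumptions for illustration
genes <- tibble(
  gene_id   = paste0("gene_", 1:400),
  biotype   = sample(c("coding", "noncoding"), 400, replace = TRUE),
  chrom_arm = sample(c("p", "q"), 400, replace = TRUE),
  expr      = rnorm(400)
)

# 1) Composite column: one factor level per combination of the two variables
genes <- genes %>%
  mutate(strata_combo = interaction(biotype, chrom_arm, drop = TRUE))

# Inspect stratum sizes first; very small strata are the risk mentioned above
print(count(genes, strata_combo))

split <- initial_split(genes, prop = 0.75, strata = strata_combo)
train <- training(split)
test  <- testing(split)

# 2) Building a split from externally chosen row indices.
# Stand-in only: these random indices take the place of the test-set indices
# that a method such as twinning would produce.
test_idx  <- sample(nrow(genes), 100)
train_idx <- setdiff(seq_len(nrow(genes)), test_idx)

custom_split <- make_splits(
  list(analysis = train_idx, assessment = test_idx),
  data = genes
)
custom_train <- training(custom_split)
custom_test  <- testing(custom_split)
```

Checking the counts per stratum before splitting is worthwhile: with many variables crossed, some combinations may hold only a handful of rows, which is the reason the answer gives for restricting strata to one column.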