
How to partition train/test data set based on the centers?


I have a data set with 3 predictors (P1-P3) and 1 response variable (Response). The data were gathered from 5 centers (200 IDs in total). I split the whole data set into train (70%) and test (30%) sets.

Sample data:

ID  Centers   P1    P2  P3  Response
ID1 Center1   12    1   1   Class1
ID2 Center2   73    1   3   Class2
ID3 Center3   56    2   1   Class1
ID4 Center1   44    1   3   Class2
ID5 Center4   33    1   1   Class1
ID6 Center5   26    1   1   Class2
ID7 Center2   61    1   1   Class1
ID8 Center3   44    1   3   Class2
ID9 Center5   45    1   1   Class1

I want the train/test partition to take both the centers and the classes of the outcome variable into account. What I have so far is:

library(caret)
set.seed(123)
train.index <- createDataPartition(data$Response, p = .7, list = FALSE)
train <- data[ train.index,]
test  <- data[-train.index,]

How can I write the code so that the partitioning draws data from all centers?


Solution

  • It may not be the perfect answer, but I had a similar problem and handled it with dplyr::group_by and dplyr::sample_n. I needed train and test sets balanced by group, with the test set made up of the individuals in my data that were not in the train set.

    For example, using the famous mtcars dataset:

    library(dplyr)
    mtcars %>%              # in your case, your data
        group_by(cyl) %>%   # in your case, Centers
        sample_n(2)         # number of rows to sample from each group
    

    So it becomes:

    train <- data %>% group_by(Centers) %>% sample_n(28)
    

    This means that if you have 200 rows and 5 centers with the same number of individuals in each center (call it balanced), each group has 200/5 = 40 rows, so sample_n without replacement can draw at most 40 rows per group.

    With balanced groups, you can set the size to 28 (200 * 0.7 / 5 = 28) to get 70% coverage, balanced across the groups.
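
    The arithmetic can be checked in a few lines of base R. This is only a sketch that assumes the figures from the question (200 rows, 5 equally sized centers, a 70% training share); the variable names are made up for illustration.

    ```r
    # Assumed figures from the question: 200 rows, 5 equally sized centers, 70% train.
    n_rows    <- 200
    n_centers <- 5
    train_pct <- 70

    rows_per_center  <- n_rows / n_centers                   # 40
    train_per_center <- rows_per_center * train_pct / 100    # 28
    test_per_center  <- rows_per_center - train_per_center   # 12

    c(train = train_per_center, test = test_per_center)
    ```

    The value 28 is what goes into sample_n; the remaining 12 rows per center end up in the test set.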

    If the groups are not balanced, then without replacement the size can be at most the size of the smallest group.

    Otherwise, you have to sample with replacement (replace = TRUE in sample_n).
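
    For unbalanced groups, one possible alternative is to sample a proportion of each group rather than a fixed count, so the smallest group no longer limits the size. The sketch below uses dplyr::slice_sample (available in dplyr 1.0.0 and later) on made-up data; the data frame, its group sizes, and the 0.7 share are assumptions for illustration, not your real data.

    ```r
    library(dplyr)

    # Hypothetical unbalanced data: 200 IDs spread unevenly over 5 centers.
    set.seed(123)
    data <- data.frame(
      ID       = paste0("ID", 1:200),
      Centers  = rep(paste0("Center", 1:5), times = c(60, 50, 40, 30, 20)),
      P1       = sample(10:80, 200, replace = TRUE),
      Response = sample(c("Class1", "Class2"), 200, replace = TRUE)
    )

    # Take about 70% of the rows of each center, whatever its size,
    # without replacement (the default).
    train <- data %>%
      group_by(Centers) %>%
      slice_sample(prop = 0.7) %>%
      ungroup()

    # Rows per center in the training set.
    count(train, Centers)
    ```

    Each center contributes roughly 70% of its own rows, so larger centers contribute more rows. If you need the same number of rows from every center instead, a fixed size with replace = TRUE is still the way to go.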

    To build the test set from the individuals that are not in the training set, you can do this:

    test <- data %>% filter(!ID %in% train$ID)
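
    Since the question asks for a split that respects both the centers and the outcome classes, the same idea can be extended by grouping on both columns. This is a minimal sketch on made-up data shaped like the sample in the question; the data frame itself, the 0.7 share, and the use of slice_sample and anti_join are my assumptions, not something tested on your real data.

    ```r
    library(dplyr)

    # Hypothetical stand-in for the question's data: 200 IDs, 5 centers, 2 classes.
    set.seed(123)
    data <- data.frame(
      ID       = paste0("ID", 1:200),
      Centers  = rep(paste0("Center", 1:5), each = 40),
      P1       = sample(10:80, 200, replace = TRUE),
      Response = sample(c("Class1", "Class2"), 200, replace = TRUE)
    )

    # Sample about 70% within every Centers x Response cell.
    train <- data %>%
      group_by(Centers, Response) %>%
      slice_sample(prop = 0.7) %>%
      ungroup()

    # Everything not drawn for training goes to the test set.
    test <- anti_join(data, train, by = "ID")

    # Sanity checks: no shared IDs, and every row lands in exactly one set.
    stopifnot(length(intersect(train$ID, test$ID)) == 0)
    stopifnot(nrow(train) + nrow(test) == nrow(data))

    # Class counts per center in each set.
    count(train, Centers, Response)
    count(test, Centers, Response)
    ```

    As far as I know, the per-group row count is truncated to a whole number, so the overall training share can come out slightly below 70% when the cells are small.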
    

    Hope it helps.