I have a data set with 3 predictors (P1-P3) and 1 outcome variable (Response). The data were gathered from 5 centers (200 IDs). I split the whole data set into train (70%) and test (30%).
Sample data:
ID Centers P1 P2 P3 Response
ID1 Center1 12 1 1 Class1
ID2 Center2 73 1 3 Class2
ID3 Center3 56 2 1 Class1
ID4 Center1 44 1 3 Class2
ID5 Center4 33 1 1 Class1
ID6 Center5 26 1 1 Class2
ID7 Center2 61 1 1 Class1
ID8 Center3 44 1 3 Class2
ID9 Center5 45 1 1 Class1
I want to partition the data into train and test sets in a way that takes both the centers and the classes of the outcome variable into account. What I have so far is:
library(caret)
set.seed(123)
train.index <- createDataPartition(data$Response, p = .7, list = FALSE)
train <- data[ train.index,]
test <- data[-train.index,]
How can I write the code so that the partitioning draws data from all centers?
Maybe it's not the perfect answer, but I had a similar problem and solved it using dplyr::group_by and dplyr::sample_n. I needed train and test sets balanced by group, with a test set made up of individuals from my data that were not in the train set.
For example, using the famous mtcars dataset:
library(dplyr)
mtcars %>% # in your case your data
group_by(cyl) %>% # in your case Centers
sample_n(2) # here the numbers of the sample for each group
So it becomes:
train <- data %>% group_by(Centers) %>% sample_n(28)
This means that if you have 200 rows and 5 centers, with the same number of individuals in each center (let's call it balanced), you have 200/5 = 40 rows per group, so sample_n without replacement can take at most 40 per group.
With balanced groups, if my math is right, you can set it to 28 (200 * 0.70 / 5) to get 70% coverage, balanced across the groups.
If the groups are not balanced, then without replacement the sample size can be at most the size of the smallest group.
Otherwise, you have to sample with replacement (replace = TRUE).
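For unbalanced groups, one alternative to a fixed count is to sample a proportion within each group. The following is only a minimal sketch, assuming dplyr >= 1.0.0 (where slice_sample() is available) and a made-up data frame called toy whose column names mirror the question. Grouping by both Centers and Response is my reading of "considers centers and classes", not something I have tested on the real data:

```r
library(dplyr)

set.seed(123)

# Hypothetical, deliberately unbalanced data: 200 IDs over 5 centers
toy <- data.frame(
  ID       = paste0("ID", 1:200),
  Centers  = rep(paste0("Center", 1:5), times = c(60, 50, 40, 30, 20)),
  P1       = sample(10:80, 200, replace = TRUE),
  Response = sample(c("Class1", "Class2"), 200, replace = TRUE)
)

# Take 70% of the rows inside every Centers x Response cell,
# without replacement
train <- toy %>%
  group_by(Centers, Response) %>%
  slice_sample(prop = 0.7) %>%
  ungroup()

# Rows per center in the training set
count(train, Centers)
```

Note that prop is rounded towards zero within each cell, so the training set can end up slightly below 70% of the rows, and a cell with a single row contributes nothing.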
To build the test set from the individuals that are not in the training set, you can do this:
test <- data %>% filter(!ID %in% train$ID)
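To check that a split built this way behaves as intended, both steps can be run on made-up data and inspected. This is a sketch under assumptions: toy is a hypothetical balanced data frame (40 IDs per center) using the question's column names, and anti_join() is just another way of saying "rows whose ID is not in train":

```r
library(dplyr)

set.seed(123)

# Hypothetical balanced data: 5 centers x 40 IDs = 200 rows
toy <- data.frame(
  ID       = paste0("ID", 1:200),
  Centers  = rep(paste0("Center", 1:5), each = 40),
  Response = sample(c("Class1", "Class2"), 200, replace = TRUE)
)

# 28 rows per center = 70% of 40
train <- toy %>%
  group_by(Centers) %>%
  sample_n(28) %>%
  ungroup()

# Everything whose ID is not in train goes to test
test <- anti_join(toy, train, by = "ID")

# Sanity checks: no overlap, nothing lost, every center in both sets
length(intersect(train$ID, test$ID))   # 0
nrow(train) + nrow(test) == nrow(toy)  # TRUE
table(test$Centers)                    # 12 per center
```

This only balances the split by center. The class proportions of Response within each set are left to chance unless Response is added to the group_by() call.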
Hope it helps.