My question is very similar to the one asked in caret: combine createResample and groupKFold
The only difference: I need to create stratified folds (also repeated 10 times) after grouping instead of bootstrapped resamples (which are not stratified as far as I know) for using it with caret's trainControl.
The following code is working with 10-fold repeated CV but I couldn't include the grouping of the data based on an "ID" (df$ID
).
# creating indices
cv.10.folds <- createMultiFolds(rf_label, k = 10, times = 10)
# creating folds
ctrl.10fold <- trainControl(method = "repeatedcv", number = 10, repeats = 10, index = cv.10.folds)
# train
rf.ctrl10 <- train(rf_train, y = rf_label, method = "rf", tuneLength = 6,
ntree = 1000, trControl = ctrl.10fold, importance = TRUE)
That's my actual problem: My data contains many groups composed of 20 instances each, having the same "ID". So, when using the 10-fold CV repeated 10 times I get some instances of a group in the training and some in the validation set. This I want to avoid, but overall I need a stratified partitioning for the prediction value (df$Label
). (All instances having the same "ID" also have the same prediction/label value.)
In the provided and accepted answer from the link above (see parts below) I guess I have to modify the folds2
line to contain the stratified 10-fold CV instead of the bootstrapped
folds <- groupKFold(x)
folds2 <- lapply(folds, function(x) lapply(1:10, function(i) sample(x, size = length(x), replace = TRUE)))
but unfortunately I cannot figure out how exactly. Could you help me with that?
Here is an approach to perform stratified repeated K-fold CV with blocking.
library(caret)
library(tidyverse)
some fake data where id will be the blocking factor:
id <- sample(1:55, size = 1000, replace = T)
y <- rnorm(1000)
x <- matrix(rnorm(10000), ncol = 10)
df <- data.frame(id, y, x)
summarise the observations by the blocking factor:
df %>%
group_by(id) %>%
summarise(mean = mean(y)) %>%
ungroup() -> groups1
create the stratified folds based on the grouped data:
folds <- createMultiFolds(groups1$mean, 10, 3)
back join the original df to the group data and take the df row id's
folds <- lapply(folds, function(i){
data.frame(id = i) %>%
left_join(df %>%
rowid_to_column()) %>%
pull(rowid)
})
check if the data id's in the test are not in the train:
lapply(folds, function(i){
sum(df[i,1] %in% df[-i,1])
})
output is a bunch of zeros, meaning no id's in the test folds are in the train folds.
If your group id's are not numeric there are two approaches to make this work:
1 convert them to numeric:
first some data
id <- sample(1:55, size = 1000, replace = T)
y <- rnorm(1000)
x <- matrix(rnorm(10000), ncol = 10)
df <- data.frame(id = paste0("id_", id), y, x) #factor id's
df %>%
mutate(id = as.numeric(id)) %>% #convert to numeric
group_by(id) %>%
summarise(mean = mean(y)) %>%
ungroup() -> groups1
folds <- createMultiFolds(groups1$mean, 10, 3)
folds <- lapply(folds, function(i){
data.frame(id = i) %>%
left_join(df %>%
mutate(id = as.numeric(id)) %>% #also need to convert to numeric in the original data frame
rowid_to_column()) %>%
pull(rowid)
})
2 filter the id's in grouped data according to fold indexes and then join by id's
df %>%
group_by(id) %>%
summarise(mean = mean(y)) %>%
ungroup() -> groups1
folds <- createMultiFolds(groups1$mean, 10, 3)
folds <- lapply(folds, function(i){
groups1 %>% #start from grouped data
select(id) %>% #select id's
slice(i) %>% #filter id's according to fold index
left_join(df %>% #join by id
rowid_to_column()) %>%
pull(rowid)
})
Will it work for caret?
ctrl.10fold <- trainControl(method = "repeatedcv", number = 10, repeats = 3, index = folds)
rf.ctrl10 <- train(x = df[,-c(1:2)], y = df$y, data = df, method = "rf", tuneLength = 1,
ntree = 20, trControl = ctrl.10fold, importance = TRUE)
rf.ctrl10$results
#output
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 3 1.041641 0.007534611 0.8246514 0.06953668 0.009488169 0.05934975
Also I suggest you check out library mlr
, it has many nice features including blocking - here is one answer on SO. It has very nice tutorials on many things. For a long time I thought you either use caret
or mlr
but they complement each other very nicely.