I want to use purrr to group by a unique identifier and then downsample a factor
variable within each group using downSample from the caret package. Here is my code:
out <- train %>% select(stream, HUC12) %>%
na.omit() %>% group_by(HUC12) %>%
nest %>% mutate(prop = map(data, ~downSample(.x, factor('stream'))))
Any help would be much appreciated. Here's some sample data.
train <- data.frame(stream = factor(sample(x= 0:1, size = 100, replace = TRUE,
prob = c(0.25,.75))), HUC12 = rep(c("a","b","c","d")))
Generate data:
set.seed(100)
train <- data.frame(stream = factor(sample(x= 0:1, size = 100, replace = TRUE,
prob = c(0.25,.75))), HUC12 = rep(c("a","b","c","d")))
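Try something like this. Because downSample returns a data.frame, we can use the do
function in dplyr to perform the downsampling within each group. Note that the second
argument has to be the factor column itself (.$stream); factor('stream') in your code
builds a one-element factor from the string "stream" rather than using the column.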
library(dplyr)
library(caret)  # provides downSample()
down_train <- train %>% select(stream, HUC12) %>%
na.omit() %>% group_by(HUC12) %>% do(downSample(., .$stream))
We can check that the two classes are now balanced within each group:
down_train %>% count(HUC12,stream)
# A tibble: 8 x 3
# Groups:   HUC12 [4]
  HUC12 stream     n
  <fct> <fct>  <int>
1 a     0          1
2 a     1          1
3 b     0          4
4 b     1          4
5 c     0         11
6 c     1         11
7 d     0          8
8 d     1          8
And in the original data:
train %>% count(HUC12,stream)
# A tibble: 8 x 3
  HUC12 stream     n
  <fct> <fct>  <int>
1 a     0          1
2 a     1         24
3 b     0          4
4 b     1         21
5 c     0         11
6 c     1         14
7 d     0          8
8 d     1         17
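Since the question asked for purrr specifically, and do() is marked as superseded in recent dplyr releases, here is a minimal sketch of the nest/map version. It assumes tidyr and purrr are installed alongside dplyr and caret; the as.data.frame() call is a precaution I added rather than something downSample is known to require, and the exact rows sampled will vary with your R version and seed.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(caret)

set.seed(100)
train <- data.frame(
  stream = factor(sample(x = 0:1, size = 100, replace = TRUE,
                         prob = c(0.25, 0.75))),
  HUC12 = rep(c("a", "b", "c", "d"))
)

down_train <- train %>%
  select(stream, HUC12) %>%
  na.omit() %>%
  group_by(HUC12) %>%
  nest() %>%
  # Pass the factor column itself as the second argument, not factor('stream').
  mutate(prop = map(data, ~ downSample(as.data.frame(.x), .x$stream))) %>%
  select(HUC12, prop) %>%
  unnest(prop) %>%
  ungroup()

# Per-group class counts after downsampling
counts <- down_train %>% count(HUC12, stream)
```

The result also carries the Class column that downSample appends by default; it duplicates stream here, so you can drop it with select(-Class) if you do not need it.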