r, tidyverse, purrr, r-caret

Downsampling within groups defined by a unique identifier using purrr


I want to use purrr to group by a unique identifier and then downsample a factor variable with downSample from the caret package. Here is my attempt:

out <- train %>% select(stream, HUC12) %>% 
  na.omit() %>% group_by(HUC12) %>% 
  nest %>% mutate(prop = map(data, ~downSample(.x, factor('stream'))))

Any help would be much appreciated. Here's some sample data.

train <- data.frame(stream = factor(sample(x= 0:1, size = 100, replace = TRUE, 
                    prob = c(0.25,.75))), HUC12 = rep(c("a","b","c","d")))

Solution

  • Generate data:

    set.seed(100)
    train <- data.frame(stream = factor(sample(x= 0:1, size = 100, replace = TRUE, 
                        prob = c(0.25,.75))), HUC12 = rep(c("a","b","c","d")))
    

    Because downSample returns a data.frame, we can use the do function in dplyr to perform the downsampling within each group:

    library(dplyr)
    library(caret)  # provides downSample()
    down_train <- train %>% select(stream, HUC12) %>%
      na.omit() %>% group_by(HUC12) %>%
      do(downSample(., .$stream))  # balance stream within each HUC12 group
    

    We can check:

    down_train %>% count(HUC12,stream)
    
    # A tibble: 8 x 3
    # Groups:   HUC12 [4]
      HUC12 stream     n
      <fct> <fct>  <int>
    1 a     0          1
    2 a     1          1
    3 b     0          4
    4 b     1          4
    5 c     0         11
    6 c     1         11
    7 d     0          8
    8 d     1          8
    

    And in the original data:

    train %>% count(HUC12,stream)
    # A tibble: 8 x 3
      HUC12 stream     n
      <fct> <fct>  <int>
    1 a     0          1
    2 a     1         24
    3 b     0          4
    4 b     1         21
    5 c     0         11
    6 c     1         14
    7 d     0          8
    8 d     1         17
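
Since the question asked about purrr specifically, here is a minimal sketch of the nest/map route. The attempt in the question passes factor('stream'), which is a one-element factor built from the literal string, not the stream column. The sketch passes .x$stream instead. The name down_train_purrr and the as.data.frame() conversion are my own additions (the conversion is only a precaution, on the assumption that downSample is safest with a plain data.frame), and it assumes a tidyr version in which nest() on a grouped data frame nests the non-grouping columns. Exact counts depend on the R version and seed, so no output is shown.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(caret)

set.seed(100)
train <- data.frame(
  stream = factor(sample(x = 0:1, size = 100, replace = TRUE,
                         prob = c(0.25, 0.75))),
  HUC12 = rep(c("a", "b", "c", "d"), times = 25)
)

down_train_purrr <- train %>%
  select(stream, HUC12) %>%
  na.omit() %>%
  group_by(HUC12) %>%
  nest() %>%
  # Use the real stream column of each nested data frame as the class vector
  mutate(data = map(data, ~ downSample(as.data.frame(.x), .x$stream))) %>%
  unnest(data) %>%
  ungroup()

# Within each HUC12 group, both levels of stream should now have the same count
down_train_purrr %>% count(HUC12, stream)
```

As with the do version, downSample appends a Class column (a copy of the class vector), which can be dropped with select(-Class) if it is not needed.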