r, dplyr, group-by, dummy-variable

Using group_by and summarise_all to create dummy indicators for a categorical variable


I want to generate dummy indicators for each id for the categorical variable fruit. The code below gives the following warning when I use summarise_all with a self-defined function. I also tried summarise_all(any), which warns about coercing double to logical. Is there a more efficient or up-to-date way to implement this? Thanks a lot!

library(dplyr)

fruit = c("apple", "banana", "orange", "pear",
          "strawberry", "blueberry", "durian",
          "grape", "pineapple")
# Fruits are sampled at random, so results differ between runs
df_sample = data.frame(id = c(rep("a", 3), rep("b", 5), rep("c", 6)),
                       fruit = c(sample(fruit, replace = TRUE, size = 3),
                                 sample(fruit, replace = TRUE, size = 5),
                                 sample(fruit, replace = TRUE, size = 6)))

fruit_indicator = 
  model.matrix(~ -1 + fruit, df_sample) %>%
  as.data.frame() %>%
  bind_cols(df_sample) %>%
  select(-fruit) %>%
  group_by(id) %>%
  summarise_all(funs(ifelse(any(. > 0), 1, 0)))


# Warning message:
#   `funs()` is deprecated as of dplyr 0.8.0.
# Please use a list of either functions or lambdas: 
#   
#   # Simple named list: 
#   list(mean = mean, median = median)
# 
#   # Auto named with `tibble::lst()`: 
#   tibble::lst(mean, median)
# 
#   # Using lambdas
#   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))


Solution

  • You can use across(), which is available in dplyr 1.0.0 or higher. It replaces summarise_all and takes a lambda directly, so funs() is no longer needed.

    library(dplyr)
    
    model.matrix(~ -1 + fruit, df_sample) %>%
      as.data.frame() %>%
      bind_cols(df_sample) %>%
      select(-fruit) %>%
      group_by(id) %>%
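      # everything() selects all non-grouping columns; omitting .cols is deprecated in newer dplyr
      summarise(across(everything(), ~ as.integer(any(.x > 0))))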
    
    #  id    fruitapple fruitbanana fruitdurian fruitgrape fruitpear
    #* <chr>      <int>       <int>       <int>      <int>     <int>
    #1 a              0           1           1          0         1
    #2 b              1           0           0          1         0
    #3 c              0           1           0          1         1
    # … with 1 more variable: fruitpineapple <int>
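
As a cross-check that does not depend on dplyr, the same id-by-fruit indicators can be sketched in base R with table(). This is an alternative, not the method from the answer above. The names df_small, counts and indicator are made up for illustration, and the data is fixed rather than sampled so the result is reproducible. The columns come out named after the fruit levels, without the "fruit" prefix that model.matrix adds.

```r
# Small fixed example; df_small is a hypothetical stand-in for df_sample
df_small <- data.frame(
  id    = c("a", "a", "b", "b", "b", "c"),
  fruit = c("apple", "pear", "apple", "apple", "grape", "pear")
)

# Cross-tabulate id against fruit: one row per id, one column per fruit
counts <- table(df_small$id, df_small$fruit)

# Turn counts into 0/1 indicators and convert to a plain data frame
indicator <- as.data.frame.matrix((counts > 0) * 1L)

# Move the ids from the row names into a regular column
indicator <- cbind(id = rownames(indicator), indicator)
rownames(indicator) <- NULL

print(indicator)
```

Only fruits that actually occur in the data get a column, which matches the behaviour of model.matrix on a character column.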