Search code examples
rdplyrdbplyr

dbplyr group by dynamic variable names


How can I group using a dynamic variable(s) name(s) ?

My example : Try to group on Species column knowing it is in the grouping_variable var ?

library(dplyr)
library(dbplyr)
library(DBI)

# My table
iris_table <- tbl(src = my_db_conn, in_schema(schema = "my_schema", table = "iris_table"))

# The grouping variable
grouping_variable <- "Species"

# My tries
iris_table %>%
  group_by(across(any_of(grouping_variable))) %>%
  summarise(sum_petal_length = sum(Petal.Length))
### ==> returns error

iris_table %>%
  group_by(!!!grouping_variable) %>%
  summarise(sum_petal_length = sum(Petal.Length))
### ==> returns grouping by the character "Species"

My sessionInfo() :

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] DBI_1.1.0               dbplyr_1.4.4            dplyr_1.0.0             lubridate_1.7.9        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5          rstudioapi_0.11     magrittr_1.5        hms_0.5.3           odbc_1.3.0          tidyselect_1.1.0   
 [7] bit_1.1-15.2        R6_2.4.1            rlang_0.4.7         fansi_0.4.1         blob_1.2.1          tools_3.6.2        
[13] utf8_1.1.4          cli_2.0.2           ellipsis_0.3.1      readxl_1.3.1        bit64_0.9-7.1       assertthat_0.2.1   
[19] tibble_3.0.3        lifecycle_0.2.0     crayon_1.3.4        zip_2.0.4           purrr_0.3.4         tidyr_1.1.0        
[25] vctrs_0.3.2         glue_1.4.1          openxlsx_4.1.5      stringi_1.4.6       cellranger_1.1.0    compiler_3.6.2     
[31] pillar_1.4.6        generics_0.0.2      pkgconfig_2.0.3    

Solution

  • Here are two potential approaches.

    (1) Most similar to the approach you are already using, we first have to tell R that the character string should be treated as symbolic:

    iris_table %>%
      group_by(!!!syms(grouping_variable)) %>%
      summarise(sum_petal_length = sum(Petal.Length))
    

    Note the syms before the !!!. This approach uses some features of the rlang package that can be useful in other contexts. However, it is no longer the recommended approach for programming with dplyr.

    (2) The recommended approach for doing this kind of programming with dplyr is:

    iris_table %>%
      group_by(.data[[grouping_variable]]) %>%
      summarise(sum_petal_length = sum(Petal.Length))
    

    Both of these approaches will give you the correct SQL translation when working with dbplyr:

    data(iris)
    iris_table = tbl_lazy(iris, con = simulate_mssql())
    # The grouping variable
    grouping_variable <- "Species"
    
    # approach 1
    iris_table %>%
      group_by(!!!syms(grouping_variable)) %>%
      summarise(sum_petal_length = sum(Petal.Length))
    # translation from approach 1
    # <SQL>
    # SELECT `Species`, SUM(`Petal.Length`) AS `sum_petal_length`
    # FROM `df`
    # GROUP BY `Species`
    
    # approach 2
    iris_table %>%
      group_by(.data[[grouping_variable]]) %>%
      summarise(sum_petal_length = sum(Petal.Length))
    # translation from approach 2
    # <SQL>
    # SELECT `Species`, SUM(`Petal.Length`) AS `sum_petal_length`
    # FROM `df`
    # GROUP BY `Species`