Search code examples
rdplyrtidyeval

Group by dynamically generated column with pattern in `dplyr`


I am trying to group by a column that is dynamically generated inside previous steps in my function. The column will have a pattern (suffix). I could use dplyr::contains(), but it is not the most accurate way, since I only have one column and could potentially create nasty bugs if other columns with suffix appear in the future.

fun <- function(df, target){
  # previous steps generate a dynamic columns
  # dynamic columns have suffixes
  # let's assume I only care about Length
  df %>% 
    group_by("{target}.Length")
  # do more stuff
}

I am using dplyr == 1.0.8 and thought that I this type of thing was supported. I am probably missing something really simple, and "Petal" does not get pasted properly and interpreted as a proper column name. I tried several combinations of {} and {{}} and it is not working as I would expect. For example:

> fun(iris, "Petal") %>% head()
# A tibble: 6 × 6
# Groups:   "{target}.Length" [1]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species `"{target}.Length"`
         <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>              
1          5.1         3.5          1.4         0.2 setosa  {target}.Length    
2          4.9         3            1.4         0.2 setosa  {target}.Length    
3          4.7         3.2          1.3         0.2 setosa  {target}.Length    
4          4.6         3.1          1.5         0.2 setosa  {target}.Length    
5          5           3.6          1.4         0.2 setosa  {target}.Length    
6          5.4         3.9          1.7         0.4 setosa  {target}.Length 

I am using the same type of notation for mutate() successfully, but it doesn't for group_by().


Solution

  • We can use !! sym(...) on a string to evaluate it early as object (name).

    We can also use the .data pronoun, but when using it inside a package you should import it to avoid R CMD check notes (see here).

    library(dplyr)
    
    fun <- function(df, target){
      df %>% 
        group_by(!! sym(paste0(target,".Length")))
    }
    
    fun(iris, "Petal") %>% head()
    
    #> # A tibble: 6 x 5
    #> # Groups:   Petal.Length [4]
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    #>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
    #> 1          5.1         3.5          1.4         0.2 setosa 
    #> 2          4.9         3            1.4         0.2 setosa 
    #> 3          4.7         3.2          1.3         0.2 setosa 
    #> 4          4.6         3.1          1.5         0.2 setosa 
    #> 5          5           3.6          1.4         0.2 setosa 
    #> 6          5.4         3.9          1.7         0.4 setosa
    

    Why does this work?

    Lets have a look at it step by step when building our function above:

    fun1 <- function(df){
      df %>% 
        group_by("Sepal.Length")
    }
    

    ⬇️ groups by new variable which is named after its value / call

    iris %>% 
      fun1()
    #> # A tibble: 150 x 6
    #> # Groups:   "Sepal.Length" [1]
    #>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species `"Sepal.Length"`
    #>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>           
    #>  1          5.1         3.5          1.4         0.2 setosa  Sepal.Length    
    #>  2          4.9         3            1.4         0.2 setosa  Sepal.Length    
    #>  3          4.7         3.2          1.3         0.2 setosa  Sepal.Length    
    #>  4          4.6         3.1          1.5         0.2 setosa  Sepal.Length    
    #>  5          5           3.6          1.4         0.2 setosa  Sepal.Length    
    #>  6          5.4         3.9          1.7         0.4 setosa  Sepal.Length    
    #>  7          4.6         3.4          1.4         0.3 setosa  Sepal.Length    
    #>  8          5           3.4          1.5         0.2 setosa  Sepal.Length    
    #>  9          4.4         2.9          1.4         0.2 setosa  Sepal.Length    
    #> 10          4.9         3.1          1.5         0.1 setosa  Sepal.Length    
    #> # ... with 140 more rows
    
    fun2 <- function(df){
      df %>% 
        group_by(as.name("Sepal.Length"))
    }
    

    ⬇️ error: group_by needs a vetor as input, but has received a name / symbol

    iris %>% 
      fun2()
    #> Error in `add_computed_columns()`:
    #> ! Problem adding computed columns in `group_by()`.
    #> x Problem with `mutate()` input `..1`.
    #> i `..1 = as.name("Sepal.Length")`.
    #> x `..1` must be a vector, not a symbol.
    #> Caused by error in `mutate_cols()`:
    #> ! Problem with `mutate()` input `..1`.
    #> i `..1 = as.name("Sepal.Length")`.
    #> x `..1` must be a vector, not a symbol.
    
    
    
    fun3 <- function(df){
      df %>% 
        group_by(eval(as.name("Sepal.Length")))
    }
    

    ⬇️ works, but creates new column which is named after call to eval:

    iris %>% 
      fun3()
    #> # A tibble: 150 x 6
    #> # Groups:   eval(as.name("Sepal.Length")) [35]
    #>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species `eval(as.name("Sep~
    #>           <dbl>       <dbl>        <dbl>       <dbl> <fct>                 <dbl>
    #>  1          5.1         3.5          1.4         0.2 setosa                  5.1
    #>  2          4.9         3            1.4         0.2 setosa                  4.9
    #>  3          4.7         3.2          1.3         0.2 setosa                  4.7
    #>  4          4.6         3.1          1.5         0.2 setosa                  4.6
    #>  5          5           3.6          1.4         0.2 setosa                  5  
    #>  6          5.4         3.9          1.7         0.4 setosa                  5.4
    #>  7          4.6         3.4          1.4         0.3 setosa                  4.6
    #>  8          5           3.4          1.5         0.2 setosa                  5  
    #>  9          4.4         2.9          1.4         0.2 setosa                  4.4
    #> 10          4.9         3.1          1.5         0.1 setosa                  4.9
    #> # ... with 140 more rows
    
    
    fun4 <- function(df){
      df %>% 
        group_by(!! as.name("Sepal.Length"))
    }
    

    ⬇️ works: !! does the same as eval, but works only inside rlang powered functions and it is evaluated faster than the normal eval. For this reason group_by correctly gets the name of the column to evaluate.

    iris %>% 
      fun4()
    #> # A tibble: 150 x 5
    #> # Groups:   Sepal.Length [35]
    #>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    #>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
    #>  1          5.1         3.5          1.4         0.2 setosa 
    #>  2          4.9         3            1.4         0.2 setosa 
    #>  3          4.7         3.2          1.3         0.2 setosa 
    #>  4          4.6         3.1          1.5         0.2 setosa 
    #>  5          5           3.6          1.4         0.2 setosa 
    #>  6          5.4         3.9          1.7         0.4 setosa 
    #>  7          4.6         3.4          1.4         0.3 setosa 
    #>  8          5           3.4          1.5         0.2 setosa 
    #>  9          4.4         2.9          1.4         0.2 setosa 
    #> 10          4.9         3.1          1.5         0.1 setosa 
    #> # ... with 140 more rows
    
    fun5 <- function(df){
      df %>% 
        group_by(!! sym("Sepal.Length"))
    }
    

    works: sym is almost simmilar to base R’s as.name, but it's shorter.

    iris %>% 
      fun5()
    
    #> # A tibble: 150 x 5
    #> # Groups:   Sepal.Length [35]
    #>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    #>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
    #>  1          5.1         3.5          1.4         0.2 setosa 
    #>  2          4.9         3            1.4         0.2 setosa 
    #>  3          4.7         3.2          1.3         0.2 setosa 
    #>  4          4.6         3.1          1.5         0.2 setosa 
    #>  5          5           3.6          1.4         0.2 setosa 
    #>  6          5.4         3.9          1.7         0.4 setosa 
    #>  7          4.6         3.4          1.4         0.3 setosa 
    #>  8          5           3.4          1.5         0.2 setosa 
    #>  9          4.4         2.9          1.4         0.2 setosa 
    #> 10          4.9         3.1          1.5         0.1 setosa 
    #> # ... with 140 more rows
    

    Created on 2022-10-06 by the reprex package (v2.0.1)