Tags: r, dplyr, tidyeval, nse

specify variable names when grouping


I am using dplyr v1.0.2 to manipulate tibbles. I would like to call group_by() with the grouping variables (the ... argument) specified by a function or a regular expression rather than by name. The only solution that I've found is clunky. Is there a relatively simple way?

Here is a minimal example that demonstrates the problem:

library(dplyr)
data(iris)
iris[, -(rbinom(1, 1, .5) + 1) ] %>%  # randomly drop "Sepal.Length" or "Sepal.Width"
  group_by(matches("^Sepal\\."))

In the third line, I randomly drop one of the two "Sepal" columns. In the last line, I want to group by the remaining "Sepal" column. The problem is that I don't know its name: it could be either "Sepal.Length" or "Sepal.Width". The group_by() call in the last line doesn't work: predictably, it fails with the error "matches() must be used within a *selecting* function".

By contrast, this code works, but it is a bit clunky:

iris[, -(rbinom(1, 1, .5) + 1) ]  %>%
  group_by(!!as.name(grep('Sepal', colnames(.), value = TRUE)))

Is there a simpler way to do the grouping on the second line?


Solution

  • You can use across() to select the columns. Selection helpers such as starts_with() and matches() work inside it:

    iris[, -(rbinom(1, 1, .5) + 1) ]  %>%
      group_by(across(starts_with('Sepal')))
    

    Output from a run in which Sepal.Width happened to be dropped:

    # A tibble: 150 x 4
    # Groups:   Sepal.Length [35]
       Sepal.Length Petal.Length Petal.Width Species
              <dbl>        <dbl>       <dbl> <fct>  
     1          5.1          1.4         0.2 setosa 
     2          4.9          1.4         0.2 setosa 
     3          4.7          1.3         0.2 setosa 
     4          4.6          1.5         0.2 setosa 
     5          5            1.4         0.2 setosa 
     6          5.4          1.7         0.4 setosa 
     7          4.6          1.4         0.3 setosa 
     8          5            1.5         0.2 setosa 
     9          4.4          1.4         0.2 setosa 
    10          4.9          1.5         0.1 setosa 
    # … with 140 more rows
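
Because the column drop in the question is random, the result is hard to check by eye. Below is a minimal, deterministic sketch of the same across() approach, assuming dplyr 1.0.0 or later (the first release with across()). The helper name group_by_sepal is hypothetical and only there for illustration; it drops a named column instead of a random one and reuses the regular expression from the question.

```r
library(dplyr)

# Hypothetical helper (not from the original post): drop one "Sepal" column
# by name, then group by whichever "Sepal" column is left.
group_by_sepal <- function(drop_col) {
  iris %>%
    select(-all_of(drop_col)) %>%
    # across() is a selecting context, so matches() is allowed here
    group_by(across(matches("^Sepal\\.")))
}

# group_vars() returns the names of the grouping columns
group_vars(group_by_sepal("Sepal.Length"))
group_vars(group_by_sepal("Sepal.Width"))
```

Whichever column is dropped, the grouping variable should be the one that remains, without its name ever being written out in group_by().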