
Multiply, select and optionally group arbitrary variables programmatically with dplyr


In my code, which uses dplyr, I often perform certain operations on a dataframe variable (here assumed to be simply multiplication by 2, to simplify the MRE), optionally group on another variable, and then select only some of the resulting variables. To prevent code duplication, I want to write a function.

The test dataframe is

library(dplyr)
library(ggplot2)  # provides the msleep dataset
msleep_mini <- msleep[1:20, ]

The function must reproduce the following behavior. If called with a single argument, say, sleep_total, it simply multiplies sleep_total by 2, and returns a dataframe containing the columns name, vore, order and sleep_total:

# test_1
msleep_mini %>%
  group_double_select(sleep_total)
#> # A tibble: 20 x 4
#>    name                       vore  order           sleep_total
#>    <chr>                      <chr> <chr>                 <dbl>
#>  1 Cheetah                    carni Carnivora              24.2
#>  2 Owl monkey                 omni  Primates               34  
#>  3 Mountain beaver            herbi Rodentia               28.8
#>  4 Greater short-tailed shrew omni  Soricomorpha           29.8
#>  5 Cow                        herbi Artiodactyla            8  
#>  6 Three-toed sloth           herbi Pilosa                 28.8
#>  7 Northern fur seal          carni Carnivora              17.4
#>  8 Vesper mouse               <NA>  Rodentia               14  
#>  9 Dog                        carni Carnivora              20.2
#> 10 Roe deer                   herbi Artiodactyla            6  

If called with two arguments, the second one is interpreted as a grouping variable. Again, the first one is multiplied by 2, but now the dataframe is also grouped by the second argument and sorted by it, and an id column holding the running row number within each group is added. In other words, the output would be

# test_2
msleep_mini %>%
  group_double_select(sleep_total, vore)
#> # A tibble: 20 x 5
#> # Groups:   vore [4]
#>    vore  name                       order           sleep_total    id
#>    <chr> <chr>                      <chr>                 <dbl> <int>
#>  1 carni Cheetah                    Carnivora              24.2     1
#>  2 carni Northern fur seal          Carnivora              17.4     2
#>  3 carni Dog                        Carnivora              20.2     3
#>  4 carni Long-nosed armadillo       Cingulata              34.8     4
#>  5 herbi Mountain beaver            Rodentia               28.8     1
#>  6 herbi Cow                        Artiodactyla            8       2
#>  7 herbi Three-toed sloth           Pilosa                 28.8     3
#>  8 herbi Roe deer                   Artiodactyla            6       4
#>  9 herbi Goat                       Artiodactyla           10.6     5
#> 10 herbi Guinea pig                 Rodentia               18.8     6
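
For reference, the hard-coded pipeline that test_2 should reproduce can be sketched as follows. This is only the non-programmatic version for one pair of columns, which the function is meant to generalize. It assumes msleep_mini holds the first 20 rows of msleep, as the 20-row outputs suggest, and test_2_by_hand is a name introduced here purely for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Assumption: the test data is the first 20 rows of msleep
msleep_mini <- msleep[1:20, ]

# Hard-coded equivalent of group_double_select(sleep_total, vore)
test_2_by_hand <- msleep_mini %>%
  select(vore, name, order, sleep_total) %>%  # grouping column first
  mutate(sleep_total = sleep_total * 2) %>%   # double the target column
  group_by(vore) %>%
  mutate(id = row_number()) %>%               # running row number per group
  arrange(vore)                               # sort by the grouping column

test_2_by_hand
```

Every column name in this pipeline is fixed, which is exactly the duplication I would like to avoid.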

Of course, the function must work with arbitrary variables (as long as they can be found in the dataframe):

# test_3
msleep_mini %>%
  group_double_select(sleep_rem, order)
#> # A tibble: 20 x 5
#> # Groups:   order [9]
#>    order           name                       vore  sleep_rem    id
#>    <chr>           <chr>                      <chr>     <dbl> <int>
#>  1 Artiodactyla    Cow                        herbi       1.4     1
#>  2 Artiodactyla    Roe deer                   herbi      NA       2
#>  3 Artiodactyla    Goat                       herbi       1.2     3
#>  4 Carnivora       Cheetah                    carni      NA       1
#>  5 Carnivora       Northern fur seal          carni       2.8     2
#>  6 Carnivora       Dog                        carni       5.8     3
#>  7 Cingulata       Long-nosed armadillo       carni       6.2     1
#>  8 Didelphimorphia North American Opossum     omni        9.8     1
#>  9 Hyracoidea      Tree hyrax                 herbi       1       1
#> 10 Pilosa          Three-toed sloth           herbi       4.4     1

It seems to me that the only way to write group_double_select in a robust and maintainable way is to use tidy evaluation, but I may be wrong. Can you help me?


Solution

  • We can use missing() inside the function to check whether the grouping argument was supplied, and group, number and sort the rows only when it was.

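    group_double_select <- function(data, colVar, groupVar) {
      # Capture the column expression without evaluating it
      colVar <- enquo(colVar)

      if (missing(groupVar)) {
        # No grouping variable: keep the fixed columns and double the target column
        data %>%
          select(name, vore, order, !!colVar) %>%
          mutate(!!quo_name(colVar) := !!colVar * 2)
      } else {
        groupVar <- enquo(groupVar)
        # Grouping variable supplied: also number the rows within each group
        # and sort by the grouping variable
        data %>%
          select(name, vore, order, !!colVar) %>%
          mutate(!!quo_name(colVar) := !!colVar * 2) %>%
          group_by(!!groupVar) %>%
          mutate(id = row_number()) %>%
          arrange(!!groupVar)
      }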
    }
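
To make the two mechanisms used above more concrete, here is a minimal sketch that only reports what a function can learn about its arguments without evaluating them. The helper describe_args is hypothetical and exists only for illustration; as_label() is the rlang function that turns a captured expression into a string, much like quo_name() does in the function above.

```r
library(rlang)

# Hypothetical helper, for illustration only
describe_args <- function(colVar, groupVar) {
  colVar <- enquo(colVar)            # capture the expression, do not evaluate it
  list(
    column    = as_label(colVar),    # the captured expression as a string
    has_group = !missing(groupVar)   # TRUE only when a second argument was passed
  )
}

describe_args(sleep_total)
describe_args(sleep_total, vore)
```

Neither sleep_total nor vore exists as an object here, yet both calls run, because missing() and enquo() inspect the arguments rather than evaluate them.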
    

    Testing:

    msleep_mini %>%
           group_double_select(sleep_total, vore) %>%
           head
    # A tibble: 6 x 5
    # Groups:   vore [2]
    #  name                 vore  order        sleep_total    id
    #  <chr>                <chr> <chr>              <dbl> <int>
    #1 Cheetah              carni Carnivora           24.2     1
    #2 Northern fur seal    carni Carnivora           17.4     2
    #3 Dog                  carni Carnivora           20.2     3
    #4 Long-nosed armadillo carni Cingulata           34.8     4
    #5 Mountain beaver      herbi Rodentia            28.8     1
    #6 Cow                  herbi Artiodactyla         8       2
    
    msleep_mini %>% 
           group_double_select(sleep_total) %>%
           head
    # A tibble: 6 x 4
    #  name                       vore  order        sleep_total
    #  <chr>                      <chr> <chr>              <dbl>
    #1 Cheetah                    carni Carnivora           24.2
    #2 Owl monkey                 omni  Primates            34  
    #3 Mountain beaver            herbi Rodentia            28.8
    #4 Greater short-tailed shrew omni  Soricomorpha        29.8
    #5 Cow                        herbi Artiodactyla         8  
    #6 Three-toed sloth           herbi Pilosa              28.8
    
    msleep_mini %>%
           group_double_select(sleep_rem, order) %>%
           head
    # A tibble: 6 x 5
    # Groups:   order [2]
    #  name              vore  order        sleep_rem    id
    #  <chr>             <chr> <chr>            <dbl> <int>
    #1 Cow               herbi Artiodactyla       1.4     1
    #2 Roe deer          herbi Artiodactyla      NA       2
    #3 Goat              herbi Artiodactyla       1.2     3
    #4 Cheetah           carni Carnivora         NA       1
    #5 Northern fur seal carni Carnivora          2.8     2
    #6 Dog               carni Carnivora          5.8     3
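
As a possible alternative, the same idea can be sketched with the embrace operator {{ }} (assuming rlang 0.4.0 or later) and across() (assuming dplyr 1.0.0 or later), which avoids the explicit enquo() and !! steps. The name group_double_select2 is made up for this sketch. It also differs from the function above in one respect: the grouping column is added to select(), placed first, so grouping by a column other than name, vore or order does not fail because that column was already dropped.

```r
library(dplyr)
library(ggplot2)  # provides the msleep dataset

# Hypothetical variant; the function name is illustrative
group_double_select2 <- function(data, colVar, groupVar) {
  if (missing(groupVar)) {
    data %>%
      select(name, vore, order, {{ colVar }}) %>%
      mutate(across({{ colVar }}, ~ .x * 2))
  } else {
    data %>%
      # Keep the grouping column, placed first; repeated names are kept once
      select({{ groupVar }}, name, vore, order, {{ colVar }}) %>%
      mutate(across({{ colVar }}, ~ .x * 2)) %>%
      group_by({{ groupVar }}) %>%
      mutate(id = row_number()) %>%
      arrange({{ groupVar }})
  }
}

# Assumption: the test data is the first 20 rows of msleep
msleep_mini <- msleep[1:20, ]

ungrouped  <- group_double_select2(msleep_mini, sleep_total)
by_vore    <- group_double_select2(msleep_mini, sleep_total, vore)
by_conserv <- group_double_select2(msleep_mini, sleep_total, conservation)
```

The last call groups by conservation, a column outside the fixed name, vore, order set, which is the case the extra select() entry is meant to cover.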