dplyr::group_by() with multiple variables but NOT intersection


When you group_by multiple variables, dplyr helpfully finds the intersection of those groups, i.e. it forms one group for every combination of their values.

For example,

library(dplyr)

mtcars %>% 
  group_by(cyl, am) %>%
  summarise(mean(disp))

yields

Source: local data frame [6 x 3]
Groups: cyl [?]

    cyl    am `mean(disp)`
  <dbl> <dbl>        <dbl>
1     4     0     135.8667
2     4     1      93.6125
3     6     0     204.5500
4     6     1     155.0000
5     8     0     357.6167
6     8     1     326.0000

My question is, is there a way to provide multiple variables, but to summarize marginally? I want output like what you get if you do this by hand, variable by variable.

df_1 <- 
  mtcars %>% 
  group_by(cyl) %>%
  summarise(est = mean(disp)) %>%
  transmute(group = paste0("cyl_", cyl), est)

df_2 <- 
  mtcars %>% 
  group_by(am) %>%
  summarise(est = mean(disp)) %>%
  transmute(group = paste0("am_", am), est)

bind_rows(df_1, df_2)

The above code yields

# A tibble: 5 × 2
  group      est
  <chr>    <dbl>
1 cyl_4 105.1364
2 cyl_6 183.3143
3 cyl_8 353.1000
4  am_0 290.3789
5  am_1 143.5308

Ideally, the syntax would be something like

mtcars %>%
  group_by(cyl, am, intersection = FALSE) %>%
  summarise(est = mean(disp))

Does something like this exist in the tidyverse?

(p.s., I get that my group variable in the table above isn't tidy in the sense that it contains two variables in one, but I promise for my purpose it's tidy, OK? :) )


Solution

  • I'm guessing what you're looking for is the tidyr package...

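    gather first stacks the dataset so that each original row appears once per grouping variable (here twice: once for cyl and once for am), with the variable's name in col and its value in value; mutate then pastes the two together to create the grouping variable.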

    library(dplyr)
    library(tidyr)
    
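    mtcars %>%
      # stack cyl and am into key/value columns; each original row now appears twice
      gather(col, value, cyl, am) %>% 
      # build labels such as "cyl_4" or "am_0"
      mutate(group = paste(col, value, sep = "_")) %>%
      group_by(group) %>% 
      summarise(est = mean(disp))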
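
In tidyr 1.0.0 and later, gather is superseded by pivot_longer, so the same idea can probably be written as below. This is only a sketch of the equivalent reshaping, not part of the original answer; the name marginal_means is just an illustrative choice.

```r
library(dplyr)
library(tidyr)

marginal_means <- mtcars %>%
  # stack cyl and am into a name column ("col") and a value column ("value");
  # every other column, including disp, is repeated for each stacked variable
  pivot_longer(c(cyl, am), names_to = "col", values_to = "value") %>%
  # build labels such as "cyl_4" or "am_0"
  mutate(group = paste(col, value, sep = "_")) %>%
  # one group per variable/value pair, i.e. the marginal groups
  group_by(group) %>%
  summarise(est = mean(disp))

marginal_means
```

The result should have the same five groups and means as the hand-built bind_rows() table in the question, although summarise returns them sorted by group, so the am_ rows come before the cyl_ rows. Note that pivot_longer needs the stacked columns to share a type; cyl and am are both numeric here, so no conversion is needed.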