How to interpret dplyr message `summarise()` regrouping output by 'x' (override with `.groups` argument)?


I started getting a new message (see post title) when running group_by and summarise() after updating to dplyr development version 0.8.99.9003.

Here is an example to recreate the output:

library(tidyverse)
library(hablar)
df <- read_csv("year, week, rat_house_females, rat_house_males, mouse_wild_females, mouse_wild_males 
               2018,10,1,1,1,1
               2018,10,1,1,1,1
               2018,11,2,2,2,2
               2018,11,2,2,2,2
               2019,10,3,3,3,3
               2019,10,3,3,3,3
               2019,11,4,4,4,4
               2019,11,4,4,4,4") %>% 
  convert(chr(year,week)) %>% 
  mutate(total_rodents = rowSums(select_if(., is.numeric))) %>% 
  convert(num(year,week)) %>% 
  group_by(year,week) %>% summarise(average = mean(total_rodents))

The output tibble is correct, but this message appears:

summarise() regrouping output by 'year' (override with .groups argument)

How should this be interpreted? Why does it report regrouping only by 'year' when I grouped by both year and week? Also, what does it mean to override and why would I want to do that?

I don't think the message indicates a problem because it appears throughout the dplyr vignette: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html

I believe it is a new message because it has only appeared on very recent SO questions such as "How to melt pairwise.wilcox.test output using dplyr?" and "R Aggregate over multiple columns" (neither of which addresses the regrouping/override message).

Thank you!


Solution

  • It is just an informational message about the grouping structure of the result, not a warning or an error; your output is correct. By default, when the input to summarise() is grouped, the result loses one grouping variable: the last one specified in group_by(). If there is only one grouping variable, the result is ungrouped. If there is more than one, the number of grouping variables is reduced by one. In your example the input to summarise() was grouped by two variables (year and week), so the result is grouped by one, 'year'.

    As a reproducible example:

    library(dplyr)
    mtcars %>%
         group_by(am) %>% 
         summarise(mpg = sum(mpg))
    #`summarise()` ungrouping output (override with `.groups` argument)
    # A tibble: 2 x 2
    #     am   mpg
    #* <dbl> <dbl>
    #1     0  326.
    #2     1  317.
    

    The message says the output is being ungrouped, i.e. when group_by() has a single variable, that grouping is dropped after summarise().

    mtcars %>% 
       group_by(am, vs) %>% 
       summarise(mpg = sum(mpg))
    #`summarise()` regrouping output by 'am' (override with `.groups` argument)
    # A tibble: 4 x 3
    # Groups:   am [2]
    #     am    vs   mpg
    #  <dbl> <dbl> <dbl>
    #1     0     0  181.
    #2     0     1  145.
    #3     1     0  118.
    #4     1     1  199.
    

    Here, it drops the last grouping variable ('vs') and leaves the result grouped by 'am'.
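The grouping that remains can also be checked directly instead of being read off the printed header. A minimal sketch, assuming dplyr 1.0.0 or later; the name `by_am_vs` is only illustrative:

```r
library(dplyr)

# Summarise with two grouping variables; by default the last one (vs) is dropped.
by_am_vs <- mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg))

# Names of the grouping variables still attached to the result.
group_vars(by_am_vs)

# TRUE while the result is still a grouped data frame.
is_grouped_df(by_am_vs)
```

group_vars() should list only 'am' here, matching the "# Groups: am [2]" line in the printed tibble.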

    If we check ?summarise, there is a .groups argument. When it is not specified, the usual behaviour is "drop_last"; the other options are "drop", "keep" and "rowwise":

    .groups - Grouping structure of the result.

    "drop_last": dropping the last level of grouping. This was the only supported option before version 1.0.0.

    "drop": All levels of grouping are dropped.

    "keep": Same grouping structure as .data.

    "rowwise": Each row is its own group.

    When .groups is not specified, you either get "drop_last" when all the results are size 1, or "keep" if the size varies. In addition, a message informs you of that choice, unless the option "dplyr.summarise.inform" is set to FALSE.
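The option named in the documentation excerpt above can be set per session to silence the message without touching each summarise() call. A rough sketch, assuming dplyr 1.0.0 or later; `old_opts` and `quiet` are illustrative names:

```r
library(dplyr)

# Turn the informational message off, keeping the previous setting so it can be restored.
old_opts <- options(dplyr.summarise.inform = FALSE)

# No message is printed, but the default grouping behaviour is unchanged:
# the result is still grouped by 'am'.
quiet <- mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg))

group_vars(quiet)

# Restore the previous option value.
options(old_opts)
```

Note that this only hides the message; the result keeps whatever grouping the default produces, so setting .groups explicitly is usually the clearer choice.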

    If we set .groups explicitly in summarise(), the message is no longer shown. With .groups = 'drop', the result is also completely ungrouped:

    mtcars %>% 
        group_by(am) %>%
        summarise(mpg = sum(mpg), .groups = 'drop')
    # A tibble: 2 x 2
    #     am   mpg
    #* <dbl> <dbl>
    #1     0  326.
    #2     1  317.
    
    
    mtcars %>%
       group_by(am, vs) %>%
       summarise(mpg = sum(mpg), .groups = 'drop')
    # A tibble: 4 x 3
    #     am    vs   mpg
    #* <dbl> <dbl> <dbl>
    #1     0     0  181.
    #2     0     1  145.
    #3     1     0  118.
    #4     1     1  199.
    
    
    mtcars %>% 
       group_by(am, vs) %>% 
       summarise(mpg = sum(mpg), .groups = 'drop') %>%
       str()
    #tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
    # $ am : num [1:4] 0 0 1 1
    # $ vs : num [1:4] 0 1 0 1
    # $ mpg: num [1:4] 181 145 118 199
    

    Previously, this message was not shown, which could lead to situations where the user runs a mutate() or another verb assuming the data is ungrouped and gets unexpected output. Now, the message reminds the user that the result may still carry a grouping attribute.
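To make that pitfall concrete, here is a small sketch (assuming dplyr 1.0.0 or later; the `share` column and the object names are made up for illustration) in which the same mutate() gives different results depending on the grouping left behind by summarise():

```r
library(dplyr)

# Result stays grouped by 'am', so sum(mpg) is computed within each 'am' group.
grouped <- mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg), .groups = "drop_last") %>%
  mutate(share = mpg / sum(mpg))

# Result is ungrouped, so sum(mpg) is computed over all four rows.
ungrouped <- mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg), .groups = "drop") %>%
  mutate(share = mpg / sum(mpg))

# Shares add up to 1 per 'am' group in the first case (2 in total),
# and to 1 overall in the second.
sum(grouped$share)
sum(ungrouped$share)
```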

    NOTE: The .groups argument is currently marked as experimental in its lifecycle, so its behaviour could change in future releases.

    Which .groups option to choose depends on whether the data needs further transformation based on the same grouping variables.
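As a side-by-side comparison, the four .groups options can be applied to the same grouped input and the resulting grouping inspected. This is only a sketch, assuming dplyr 1.0.0 or later; all object names are illustrative:

```r
library(dplyr)

grouped_cars <- mtcars %>% group_by(am, vs)

# Same summary, four different grouping structures for the result.
drop_last <- summarise(grouped_cars, mpg = sum(mpg), .groups = "drop_last")
dropped   <- summarise(grouped_cars, mpg = sum(mpg), .groups = "drop")
kept      <- summarise(grouped_cars, mpg = sum(mpg), .groups = "keep")
by_row    <- summarise(grouped_cars, mpg = sum(mpg), .groups = "rowwise")

group_vars(drop_last)           # only 'am' remains
group_vars(dropped)             # no grouping variables
group_vars(kept)                # both 'am' and 'vs' remain
inherits(by_row, "rowwise_df")  # each row is treated as its own group
```

A reasonable rule of thumb is "drop" when no further per-group work is planned and "keep" or "drop_last" when the next step should still operate within groups.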