Difference between .groups argument and ungroup() in dplyr?


I'm looking at some code:

df1 <- inner_join(metadata, otu_counts, by="sample_id") %>%
  inner_join(., taxonomy, by="otu") %>% 
  group_by(sample_id) %>%
  mutate(rel_abund = count / sum(count)) %>% 
  ungroup() %>% 
  select(-count)

I understand this first chunk completely. I'm new to dplyr, though, so I can only assume that the '.groups = "drop"' in the second chunk does the same thing as the ungroup() in the first.

If so, is it available there because the last function is summarize()?

df2 <- df1 %>%
  filter(level=="phylum") %>%
  group_by(disease_stat, sample_id, taxon) %>%
  summarize(rel_abund = sum(rel_abund), .groups="drop") %>%
  group_by(disease_stat, taxon) %>%
  summarize(mean_rel_abund = 100*mean(rel_abund), .groups="drop") 

Can someone explain?

UPDATE: I realize that the first .groups = "drop" eliminates a newly created variable which was sample_id. Is there more to this?


Solution

  • This is a feature specific to summarize. When the data are grouped by more than one variable, summarize by default drops only the last grouping variable and keeps the rest, so with two grouping variables the output is still grouped by the first.

    library(wec)
    library(dplyr)
    
    data(PUMS)
    
    PUMS %>%
      group_by(race, education.cat) %>%
      summarise(hi = mean(wage))
    
    # # A tibble: 8 × 3
    # # Groups:   race [4]
    #   race     education.cat     hi
    #   <fct>    <fct>          <dbl>
    # 1 Hispanic High school   35149.
    # 2 Hispanic Degree        52344.
    # 3 Black    High school   30552.
    # 4 Black    Degree        48243.
    # 5 Asian    High school   35350 
    # 6 Asian    Degree        78213.
    # 7 White    High school   38532.
    # 8 White    Degree        69135.
    

    Notice that the result above is still grouped by race (4 groups). If you use the .groups = "drop" argument in summarize, the summary values are identical but the result is no longer grouped.

    PUMS %>%
      group_by(race, education.cat) %>%
      summarise(hi = mean(wage), .groups = "drop")
    
    # # A tibble: 8 × 3
    #   race     education.cat     hi
    #   <fct>    <fct>          <dbl>
    # 1 Hispanic High school   35149.
    # 2 Hispanic Degree        52344.
    # 3 Black    High school   30552.
    # 4 Black    Degree        48243.
    # 5 Asian    High school   35350 
    # 6 Asian    Degree        78213.
    # 7 White    High school   38532.
    # 8 White    Degree        69135.
    

    So yes, .groups = "drop" has the same effect as calling ungroup() after summarize. The mutate function in your first example has no .groups argument and leaves the grouping unchanged, so you need a separate ungroup() call if you want to remove the grouping afterwards.
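
A quick way to check any of this yourself is group_vars(), which lists the variables a data frame is currently grouped by. The sketch below is only an illustration: it uses the built-in mtcars data rather than your tables, and the object names (by_three, default_out, and so on) are made up for the example. It groups by three variables, as your second chunk does, and compares the default summarise, .groups = "drop", and an explicit ungroup().

```r
library(dplyr)

# Three grouping variables, mirroring the three in the question
# (mtcars and all object names here are illustrative assumptions).
by_three <- mtcars %>%
  group_by(cyl, gear, am)

# Default behaviour: only the last grouping variable (am) is dropped.
default_out <- by_three %>%
  summarise(mean_mpg = mean(mpg))
group_vars(default_out)   # "cyl" "gear"

# .groups = "drop" removes all grouping in the same call.
dropped_out <- by_three %>%
  summarise(mean_mpg = mean(mpg), .groups = "drop")
group_vars(dropped_out)   # character(0)

# Calling ungroup() after the default summarise leaves the same data.
ungrouped_out <- by_three %>%
  summarise(mean_mpg = mean(mpg)) %>%
  ungroup()
same_data <- isTRUE(all.equal(as.data.frame(dropped_out),
                              as.data.frame(ungrouped_out)))

# mutate() leaves the grouping untouched, hence the explicit ungroup()
# in the first chunk of the question.
mutated <- mtcars %>%
  group_by(cyl) %>%
  mutate(share = mpg / sum(mpg))
group_vars(mutated)       # "cyl"
```

The default call also prints a message saying the output is still grouped and that you can override this with the .groups argument. Setting .groups explicitly, as your code does, silences that message.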