r purrr tidyeval

Iterate Group_by across Dataframe in R


I'm trying to simplify a current piece of code in my script.

I want to group by every possible pair of categorical variables and summarise the mean of my explanatory variable for each pair.

Example using the mpg dataset from ggplot2:

    library(tidyverse)

    mpg %>% group_by(manufacturer, model) %>% summarise(mean = mean(hwy))
    mpg %>% group_by(manufacturer, year) %>% summarise(mean = mean(hwy))
    mpg %>% group_by(manufacturer, cyl) %>% summarise(mean = mean(hwy))

(This would continue until every pair of categorical columns has been covered, for example:)

    mpg %>% group_by(cyl, year) %>% summarise(mean = mean(hwy))

and so on.

My actual dataset has hundreds of categorical variables, so I would like to automate the process, for example with a for loop or purrr.

Thanks


Solution

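  • This uses purrr to pick out the character and factor columns, then combn() to build every pair of those columns and compute the grouped mean for each pair. Integer columns such as year and cyl are not character or factor columns, so this test does not select them.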

    library(ggplot2)
    library(purrr)
    library(dplyr)
    
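    # Names of the character and factor columns
    cat_cols <- names(mpg)[map_lgl(mpg, ~ is.character(.x) || is.factor(.x))]

    # One grouped summary of mean(hwy) per pair of those columns, returned as a list
    combn(cat_cols, 2, function(x) {
      mpg %>% group_by(across(all_of(x))) %>% summarise(mean = mean(hwy))
    }, simplify = FALSE)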
    

    Note that the output grows quickly: choose(100, 2) evaluates to 4,950, so 100 categorical columns already produce 4,950 summary tables.
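
With that many results, it may help to name each list element after the pair of columns it was grouped by, so that a single summary can be looked up directly. The following is only a minimal sketch of that idea, not part of the original answer. The names `extra_cols`, `summarise_pair()`, and the `"__"` separator are assumptions made for illustration, and `year` and `cyl` are added by hand because they are stored as integers in mpg and would otherwise be skipped.

```r
library(ggplot2)  # only needed for the mpg dataset
library(dplyr)
library(purrr)

# Assumption: integer columns that should also be treated as categorical
extra_cols <- c("year", "cyl")

# Character/factor columns plus the hand-picked extras, in column order
is_cat <- map_lgl(mpg, ~ is.character(.x) || is.factor(.x)) |
  names(mpg) %in% extra_cols
cat_cols <- names(mpg)[is_cat]

# Every pair of categorical columns, as a list of length-2 character vectors
pairs <- combn(cat_cols, 2, simplify = FALSE)

# Name each pair, e.g. "manufacturer__year", so results can be found by name
names(pairs) <- map_chr(pairs, ~ paste(.x, collapse = "__"))

# Hypothetical helper: mean of one value column, grouped by the given columns
summarise_pair <- function(data, cols, value = "hwy") {
  data %>%
    group_by(across(all_of(cols))) %>%
    summarise(mean = mean(.data[[value]]), .groups = "drop")
}

# map() keeps the names of `pairs`, so `results` is a named list of tibbles
results <- map(pairs, ~ summarise_pair(mpg, .x))

# Look up a single summary by the pair it was grouped by
results[["manufacturer__year"]]
```

Because the pairs are built in column order, a pair is always named with the earlier column first (here `"year__cyl"`, not `"cyl__year"`).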