
Summarize using different grouping variables in dplyr


I would like to summarize a data frame using a different grouping variable for each summary. As an example, I have three variables (x1, x2, x3). I want to group the data frame by x1 and count the observations in each group, and then do the same for x2 and x3.

I would like to accomplish this in a single pipeline, but so far the only solution I have come up with is to save a separate output for each grouping variable.

To reproduce my dataframe:

x1 <- c(0,1,1,2,2,3,3,3,4,4,5,6,6,7,8,9,9,10)
x2 <- c(0,0,1,1,0,1,2,0,0,2,1,0,3,4,2,3,0,3)
x3 <- c(0,1,0,1,2,2,1,3,4,2,4,6,3,3,6,6,9,7)
df <- data.frame(x1,x2,x3)

My expected output would look something like this, where x runs from the minimum to the maximum value found across the variables, and n_x1 to n_x3 give the number of observations that take that value when the corresponding variable is used as the grouping variable:

       x     n_x1     n_x2     n_x3
 1     0        1        7        2
 2     1        2        4        3
 3     2        2        3        3
 4     3        3        3        3
 5     4        2        1        2
 6     5        1       NA       NA
 7     6        2       NA        3
 8     7        1       NA        1
 9     8        1       NA       NA
10     9        2       NA        1
11    10        1       NA       NA

So far I have come up with summarizing and grouping by each variable individually and then joining them all together as a last step.

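library(dplyr)

# count observations per distinct value of each variable
x1_count <- df %>%
  group_by(x1) %>%
  summarise(n_x1 = n())

x2_count <- df %>%
  group_by(x2) %>%
  summarise(n_x2 = n())

x3_count <- df %>%
  group_by(x3) %>%
  summarise(n_x3 = n())

# join the three count tables on the shared value, then rename the key to x
all_count <- full_join(x1_count, x2_count,
                       by = c("x1" = "x2")) %>%
  full_join(x3_count,
            by = c("x1" = "x3")) %>%
  rename(x = x1)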

Is there a workaround that avoids creating multiple data frames and joining them afterwards? I would prefer a cleaner, more elegant solution.


Solution

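  • A simple solution using tidyr (with dplyr for the grouping and counting):

    library(dplyr)
    library(tidyr)
    df %>%
      # stack x1, x2, x3 into (variables, values) pairs
      pivot_longer(everything(), names_to = "variables", values_to = "values") %>%
      group_by(variables, values) %>%
      # count the observations for each variable/value pair
      summarize(n_x = n()) %>%
      ungroup() %>%
      # spread back out to one count column per original variable
      pivot_wider(names_from = variables, values_from = n_x)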
    
    # A tibble: 11 x 4
       values    x1    x2    x3
        <dbl> <int> <int> <int>
     1      0     1     7     2
     2      1     2     4     3
     3      2     2     3     3
     4      3     3     3     3
     5      4     2     1     2
     6      5     1    NA    NA
     7      6     2    NA     3
     8      7     1    NA     1
     9      8     1    NA    NA
    10      9     2    NA     1
    11     10     1    NA    NA
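
The output above has the right counts, but its columns are named values, x1, x2, x3 rather than the x, n_x1, n_x2, n_x3 asked for in the question. A minimal sketch of a variant that should produce those names is given below. It assumes dplyr and a reasonably recent tidyr are installed; swapping group_by/summarize for count() and adding names_prefix are choices made here for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# data from the question
x1 <- c(0,1,1,2,2,3,3,3,4,4,5,6,6,7,8,9,9,10)
x2 <- c(0,0,1,1,0,1,2,0,0,2,1,0,3,4,2,3,0,3)
x3 <- c(0,1,0,1,2,2,1,3,4,2,4,6,3,3,6,6,9,7)
df <- data.frame(x1, x2, x3)

all_count <- df %>%
  # stack the three columns into (variable, x) pairs
  pivot_longer(everything(), names_to = "variable", values_to = "x") %>%
  # count() groups, tallies, and ungroups in one step; the tally column is n
  count(variable, x) %>%
  # one column per original variable, prefixed to give n_x1, n_x2, n_x3
  pivot_wider(names_from = variable, values_from = n, names_prefix = "n_") %>%
  # make the row order explicit rather than relying on first appearance
  arrange(x)

all_count
```

Values that never occur in a given variable come out as NA, as in the expected output. If zeros are preferred, passing values_fill = 0 to pivot_wider() should replace them.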