Search code examples
rdplyrgroup-bysummary

Group by 2 data elements and then provide a count of values within an additional column for each group


Hi I have the below dataset and code.

 df <- 
  read.table(textConnection("Area   CustomerAge CustomersInGroup
    A 28 1
    A 40 3
    A 70 2
    A 19 4
    B 13 3
    B 12 2
    B 72 1
    B 90 7"), header=TRUE)


df2 <- df %>% 
  mutate(
    # Create categories
    Customer_Age_Group = dplyr::case_when(
      CustomerAge <= 18            ~ "0-18",
      CustomerAge > 18 & CustomerAge <= 60 ~ "19-60",
      CustomerAge > 60             ~ ">60"
    ))

df2 %>% group_by(Area, Customer_Age_Group,.drop = FALSE) %>% 
  count() %>% 
  ungroup() %>% 
  complete(Area, Customer_Age_Group, fill=list(n=0)) %>% 
  arrange(Area, parse_number(Customer_Age_Group))

Problem is that it provides a final count of applicable rows per group - I would like to amend this so instead the final count is of the values of 'CustomersInGroup' - e.g. for group A 19-60 this would sum to 8 (the values of the 3 rows) rather than the number of 3 representing the number of rows. Any help very much appreciated.

enter image description here

The final output should look like the below - totals of CustomersInGroup grouped by Area and then CustomerAge

Area CustomerAge Totals
A 0-18 0
A 19-60 8
A >60 2
B 0-18 5
B 19-60 0
B >60 8

Solution

  • You could use summarise with sum to get the totals of CustomersInGroup. You can use the following code:

    library(dplyr)
    library(tidyr)
    library(readr)
    
    df2 <- df %>% 
      mutate(
        # Create categories
        Customer_Age_Group = dplyr::case_when(
          CustomerAge <= 18            ~ "0-18",
          CustomerAge > 18 & CustomerAge <= 60 ~ "19-60",
          CustomerAge > 60             ~ ">60"
        ))
    
    df2 %>% 
      group_by(Area, Customer_Age_Group, .drop = FALSE) %>% 
      summarise(Totals = sum(CustomersInGroup)) %>%
      ungroup() %>% 
      complete(Area, Customer_Age_Group, fill=list(Totals=0)) %>% 
      arrange(Area, parse_number(Customer_Age_Group))
    #> `summarise()` has grouped output by 'Area'. You can override using the
    #> `.groups` argument.
    #> # A tibble: 6 × 3
    #>   Area  Customer_Age_Group Totals
    #>   <chr> <chr>               <int>
    #> 1 A     0-18                    0
    #> 2 A     19-60                   8
    #> 3 A     >60                     2
    #> 4 B     0-18                    5
    #> 5 B     19-60                   0
    #> 6 B     >60                     8
    

    Created on 2022-10-09 with reprex v2.0.2