Search code examples
rdplyrdata.tabletidyversetidyr

How to aggregate a data frame based on the max value of the group in R


I have a large data with many groups that looks like this. I want in each group to use the fruit with the most counts as the central fruit, and aggregate the other fruits based on it!

library(tidyverse)

df <- tibble(col1 = c("apple","apple","pple", "banana", "banana","bananna"),
             col2 = c("pple","app","app", "bananna", "banan", "banan"), 
             counts_col1 = c(100,100,2,200,200,2),
             counts_col2 = c(2,50,50,2,20,20),
             id=c(1,1,1,2,2,2))

df
#> # A tibble: 6 × 5
#>   col1    col2    counts_col1 counts_col2    id
#>   <chr>   <chr>         <dbl>       <dbl> <dbl>
#> 1 apple   pple            100           2     1
#> 2 apple   app             100          50     1
#> 3 pple    app               2          50     1
#> 4 banana  bananna         200           2     2
#> 5 banana  banan           200          20     2
#> 6 bananna banan             2          20     2

Created on 2022-03-16 by the reprex package (v2.0.1)

I want my data frame to look like this

id  central_fruit   fruits                 counts     sum_counts
 1     apple        apple,pple,app         100,50,2        152
 2    banana        banana,bananna,banan   200,20,2        222

The format of the output it does not have to be like this. This is just an example. It can be a list of characters or just characters. Any help or guidance is appreciated


Solution

  • We may do this by first reshaping to 'long' format (pivot_longer), grouped by 'id', 'grp', create a frequency count (add_count), then summarise the 'central_fruit' which had the max frequency by 'id', and similarly paste (toString) the unique fruit, and unique count along with sum of unique count

    library(dplyr)
    library(stringr)
    library(tidyr)
    df %>%
       rename_with(~ str_c("fruit_", .x), starts_with('col')) %>% 
       pivot_longer(cols = -id, names_to = c(".value", "grp"), 
         names_pattern = "(.*)_(col\\d+)") %>% 
       group_by(id, grp) %>%
       add_count(fruit) %>%
       group_by(id) %>% 
       summarise(central_fruit = fruit[which.max(n)], 
          fruits = toString(unique(fruit)), 
          sum_counts = sum(unique(counts)),
          counts = toString(sort(unique(counts), decreasing = TRUE)),
            .groups = 'drop' ) %>%
         relocate(counts, .before = 'sum_counts')
    

    -output

    # A tibble: 2 × 5
         id central_fruit fruits                 counts     sum_counts
      <dbl> <chr>         <chr>                  <chr>           <dbl>
    1     1 apple         apple, pple, app       100, 50, 2        152
    2     2 banana        banana, bananna, banan 200, 20, 2        222
    

    NOTE: It may be better to wrap the values of 'counts' in a list instead of pasteing. i.e. instead of counts = toString(sort(unique(counts), decreasing = TRUE)), it would be counts = list(sort(unique(counts), decreasing = TRUE))