Search code examples
rdplyr

Merge rows of data frame that satisfy condition (belong to a group) by column (id)


I've got a 300x5 data frame that looks like this (3 groups, 100 rows per group, ids go from 01 to 100 per group, same id means same feat2 and feat3 but always different num_to_sum):

factor/group id num_to_sum feat2 feat3
group1 01 4 ... ...
group1 02 9 ... ...
group2 01 3 ... ...
group2 02 1 ... ...
group3 01 4 ... ...
group3 02 8 ... ...

Now, I'd like to merge all rows by id but only if they belong to a certain group. For example: I want to merge group1 and group2 by id and apply the sum operator to num_to_sum, so the final data frame (output) shall look like this:

factor/group id summed_num feat2 feat3
group12 01 (4+3=)7 ... ...
group12 02 (9+1=)10 ... ...
group3 01 4 ... ...
group3 02 8 ... ...

I don't mind about the other features (feat2 and feat3); they are actually the same for each id regardless of the group, so I could drop them, I just don't want them duplicated in the final df as that would mean duplicated data and that's messy.

It'd be great if there is a solution that can be applied to not just two groups, but to n groups that shall be merged.

I've looked at stats::aggregate, but I can't figure out how to aggregate with a condition.


Solution

  • You should create a new factor that combines the groups you want to sum, then aggregate with that and the id.

    With base R:

    tab <- read.table(text = "
    factor/group    id  num_to_sum
    group1  01  4
    group1  02  9
    group2  01  3
    group2  02  1
    group3  01  4
    group3  02  8
    ", header = TRUE)
    
    tab
    #>   factor.group id num_to_sum
    #> 1       group1  1          4
    #> 2       group1  2          9
    #> 3       group2  1          3
    #> 4       group2  2          1
    #> 5       group3  1          4
    #> 6       group3  2          8
    
    grp <- factor(tab$factor.group)
    levels(grp) <- list(group12 = c("group1", "group2"), group3 = "group3")
    grp
    #> [1] group12 group12 group12 group12 group3  group3 
    #> Levels: group12 group3
    
    aggregate(num_to_sum ~ id + grp, tab, sum)
    #>   id     grp num_to_sum
    #> 1  1 group12          7
    #> 2  2 group12         10
    #> 3  1  group3          4
    #> 4  2  group3          8
    

    With forcats and dplyr:

    library(dplyr, warn.conflicts = FALSE)
    library(forcats)
    
    tab |>
      mutate(grp = fct_collapse(factor.group, group12 = c("group1", "group2"))) |>
      group_by(grp, id) |>
      summarise(num_to_sum = sum(num_to_sum))
    #> `summarise()` has grouped output by 'grp'. You can override using the `.groups`
    #> argument.
    #> # A tibble: 4 × 3
    #> # Groups:   grp [2]
    #>   grp        id num_to_sum
    #>   <fct>   <int>      <int>
    #> 1 group12     1          7
    #> 2 group12     2         10
    #> 3 group3      1          4
    #> 4 group3      2          8