I've got a 300x5 data frame that looks like this (3 groups, 100 rows per group, ids go from 01 to 100 per group, same id means same feat2 and feat3 but always different num_to_sum
):
factor/group | id | num_to_sum | feat2 | feat3 |
---|---|---|---|---|
group1 | 01 | 4 | ... | ... |
group1 | 02 | 9 | ... | ... |
group2 | 01 | 3 | ... | ... |
group2 | 02 | 1 | ... | ... |
group3 | 01 | 4 | ... | ... |
group3 | 02 | 8 | ... | ... |
Now, I'd like to merge all rows by id but only if they belong to a certain group. For example: I want to merge group1 and group2 by id and apply the sum operator to num_to_sum
, so the final data frame (output) shall look like this:
factor/group | id | summed_num | feat2 | feat3 |
---|---|---|---|---|
group12 | 01 | (4+3=)7 | ... | ... |
group12 | 02 | (9+1=)10 | ... | ... |
group3 | 01 | 4 | ... | ... |
group3 | 02 | 8 | ... | ... |
I don't mind about the other features (feat2 and feat3); they are actually the same for each id regardless of the group, so I could drop them, I just don't want them duplicated in the final df as that would mean duplicated data and that's messy.
It'd be great if there is a solution that can be applied to not just two groups, but to n
groups that shall be merged.
I've looked at stats::aggregate
, but I can't figure out how to aggregate with a condition.
You should create a new factor that combines the groups you want to sum, then aggregate with that and the id.
With base R:
tab <- read.table(text = "
factor/group id num_to_sum
group1 01 4
group1 02 9
group2 01 3
group2 02 1
group3 01 4
group3 02 8
", header = TRUE)
tab
#> factor.group id num_to_sum
#> 1 group1 1 4
#> 2 group1 2 9
#> 3 group2 1 3
#> 4 group2 2 1
#> 5 group3 1 4
#> 6 group3 2 8
grp <- factor(tab$factor.group)
levels(grp) <- list(group12 = c("group1", "group2"), group3 = "group3")
grp
#> [1] group12 group12 group12 group12 group3 group3
#> Levels: group12 group3
aggregate(num_to_sum ~ id + grp, tab, sum)
#> id grp num_to_sum
#> 1 1 group12 7
#> 2 2 group12 10
#> 3 1 group3 4
#> 4 2 group3 8
With forcats and dplyr:
library(dplyr, warn.conflicts = FALSE)
library(forcats)
tab |>
mutate(grp = fct_collapse(factor.group, group12 = c("group1", "group2"))) |>
group_by(grp, id) |>
summarise(num_to_sum = sum(num_to_sum))
#> `summarise()` has grouped output by 'grp'. You can override using the `.groups`
#> argument.
#> # A tibble: 4 × 3
#> # Groups: grp [2]
#> grp id num_to_sum
#> <fct> <int> <int>
#> 1 group12 1 7
#> 2 group12 2 10
#> 3 group3 1 4
#> 4 group3 2 8