Search code examples
rdataframedplyrcombinations

Create combinations by group and sum


I have data of names within an ID number along with a number of associated values. It looks something like this:

structure(list(id = c("a", "a", "b", "b"), name = c("bob", "jane", 
"mark", "brittney"), number = c(1L, 2L, 1L, 2L), value = c(1L, 
2L, 1L, 2L)), class = "data.frame", row.names = c(NA, -4L))

#   id     name number value
# 1  a      bob      1     1
# 2  a     jane      2     2
# 3  b     mark      1     1
# 4  b brittney      2     2

I would like to create all the combinations of name, regardless of how many there are, and paste them together separated with commas, and sum their number and value within each id. The desired output from the example above is then:

structure(list(id = c("a", "a", "a", "b", "b", "b"), name = c("bob", 
"jane", "bob, jane", "mark", "brittney", "mark, brittney"), number = c(1L, 
2L, 3L, 1L, 2L, 3L), value = c(1L, 2L, 3L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA, -6L))

#   id           name number value
# 1  a            bob      1     1
# 2  a           jane      2     2
# 3  a      bob, jane      3     3
# 4  b           mark      1     1
# 5  b       brittney      2     2
# 6  b mark, brittney      3     3

Thanks all!


Solution

  • You can create pairwise indices using combn() and expand the data frame with these using slice(). Then just group by these row pairs and summarise. I'm assuming you want pairwise combinations but this can be adapted for larger sets if needed. Some code to handle groups < 2 is included but can be removed if these don't exist in your data.

    library(dplyr)
    library(purrr)
    
    df1 %>%
      group_by(id) %>%
      slice(c(combn(seq(n()), min(n(), 2)))) %>%
      mutate(id2 = (row_number()-1) %/% 2) %>%
      group_by(id, id2) %>%
      summarise(name = toString(name),
                across(where(is.numeric), sum), .groups = "drop") %>%
      select(-id2) %>%
      bind_rows(df1 %>%
                  group_by(id) %>%
                  filter(n() > 1), .) %>%
      arrange(id) %>%
      ungroup()
    
    # A tibble: 6 × 4
      id    name           number value
      <chr> <chr>           <int> <int>
    1 a     bob                 1     1
    2 a     jane                2     2
    3 a     bob, jane           3     3
    4 b     mark                1     1
    5 b     brittney            2     2
    6 b     mark, brittney      3     3
    

    Edit:

    To adapt for all possible combinations you can iterate over the values up to the max group size. Using edited data which has a couple of rows added to the first group:

    map_df(seq(max(table(df2$id))), ~
             df2 %>%
             group_by(id) %>%
             slice(c(combn(seq(n()), .x * (.x <= n())))) %>%
             mutate(id2 = (row_number() - 1) %/% .x) %>%
             group_by(id, id2) %>%
             summarise(name = toString(name),
                       across(where(is.numeric), sum), .groups = "drop")
           ) %>%
      select(-id2) %>%
      arrange(id)
    
    # A tibble: 18 × 4
       id    name                      number value
       <chr> <chr>                      <int> <int>
     1 a     bob                            1     1
     2 a     jane                           2     2
     3 a     sophie                         1     1
     4 a     jeremy                         2     2
     5 a     bob, jane                      3     3
     6 a     bob, sophie                    2     2
     7 a     bob, jeremy                    3     3
     8 a     jane, sophie                   3     3
     9 a     jane, jeremy                   4     4
    10 a     sophie, jeremy                 3     3
    11 a     bob, jane, sophie              4     4
    12 a     bob, jane, jeremy              5     5
    13 a     bob, sophie, jeremy            4     4
    14 a     jane, sophie, jeremy           5     5
    15 a     bob, jane, sophie, jeremy      6     6
    16 b     mark                           3     5
    17 b     brittney                       4     6
    18 b     mark, brittney                 7    11
    

    Data for df2:

    df2 <- structure(list(id = c("a", "a", "a", "a", "b", "b"), name = c("bob", 
                                                                         "jane", "sophie", "jeremy", "mark", "brittney"), number = c(1L, 
                                                                                                                                     2L, 1L, 2L, 3L, 4L), value = c(1L, 2L, 1L, 2L, 5L, 6L)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                  -6L))