Search code examples
rmergecbinddcast

Concatenating matches in a merge with multiple matches


I have data as follows:

input_A <- data.frame(ID = c(1,2), some_var = c("bla","more bla"))

input_B <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 
2, 2), year = c(2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 
2003, 2001, 2002, 2003, 2001, 2002, 2003), Type = c("A", "A", 
"A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "C", "C", "C"
), Subtype = c(2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2), 
    Value = c(0.480513615083894, 0.909788893002047, 0.685141970365005, 
    0.138835747632889, 0.899508237239289, 0.535632890739584, 
    0.0712054637209442, 0.655905506366812, 0.694753916517691, 
    0.469249523993816, 0.295044859429007, 0.209906890342936, 
    0.193574644156237, 0.0715219759792846, 0.626529278499682)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -15L))

# A tibble: 15 × 5
      ID  year Type  Subtype  Value
   <dbl> <dbl> <chr>   <dbl>  <dbl>
 1     1  2001 A           2 0.481 
 2     1  2002 A           2 0.910 
 3     1  2003 A           2 0.685 
 4     1  2001 B           1 0.139 
 5     1  2002 B           1 0.900 
 6     1  2003 B           1 0.536 
 7     2  2001 A           1 0.0712
 8     2  2002 A           1 0.656 
 9     2  2003 A           1 0.695 
10     2  2001 B           1 0.469 
11     2  2002 B           1 0.295 
12     2  2003 B           1 0.210 
13     2  2001 C           2 0.194 
14     2  2002 C           2 0.0715
15     2  2003 C           2 0.627

I would like to merge this data. However:

  • Since the combination of ID + year has multiple matches, this is not an option for me.
  • Since the combinations of type and subtype in my actual data are really large, also dcast is not an option.

What I would like to do, is to merge the first match by ID + year (PREFERABLY THE ONE WITH THE HIGHEST VALUE), then merge the second match by ID + year until there are no more matches.

I thought I could use this answer:

inputA[inputB, mult = "first", on = "ID", nomatch=0L]

remove the first matching rows from inputB and run it again until there are no more merges.

It seems a bit messy though and I was wondering if there is a better solution. What would be the best way to go about this?

Desired output:

output <- structure(list(ID = c(1, 1, 1, 2, 2, 2), some_var = c("bla", 
"bla", "bla", "more bla", "more bla", "more bla"), year = c(2001, 
2002, 2003, 2001, 2002, 2003), Type_1 = c("A", "A", "A", "A", 
"A", "A"), Subtype_1 = c(2, 2, 2, 1, 1, 1), Value_1 = c(0.480513615083894, 
0.909788893002047, 0.685141970365005, 0.0712054637209442, 0.655905506366812, 
0.694753916517691), Type_2 = c("B", "B", "B", "B", "B", "B"), 
    Subtype_2 = c(1, 1, 1, 1, 1, 1), Value_2 = c(0.138835747632889, 
    0.899508237239289, 0.535632890739584, 0.469249523993816, 
    0.295044859429007, 0.209906890342936), Type_3 = c(NA, NA, 
    NA, "C", "C", "C"), Subtype_3 = c(NA, NA, NA, 2, 2, 2), Value_3 = c(NA, 
    NA, NA, 0.193574644156237, 0.0715219759792846, 0.626529278499682
    )), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L))

# A tibble: 6 × 12
     ID some_var  year Type_1 Subtype_1 Value_1 Type_2 Subtype_2 Value_2 Type_3 Subtype_3 Value_3
  <dbl> <chr>    <dbl> <chr>      <dbl>   <dbl> <chr>      <dbl>   <dbl> <chr>      <dbl>   <dbl>
1     1 bla       2001 A              2  0.481  B              1   0.139 NA            NA NA     
2     1 bla       2002 A              2  0.910  B              1   0.900 NA            NA NA     
3     1 bla       2003 A              2  0.685  B              1   0.536 NA            NA NA     
4     2 more bla  2001 A              1  0.0712 B              1   0.469 C              2  0.194 
5     2 more bla  2002 A              1  0.656  B              1   0.295 C              2  0.0715
6     2 more bla  2003 A              1  0.695  B              1   0.210 C              2  0.627 

Solution

  • We can use pivot_wider with left_join

    library(tidyr)
    library(dplyr)
    library(data.table)
    input_B %>% 
      mutate(rn = rowid(ID, year)) %>%
      pivot_wider(names_from = rn, values_from = c(Type, Subtype, Value)) %>%
      left_join(input_A)
    

    -output

    # A tibble: 6 × 12
         ID  year Type_1 Type_2 Type_3 Subtype_1 Subtype_2 Subtype_3 Value_1 Value_2 Value_3 some_var
      <dbl> <dbl> <chr>  <chr>  <chr>      <dbl>     <dbl>     <dbl>   <dbl>   <dbl>   <dbl> <chr>   
    1     1  2001 A      B      <NA>           2         1        NA  0.481    0.139 NA      bla     
    2     1  2002 A      B      <NA>           2         1        NA  0.910    0.900 NA      bla     
    3     1  2003 A      B      <NA>           2         1        NA  0.685    0.536 NA      bla     
    4     2  2001 A      B      C              1         1         2  0.0712   0.469  0.194  more bla
    5     2  2002 A      B      C              1         1         2  0.656    0.295  0.0715 more bla
    6     2  2003 A      B      C              1         1         2  0.695    0.210  0.627  more bla