I have data as follows:
input_A <- data.frame(ID = c(1,2), some_var = c("bla","more bla"))
input_B <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
2, 2), year = c(2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002,
2003, 2001, 2002, 2003, 2001, 2002, 2003), Type = c("A", "A",
"A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "C", "C", "C"
), Subtype = c(2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
Value = c(0.480513615083894, 0.909788893002047, 0.685141970365005,
0.138835747632889, 0.899508237239289, 0.535632890739584,
0.0712054637209442, 0.655905506366812, 0.694753916517691,
0.469249523993816, 0.295044859429007, 0.209906890342936,
0.193574644156237, 0.0715219759792846, 0.626529278499682)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -15L))
# A tibble: 15 × 5
ID year Type Subtype Value
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 2001 A 2 0.481
2 1 2002 A 2 0.910
3 1 2003 A 2 0.685
4 1 2001 B 1 0.139
5 1 2002 B 1 0.900
6 1 2003 B 1 0.536
7 2 2001 A 1 0.0712
8 2 2002 A 1 0.656
9 2 2003 A 1 0.695
10 2 2001 B 1 0.469
11 2 2002 B 1 0.295
12 2 2003 B 1 0.210
13 2 2001 C 2 0.194
14 2 2002 C 2 0.0715
15 2 2003 C 2 0.627
I would like to merge this data. However:
ID + year
has multiple matches, this is not an option for me.dcast
is not an option.What I would like to do, is to merge the first match by ID + year
(PREFERABLY THE ONE WITH THE HIGHEST VALUE), then merge the second match by ID + year
until there are no more matches.
I thought I could use this answer:
inputA[inputB, mult = "first", on = "ID", nomatch=0L]
remove the first matching rows from inputB
and run it again until there are no more merges.
It seems a bit messy though and I was wondering if there is a better solution. What would be the best way to go about this?
Desired output:
output <- structure(list(ID = c(1, 1, 1, 2, 2, 2), some_var = c("bla",
"bla", "bla", "more bla", "more bla", "more bla"), year = c(2001,
2002, 2003, 2001, 2002, 2003), Type_1 = c("A", "A", "A", "A",
"A", "A"), Subtype_1 = c(2, 2, 2, 1, 1, 1), Value_1 = c(0.480513615083894,
0.909788893002047, 0.685141970365005, 0.0712054637209442, 0.655905506366812,
0.694753916517691), Type_2 = c("B", "B", "B", "B", "B", "B"),
Subtype_2 = c(1, 1, 1, 1, 1, 1), Value_2 = c(0.138835747632889,
0.899508237239289, 0.535632890739584, 0.469249523993816,
0.295044859429007, 0.209906890342936), Type_3 = c(NA, NA,
NA, "C", "C", "C"), Subtype_3 = c(NA, NA, NA, 2, 2, 2), Value_3 = c(NA,
NA, NA, 0.193574644156237, 0.0715219759792846, 0.626529278499682
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
# A tibble: 6 × 12
ID some_var year Type_1 Subtype_1 Value_1 Type_2 Subtype_2 Value_2 Type_3 Subtype_3 Value_3
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 1 bla 2001 A 2 0.481 B 1 0.139 NA NA NA
2 1 bla 2002 A 2 0.910 B 1 0.900 NA NA NA
3 1 bla 2003 A 2 0.685 B 1 0.536 NA NA NA
4 2 more bla 2001 A 1 0.0712 B 1 0.469 C 2 0.194
5 2 more bla 2002 A 1 0.656 B 1 0.295 C 2 0.0715
6 2 more bla 2003 A 1 0.695 B 1 0.210 C 2 0.627
We can use pivot_wider
with left_join
library(tidyr)
library(dplyr)
library(data.table)
input_B %>%
mutate(rn = rowid(ID, year)) %>%
pivot_wider(names_from = rn, values_from = c(Type, Subtype, Value)) %>%
left_join(input_A)
-output
# A tibble: 6 × 12
ID year Type_1 Type_2 Type_3 Subtype_1 Subtype_2 Subtype_3 Value_1 Value_2 Value_3 some_var
<dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 2001 A B <NA> 2 1 NA 0.481 0.139 NA bla
2 1 2002 A B <NA> 2 1 NA 0.910 0.900 NA bla
3 1 2003 A B <NA> 2 1 NA 0.685 0.536 NA bla
4 2 2001 A B C 1 1 2 0.0712 0.469 0.194 more bla
5 2 2002 A B C 1 1 2 0.656 0.295 0.0715 more bla
6 2 2003 A B C 1 1 2 0.695 0.210 0.627 more bla