Search code examples
rdataframedplyrmergemerging-data

Merging an R Data.Frame based on different columns depending on row


I have two data frames that look like: Df1: `

    ID  ID_T1   ID_T2   ID_T3   ID_T4
1   101 NA  NA  NA  NA
2   102 11111   11111   11111   11111
3   103 22222   22222   NA  NA
4   104 NA  NA  NA  NA
5   105 44444   44445   NA  NA

Note. NAs included and IDs sometimes differ across rows.

Df2 (or better a vector)

    ID_T5
1   44445
2   11111
3   22222
4   99945

Note. Newly assigned IDs present.

I need the following output

    ID  ID_T1   ID_T2   ID_T3   ID_T4   ID_T5
1   101 NA  NA  NA  NA  NA
2   102 11111   11111   11111   11111   11111
3   103 22222   22222   NA  NA  22222
4   104 NA  NA  NA  NA  NA
5   105 44444   44445   NA  NA  44445
6   NA  NA  NA  NA  NA  99945

So the package needs to check across ID_T1—ID_T4 whether the value is already existing and add it as a separate column and add additional rows if the value is not present in the df1.

I would really appreciate your help.

I have already spent hours with dplyr and messed around with filter and mutate, without being successful.


Solution

  • A bit clumsy, but it works:

    library(dplyr)
    
    Df3 <- cross_join(Df1, Df2) |>
      filter(ID_T5==ID_T1 | ID_T5==ID_T2 | ID_T5==ID_T3 | ID_T5==ID_T4)
    
    bind_rows(Df3, anti_join(Df1, Df3)) |>
      full_join(Df2) |>
      arrange(ID)
    

       ID ID_T1 ID_T2 ID_T3 ID_T4 ID_T5
    1 101    NA    NA    NA    NA    NA
    2 102 11111 11111 11111 11111 11111
    3 103 22222 22222    NA    NA 22222
    4 104    NA    NA    NA    NA    NA
    5 105 44444 44445    NA    NA 44445
    6  NA    NA    NA    NA    NA 99945
    

    Data:

    Df1:
    structure(list(ID = 101:105, ID_T1 = c(NA, 11111, 22222, NA, 
    44444), ID_T2 = c(NA, 11111, 22222, NA, 44445), ID_T3 = c(NA, 
    11111, NA, NA, NA), ID_T4 = c(NA, 11111, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
    -5L))
    
    Df2:
    structure(list(ID_T5 = c(44445, 11111, 22222, 99945)), class = "data.frame", row.names = c(NA, 
    -4L))