Search code examples
rjoindplyr

Joining two data frames to replace values in one with another


I have two data frames that look like these:

df1 <- data.frame("Year" = c(1991, 1997, 2002, 2009, 2016, 2024), 
              "Name" = c("Amy", "Ben", "Chad", "Dan", "Emily", "Frank"), 
              "Gender" = c("F", "M", "M", "M", "F", "M"))
df2 <- data.frame("Year" = c(1997, 1997, 1997, 2009, 2009, 2016, 2016), 
              "Name" = c("Ben", "Ben", "Ben", "Dan", "Dan", "Emily", "Emily"), 
              "Gender" = c("M", "M", "M", "M", "M", "F", "F"), 
              "Year_new" = c(1998, 2002, 2004, 2009, 2010, 2017, 2020), 
              "Name_new" = c("Bella", "Brandon", "Brittany", "David", "Dylan", "Emma", "Ethan"), 
              "Gender_new" = c("F", "M", "F", "M", "M", "F", "M"))

I want to merge df1 and df2 to replace Year, Name and Gender that overlap between df1 and df2 into Year_new, Name_new and Gender_new, so that the result will look like this:

df3 <- data.frame("Year" = c(1991, 1998, 2002, 2004, 2002, 2009, 2010, 2017, 2020, 2024), 
              "Name" = c("Amy", "Bella", "Brandon", "Brittany", "Chad", "David", "Dylan", "Emma", "Ethan", "Frank"), 
              "Gender" = c("F", "F", "M", "F", "M", "M", "M", "F", "M", "M"))

I tried doing this by using full_join, coalesce and rename like below.

df3 <- full_join(df1, df2) %>% 
  mutate("Year_new" = coalesce(Year, Year_new), 
             "Name_new" = coalesce(Name, Name_new), 
             "Gender_new" = coalesce(Gender, Gender_new)) %>% 
  select(Year_new:Gender_new) %>% 
  rename(Year = Year_new, Name = Name_new, Gender = Gender_new)

But this way, coalesce is overwriting the values I'd like to keep from df2. I also feel like this might not be the easiest/most efficient way. Is there a function other than full_join or coalesce that would work in this case?


Solution

  • Bind what's only in df1 with what's new in df2 (properly renamed):

    library(dplyr)
    
    # You don't really need this "column-selection" helper but..
    aux_cols <- c("Year", "Name", "Gender")
    
    # Bind what's only in `df1`..
    # ..with what´s "new" in `df2` (renamed withou "_new) 
    new_df <- bind_rows(
      anti_join(df1, df2, by = aux_cols),
    
      transmute(
        semi_join(df2, df1, by = aux_cols),
        across(ends_with("_new"), .names = "{str_remove({.col}, '_new')}")))
    

    The output:

    # `arrange` the output by `Year and `Name` 
    > arrange(new_df, Year, Name)
       Year     Name Gender
    1  1991      Amy      F
    2  1998    Bella      F
    3  2002  Brandon      M
    4  2002     Chad      M
    5  2004 Brittany      F
    6  2009    David      M
    7  2010    Dylan      M
    8  2017     Emma      F
    9  2020    Ethan      M
    10 2024    Frank      M
    

    Edit. If you're not into tidyverse that much, you can try this:

    aux_cols <- c("Year", "Name", "Gender")
    
    df3 <- merge(df1, df2, all.x = TRUE)
    
    aux_rows <- complete.cases(df3)
    aux_cols <- c("Year", "Name", "Gender")
    
    df3[aux_rows, aux_cols] <- df3[aux_rows, paste0(aux_cols, "_new")]
    df3 <- df3[do.call(order, df3[aux_cols]), aux_cols]
    

    The output:

    > df3
       Year     Name Gender
    1  1991      Amy      F
    2  1998    Bella      F
    3  2002  Brandon      M
    5  2002     Chad      M
    4  2004 Brittany      F
    6  2009    David      M
    7  2010    Dylan      M
    8  2017     Emma      F
    9  2020    Ethan      M
    10 2024    Frank      M