Search code examples
rdplyr

Compare Two DataFrames and Replace Differences in One with the Values from the Other


Suppose I have two dataframes:

df1 <- data.frame(
    Name = c("Alice", "Bob", "Charlie"),
    Age = c(25, 30, 22),
    City = c("New York", "San Francisco", "Los Angeles"),
    Score = c(95, 88, 92)
)

df2 <- data.frame(
    Name = c("Alice", "Bob", "Charlie"),
    Age = c(25, 30, 22),
    City = c("New York", "San Francisco", "Los Angeles"),
    Score = c(95, 100, 92)
)

And I've noticed that Bob has changed his score (come on Bob!).

I want to find the differences between the two dataframes and when there is a difference, keep the values across all possibly different columns from the 1st dataframe.

I'm able to find the differences:

differences <- df1 != df2

But I'm unsure where to go from here. Ultimately, I need a dataframe where anything that is the same (or FALSE in the differences df) is NA and only the things that are different are in the dataframe. It would look something like:

fixed_df <- data.frame(
    Name = c(NA_character_, NA_character_, NA_character_),
    Age = c(NA_real_, NA_real_, NA_real_),
    City = c(NA_character_, NA_character_, NA_character_),
    Score = c(NA_real_, 100, NA_real_)
)

Thanks for your help!


Solution

  • use the forgotten function is.na<-

    `is.na<-`(df2, df1 == df2)
      Name Age City Score
    1 <NA>  NA <NA>    NA
    2 <NA>  NA <NA>   100
    3 <NA>  NA <NA>    NA
    

    you could use is.na but does an inplace replacement. Therefore we need a temp dataset so as to maintain the original:

    temp <- df2
    is.na(temp) <- df2 == df1
    temp
    
      Name Age City Score
    1 <NA>  NA <NA>    NA
    2 <NA>  NA <NA>   100
    3 <NA>  NA <NA>    NA