Search code examples
rdataframemergetidyverseaudit

How do you retain original column for comparison in full_join() of two databases in r


I am trying to compare two data frames in order to determine discrepancies in one of the columns. To achieve this, I am using full_join() from tidyverse but I can't figure out how to retain the origin of the data frame so I can understand the discrepancy.

#Two databases that differ in Charge for the name and date or also have a entirely unique rows

df1 <- tibble(Name = c("JANE,DOE", "JANE,DOE", "JIM,DOE", "JANE,BUCK", "JIM,BUCK", "JIM,BUCK"),
Date = c("1/1/21", "1/10/21", "2/1/21", "1/2/21", "2/2/21", "2/8/21"),
Charge = c(-500, -500, -450, 0, -450, 0))

df2 <- tibble(Name = c("JANE,DOE", "JANE,DOE", "JIM,DOE", "JANE,BUCK", "JIM,BUCK", "JIM,BUCK", "JIM,BUCK"),
Date = c("1/2/21", "1/10/21", "2/1/21", "1/2/21", "2/2/21", "2/8/21", "2/10/21"),
Charge = c(-500, -500, -450, -500, -500, -500, -50))

I have tried to use combine them and then identify distinct rows

Audit <- full_join(df1,df2)
Audit <- Audit %>% distinct() %>% arrange(Name, Date)

But my output doesn't let me then contrast charges from rows where Name and Date are unique.

Name       Date    Charge
<chr>      <chr>   <dbl>
JANE,BUCK   1/2/21  0       #df2
JANE,BUCK   1/2/21  -500    #df1    
JANE,DOE    1/1/21  -500    #df1    
JANE,DOE    1/10/21 -500    #df1 & df2  
JANE,DOE    1/2/21  -500    #df2    
JIM,BUCK    2/10/21 -50     #df2
JIM,BUCK    2/2/21  -450    #df1    
JIM,BUCK    2/2/21  -500    #df2    
JIM,BUCK    2/8/21  0       #df1
JIM,BUCK    2/8/21  -500    #df2    
JIM,DOE     2/1/21  -450    #df1

What I am ultimately trying to achieve is an audit the generates output like this

Name       Date    Charge   ChargeDiff
<chr>      <chr>   <dbl>     <dbl>
JANE,BUCK   1/2/21  0       -500 #difference in Charge when name and date are same, but Charge differs
    
JANE,DOE    1/1/21  -500      0  #unique df1 is 0 because we know it is valid

JANE,DOE    1/2/21  -500    -500 #unique in df2 is -500 because it is missing   
JIM,BUCK    2/10/21 -50     -50  #unique in df2 is -50 because it is missing
JIM,BUCK    2/2/21  -450     50  # df1-df2 on 2/2 is -450-(-500) 

JIM,BUCK    2/8/21  0       500  # df1-df2

JIM,DOE     2/1/21  -450    #df1 #unique in df1

I am having a little trouble with the step I need to get the ChangeDiff column created. Is thera a different join option that allows me to add unique rows from df2 only to Audit, but retains unique Charge values from df2 in a separate column aligned on Name and Date.


Solution

  • You were nearly there with the full_join() Using the by() argument allows you to control the join so that you can include both sets of Charge in separate columns. Use dplyr::mutate() and/or case_when() to generate your ChargeDiff column.

    library(tibble)
    library(dplyr)
    tib <- 
      df1 %>% 
      full_join(df2, by = c("Name" = "Name", "Date" = "Date")) %>% 
      mutate(ChargDiff = case_when(is.na(Charge.x) | is.na(Charge.y) ~ NA_real_,
                                   TRUE ~ Charge.x - Charge.y))
    
    tib
    #> # A tibble: 8 x 5
    #>   Name      Date    Charge.x Charge.y ChargDiff
    #>   <chr>     <chr>      <dbl>    <dbl>     <dbl>
    #> 1 JANE,DOE  1/1/21      -500       NA        NA
    #> 2 JANE,DOE  1/10/21     -500     -500         0
    #> 3 JIM,DOE   2/1/21      -450     -450         0
    #> 4 JANE,BUCK 1/2/21         0     -500       500
    #> 5 JIM,BUCK  2/2/21      -450     -500        50
    #> 6 JIM,BUCK  2/8/21         0     -500       500
    #> 7 JANE,DOE  1/2/21        NA     -500        NA
    #> 8 JIM,BUCK  2/10/21       NA      -50        NA
    

    Created on 2021-03-24 by the reprex package (v1.0.0)