I am trying to compare two data frames in order to determine discrepancies in one of the columns. To achieve this, I am using full_join()
from tidyverse but I can't figure out how to retain the origin of the data frame so I can understand the discrepancy.
#Two databases that differ in Charge for the name and date or also have a entirely unique rows
df1 <- tibble(Name = c("JANE,DOE", "JANE,DOE", "JIM,DOE", "JANE,BUCK", "JIM,BUCK", "JIM,BUCK"),
Date = c("1/1/21", "1/10/21", "2/1/21", "1/2/21", "2/2/21", "2/8/21"),
Charge = c(-500, -500, -450, 0, -450, 0))
df2 <- tibble(Name = c("JANE,DOE", "JANE,DOE", "JIM,DOE", "JANE,BUCK", "JIM,BUCK", "JIM,BUCK", "JIM,BUCK"),
Date = c("1/2/21", "1/10/21", "2/1/21", "1/2/21", "2/2/21", "2/8/21", "2/10/21"),
Charge = c(-500, -500, -450, -500, -500, -500, -50))
I have tried to use combine them and then identify distinct rows
Audit <- full_join(df1,df2)
Audit <- Audit %>% distinct() %>% arrange(Name, Date)
But my output doesn't let me then contrast charges from rows where Name and Date are unique.
Name Date Charge
<chr> <chr> <dbl>
JANE,BUCK 1/2/21 0 #df2
JANE,BUCK 1/2/21 -500 #df1
JANE,DOE 1/1/21 -500 #df1
JANE,DOE 1/10/21 -500 #df1 & df2
JANE,DOE 1/2/21 -500 #df2
JIM,BUCK 2/10/21 -50 #df2
JIM,BUCK 2/2/21 -450 #df1
JIM,BUCK 2/2/21 -500 #df2
JIM,BUCK 2/8/21 0 #df1
JIM,BUCK 2/8/21 -500 #df2
JIM,DOE 2/1/21 -450 #df1
What I am ultimately trying to achieve is an audit the generates output like this
Name Date Charge ChargeDiff
<chr> <chr> <dbl> <dbl>
JANE,BUCK 1/2/21 0 -500 #difference in Charge when name and date are same, but Charge differs
JANE,DOE 1/1/21 -500 0 #unique df1 is 0 because we know it is valid
JANE,DOE 1/2/21 -500 -500 #unique in df2 is -500 because it is missing
JIM,BUCK 2/10/21 -50 -50 #unique in df2 is -50 because it is missing
JIM,BUCK 2/2/21 -450 50 # df1-df2 on 2/2 is -450-(-500)
JIM,BUCK 2/8/21 0 500 # df1-df2
JIM,DOE 2/1/21 -450 #df1 #unique in df1
I am having a little trouble with the step I need to get the ChangeDiff column created. Is thera a different join option that allows me to add unique rows from df2 only to Audit, but retains unique Charge values from df2 in a separate column aligned on Name and Date.
You were nearly there with the full_join()
Using the by()
argument allows you to control the join so that you can include both sets of Charge
in separate columns. Use dplyr::mutate()
and/or case_when()
to generate your ChargeDiff
column.
library(tibble)
library(dplyr)
tib <-
df1 %>%
full_join(df2, by = c("Name" = "Name", "Date" = "Date")) %>%
mutate(ChargDiff = case_when(is.na(Charge.x) | is.na(Charge.y) ~ NA_real_,
TRUE ~ Charge.x - Charge.y))
tib
#> # A tibble: 8 x 5
#> Name Date Charge.x Charge.y ChargDiff
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 JANE,DOE 1/1/21 -500 NA NA
#> 2 JANE,DOE 1/10/21 -500 -500 0
#> 3 JIM,DOE 2/1/21 -450 -450 0
#> 4 JANE,BUCK 1/2/21 0 -500 500
#> 5 JIM,BUCK 2/2/21 -450 -500 50
#> 6 JIM,BUCK 2/8/21 0 -500 500
#> 7 JANE,DOE 1/2/21 NA -500 NA
#> 8 JIM,BUCK 2/10/21 NA -50 NA
Created on 2021-03-24 by the reprex package (v1.0.0)