Search code examples
rdataframedplyrfuzzyjoinanti-join

Return anti-join of two data frames with values outside a certain percentage difference


I would like to compare two mixed-type data frames and return the rows that are different between them--but I would like numeric values to only be returned within a certain percentage.

tbl1 <- tibble(var1 = c('r1', 'r2', 'r3', 'r4', 'r5'),
               var2 = c('apple', 'orange', 'banana', 'strawberry', 'lime'),
               var3 = c(1, 2, 3, 4, 5),
               var4 = c('yes', 'no', 'yes', 'yes', 'no'))

tbl2 <- tibble(var1 = c('r6', 'r7', 'r8', 'r9', 'r10'),
               var2 = c('orange', 'banana', 'apple', 'lemon', 'strawberry'),
               var3 = c(2, 3, 1.5, 10, 4.1),
               var4 = c('no', 'yes', 'yes', 'no', 'yes'))

I know there is dplyr::anti_join but that checks for exact matches. So if I was OK with numeric values that were within 20%, then the function would be something like:

tbl1 %>%
  antijoin_function(tbl2, by = c('var2' = 'var2', 'var3' = 'var3', 'var4' = 'var4'),
                    pct = 0.2)

And return

var1 var2 var3 var4
r1 apple 1 yes
r5 lime 5 no

The row with strawberry would not be returned because the single difference in var3 is less than 20%.

Are there any functions or packages that do this?


Solution

  • library(dplyr)
    
    full_join(tbl1, tbl2, by = c("var2" = "var2"), suffix = c("", ".right")) %>% 
      filter(abs(var3 - var3.right)/var3 > 0.2 | if_all(contains(".right"), ~ is.na(.))) %>% 
      select(-contains(".right"))
    
    #> # A tibble: 2 × 4
    #>   var1  var2   var3 var4 
    #>   <chr> <chr> <dbl> <chr>
    #> 1 r1    apple     1 yes  
    #> 2 r5    lime      5 no
    

    Created on 2023-05-22 with reprex v2.0.2