Search code examples
rdplyrinner-joinfuzzyjoin

"fuzzy" inner_join in dplyr to keep both rows that do AND not exactly match


I am working with two datasets that I would like to join based not exact matches between them, but rather approximate matches. My question is similar to this OP.

Here are examples of what my two dataframes look like.

df1 is this one:

 x
 4.8
 12  
 4  
 3.5
 12.5
 18  

df2 is this one:

 x     y
 4.8   6.6
 12    1  
 4.5   1  
 3.5   0.5
 13    1.8
 15    2

I am currently using inner_join(df1, df2, by=c("x") to join the two together.

This gives me:

 x     y
 4.8   6.6
 12    1
 3.5   0.5

However, what I really want to do is join the two dfs based on these conditions:

  1. any exact matches are joined first (exactly like how inner_join() currently works)
  2. BUT, if there are no exact matches, then join to any match ± 0.5

The kind of output I am trying to get would look like this:

 x     y
 4.8   6.6
 12    1
 4     1    #the y value is from x=4.5 in df1
 4     0.5  #the y value is from x=3.5 in df1
 3.5   0.5
 12.5  1    #the y value is from x=12 in df1
 12.5  1.8  #the y value is from x=13 in df1 

I typically work in dplyr, so a dplyr solution would be appreciated. But, I am also open to other suggestions because I don't know if dplyr will be flexible enough to do a "fuzzy" join.

(I am aware of the fuzzyjoin package, but it doesn't seem to get at exactly what I am trying to do here)


Solution

  • A possible solution, with no join:

    library(tidyverse)
    
    df1 %>%
      rename(x1 = x) %>%
      crossing(df2) %>%
      mutate(diff = abs(x1-x)) %>% 
      filter(diff <= 0.5) %>% 
      group_by(x1) %>% 
      mutate(aux = any(diff == 0)) %>% 
      filter(aux*(diff == 0) | !aux) %>% 
      select(-diff, -aux) %>% 
      ungroup
    
    #> # A tibble: 7 × 3
    #>      x1     x     y
    #>   <dbl> <dbl> <dbl>
    #> 1   3.5   3.5   0.5
    #> 2   4     3.5   0.5
    #> 3   4     4.5   1  
    #> 4   4.8   4.8   6.6
    #> 5  12    12     1  
    #> 6  12.5  12     1  
    #> 7  12.5  13     1.8