Tags: r, dataframe, dplyr, pattern-matching

Finding a close match from data frame 1 in data frame 2


I have two data frames, and for each row in data frame 1, I am trying to find its closest match in data frame 2 based on a few important criteria (shift, age, level). As an example, imagine I have data frame 1:

shift_1 <- c(1,1,0,2)
length_1 <- c(100,120,5,70)
level_1<- c(1,3,5,4)
age_1 <- c(4.5,3.2,3,2.5)

df_1 <- data.frame(shift_1,level_1,age_1,length_1)

  shift_1 level_1 age_1 length_1
1       1       1   4.5      100
2       1       3   3.2      120
3       0       5   3.0        5
4       2       4   2.5       70

For each row of this data frame, I want to find the closest match in data frame 2, shown below:

shift_2 <- c(1,1,2,1,0)
length_2 <- c(100,200,40,180,10)
level_2<- c(3,4,4,3,5)
age_2 <- c(2.5,5.5,2.2,3.1,5)

df_2 <- data.frame(shift_2,level_2,age_2,length_2)

  shift_2 level_2 age_2 length_2
1       1       3   2.5      100
2       1       4   5.5      200
3       2       4   2.2       40
4       1       3   3.1      180
5       0       5   5.0       10

The matching criteria are: shift must match exactly, level must match exactly, and age is accepted within a 20% difference.

If a match is found, we want to add the index number of the matched row and its match information; otherwise we put NA. So the expected result looks like this:

  shift_r level_r age_r length_r index shift_match level_match age_match length_match
1       1       1   4.5      100    NA          NA          NA        NA           NA
2       1       3   3.2      120     4           1           3       3.1          180
3       0       5   3.0        5    NA          NA          NA        NA           NA
4       2       4   2.5       70     3           2           4       2.2           40

Can you please advise how I should approach this? Is there a library that makes this task easier?


Solution

  • You need a "non-equi" or "range" join. This is implemented in the fuzzyjoin and data.table packages for R. Since it is also supported in SQL, one can also use sqldf.

    Sadly, dplyr does not support this natively. Since the operation is supported in SQL, if your data are in a database then dbplyr would allow it using its sql_on argument, just not through the usual by= syntax.
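
    A rough sketch of that dbplyr route (my assumption of the setup, not part of the original answer): copy both frames into an in-memory SQLite database (requires the dbplyr and RSQLite packages) and write the non-equi condition by hand with sql_on, where dbplyr uses LHS and RHS as the table aliases:

    library(dplyr)
    con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    tbl_1 <- copy_to(con, df_1)   # lazy tables in the SQLite database
    tbl_2 <- copy_to(con, df_2)

    left_join(
      tbl_1, tbl_2,
      sql_on = "LHS.shift_1 = RHS.shift_2 AND LHS.level_1 = RHS.level_2
                AND RHS.age_2 BETWEEN LHS.age_1 * 0.8 AND LHS.age_1 * 1.2"
    ) %>%
      collect()   # bring the joined result back into R

    DBI::dbDisconnect(con)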

    Edit: adding dplyr, with its recent (version 1.1.0) addition of join_by.

    First, we need to add in the 20% tolerance:

    df_1$age_1_start <- df_1$age_1 * 0.8
    df_1$age_1_end <- df_1$age_1 * 1.2
    df_1
    #   shift_1 level_1 age_1 length_1 age_1_start age_1_end
    # 1       1       1   4.5      100        3.60      5.40
    # 2       1       3   3.2      120        2.56      3.84
    # 3       0       5   3.0        5        2.40      3.60
    # 4       2       4   2.5       70        2.00      3.00
    

    fuzzyjoin

    fuzzyjoin::fuzzy_left_join(
      df_1, df_2,
      by = c("shift_1" = "shift_2", "level_1" = "level_2",
             "age_1_start" = "age_2", "age_1_end" = "age_2"),
      match_fun = list(`==`, `==`, `<=`, `>=`))
    #   shift_1 level_1 age_1 length_1 age_1_start age_1_end shift_2 level_2 age_2 length_2
    # 1       1       1   4.5      100        3.60      5.40      NA      NA    NA       NA
    # 2       1       3   3.2      120        2.56      3.84       1       3   3.1      180
    # 3       0       5   3.0        5        2.40      3.60      NA      NA    NA       NA
    # 4       2       4   2.5       70        2.00      3.00       2       4   2.2       40
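
    The question also asks for the index of the matched df_2 row. One simple way to get it (my addition, not part of the original answer) is to join against a copy of df_2 that carries a row-number column; fuzzy_left_join then passes it through, with NA where nothing matched:

    df_2i <- transform(df_2, index = seq_len(nrow(df_2)))  # row number of each candidate row

    fuzzyjoin::fuzzy_left_join(
      df_1, df_2i,
      by = c("shift_1" = "shift_2", "level_1" = "level_2",
             "age_1_start" = "age_2", "age_1_end" = "age_2"),
      match_fun = list(`==`, `==`, `<=`, `>=`))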
    

    data.table

    library(data.table)
    DT_1 <- as.data.table(df_1) # must include age_1_start and age_1_end from above
    DT_2 <- as.data.table(df_2)
    
    DT_2[DT_1, on = .(shift_2 == shift_1, level_2 == level_1, age_2 >= age_1_start, age_2 <= age_1_end)]
    #    shift_2 level_2 age_2 length_2 age_2.1 age_1 length_1
    # 1:       1       1  3.60       NA    5.40   4.5      100
    # 2:       1       3  2.56      180    3.84   3.2      120
    # 3:       0       5  2.40       NA    3.60   3.0        5
    # 4:       2       4  2.00       40    3.00   2.5       70
    

    In this form, data.table labels the join columns with DT_2's names (shift_2, age_2, ...) even though they hold DT_1's values, which can be confusing; you will need to do some cleanup afterwards.
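
    One possible cleanup (my sketch, not part of the original answer) uses data.table's x. and i. prefixes in j to pull each value from the intended side, and a row-number column on a copy of DT_2 to recover the matched index:

    DT_2i <- copy(DT_2)[, index := .I]  # row number of each candidate row in df_2

    DT_2i[DT_1,
          on = .(shift_2 == shift_1, level_2 == level_1,
                 age_2 >= age_1_start, age_2 <= age_1_end),
          .(shift_r = i.shift_1, level_r = i.level_1,
            age_r = i.age_1, length_r = i.length_1,
            index = x.index,
            shift_match = x.shift_2, level_match = x.level_2,
            age_match = x.age_2, length_match = x.length_2)]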

    sqldf

    sqldf::sqldf(
      "select t1.*, t2.*
       from df_1 t1
         left join df_2 t2 on t1.shift_1 = t2.shift_2 and t1.level_1 = t2.level_2
           and t1.age_1_start <= t2.age_2 and t1.age_1_end >= t2.age_2")
    #   shift_1 level_1 age_1 length_1 age_1_start age_1_end shift_2 level_2 age_2 length_2
    # 1       1       1   4.5      100        3.60      5.40      NA      NA    NA       NA
    # 2       1       3   3.2      120        2.56      3.84       1       3   3.1      180
    # 3       0       5   3.0        5        2.40      3.60      NA      NA    NA       NA
    # 4       2       4   2.5       70        2.00      3.00       2       4   2.2       40
    

    dplyr

    library(dplyr)
    df_1 %>%
      mutate(
        age_1_start = age_1 * 0.8,
        age_1_end = age_1 * 1.2
      ) %>%
      left_join(df_2, join_by(shift_1 == shift_2, level_1 == level_2, 
                              age_1_start <= age_2, age_1_end >= age_2))
    #   shift_1 level_1 age_1 length_1 age_1_start age_1_end age_2 length_2
    # 1       1       1   4.5      100        3.60      5.40    NA       NA
    # 2       1       3   3.2      120        2.56      3.84   3.1      180
    # 3       0       5   3.0        5        2.40      3.60    NA       NA
    # 4       2       4   2.5       70        2.00      3.00   2.2       40
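
    To reach the exact shape the question expects (index plus the *_match columns), one possible finishing step (again my sketch, not part of the original answer) is to join against a copy of df_2 that carries a row number, keep both sides' join keys with keep = TRUE, and rename in a final select():

    df_2_idx <- mutate(df_2, index = row_number())  # row number of each candidate row

    df_1 %>%
      mutate(age_1_start = age_1 * 0.8, age_1_end = age_1 * 1.2) %>%
      left_join(df_2_idx,
                join_by(shift_1 == shift_2, level_1 == level_2,
                        age_1_start <= age_2, age_1_end >= age_2),
                keep = TRUE) %>%
      select(shift_r = shift_1, level_r = level_1, age_r = age_1, length_r = length_1,
             index,
             shift_match = shift_2, level_match = level_2,
             age_match = age_2, length_match = length_2)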
    

    If you know SQL, then the sqldf version might be the most intuitive and easiest to absorb. Keep in mind, though, that for larger frames it copies the entire frame into an in-memory SQLite instance ... which is not "free".

    The fuzzyjoin implementation gives you a lot of power, and its arguments seem (to me) easy to follow. The results are named as I would expect. However, it is the slowest (with this data) of the three benchmarked implementations. (This should only be a concern if your real data is "very" large.)

    Despite its blazing speed, data.table's dialect of R can be a bit obscure if you don't already know it. I believe it has as much power as fuzzyjoin, though I haven't tested all the corner cases to see if one supports something the other does not.

    bench::mark(
      fuzzyjoin = fuzzyjoin::fuzzy_left_join(
        df_1, df_2,
        by = c("shift_1" = "shift_2", "level_1" = "level_2",
               "age_1_start" = "age_2", "age_1_end" = "age_2"),
        match_fun = list(`==`, `==`, `<=`, `>=`)),
      data.table = DT_2[DT_1, on = .(shift_2 == shift_1, level_2 == level_1, age_2 >= age_1_start, age_2 <= age_1_end)],
      sqldf = sqldf::sqldf(
        "select t1.*, t2.*
         from df_1 t1
           left join df_2 t2 on t1.shift_1 = t2.shift_2 and t1.level_1 = t2.level_2
             and t1.age_1_start <= t2.age_2 and t1.age_1_end >= t2.age_2"),
      check = FALSE
    )
    # # A tibble: 3 x 13
    #   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory        time      gc       
    #   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>        <list>    <list>   
    # 1 fuzzyjoin  134.12ms 143.24ms      6.98     107KB     6.98     2     2      286ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
    # 2 data.table   2.14ms   2.63ms    335.       114KB     2.06   163     1      487ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
    # 3 sqldf       21.14ms  22.72ms     42.9      184KB     4.52    19     2      442ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~