Tags: r, dataframe, dplyr, pattern-matching

Finding a close match from data frame 1 in data frame 2


I have two data frames, and for each row in data frame 1, I am trying to find its closest match in data frame 2 based on a few important criteria (shift, age, level). As an example, imagine I have data frame 1:

shift_1 <- c(1,1,0,2)
length_1 <- c(100,120,5,70)
level_1<- c(1,3,5,4)
age_1 <- c(4.5,3.2,3,2.5)

df_1 <- data.frame(shift_1,level_1,age_1,length_1)

  shift_1 level_1 age_1 length_1
1       1       1   4.5      100
2       1       3   3.2      120
3       0       5   3.0        5
4       2       4   2.5       70

For each row of this data frame, I want to find the closest match in data frame 2, shown below:

shift_2 <- c(1,1,2,1,0)
length_2 <- c(100,200,40,180,10)
level_2<- c(3,4,4,3,5)
age_2 <- c(2.5,5.5,2.2,3.1,5)

df_2 <- data.frame(shift_2,level_2,age_2,length_2)

  shift_2 level_2 age_2 length_2
1       1       3   2.5      100
2       1       4   5.5      200
3       2       4   2.2       40
4       1       3   3.1      180
5       0       5   5.0       10

The matching criteria are: shift must match exactly, level must match exactly, and age is accepted within a 20% difference.

If a match is found, we want to add the index number of the matched row and its match information; otherwise we put NA. So the expected result looks like this:

  shift_r level_r age_r length_r index shift_match level_match age_match length_match
1       1       1   4.5      100    NA          NA          NA        NA           NA
2       1       3   3.2      120     4           1           3       3.1          180
3       0       5   3.0        5    NA          NA          NA        NA           NA
4       2       4   2.5       70     3           2           4       2.2           40

Can you please advise how I should approach this? Is there a library that makes this task easier?


Solution

  • You need a "non-equi" or "range" join. This is implemented in the fuzzyjoin and data.table packages for R. Since it is also supported in SQL, one can also use sqldf.

    Sadly, dplyr does not support this natively. Since the operation is supported in SQL, if your data are in a database then dbplyr would allow it using its sql_on argument, just not through the usual by= syntax.
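
    A rough sketch of that dbplyr route (my assumption of the setup, not part of the original answer): copy both frames into an in-memory SQLite database (requires the dbplyr and RSQLite packages) and write the non-equi condition by hand with sql_on, where dbplyr uses LHS and RHS as the table aliases:

    library(dplyr)
    con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    tbl_1 <- copy_to(con, df_1)   # lazy tables in the SQLite database
    tbl_2 <- copy_to(con, df_2)

    left_join(
      tbl_1, tbl_2,
      sql_on = "LHS.shift_1 = RHS.shift_2 AND LHS.level_1 = RHS.level_2
                AND RHS.age_2 BETWEEN LHS.age_1 * 0.8 AND LHS.age_1 * 1.2"
    ) %>%
      collect()   # bring the joined result back into R

    DBI::dbDisconnect(con)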

    Edit: adding dplyr, with its recent (version 1.1.0) addition of join_by.

    First, we need to add in the 20% tolerance:

    df_1$age_1_start <- df_1$age_1 * 0.8
    df_1$age_1_end <- df_1$age_1 * 1.2
    df_1
    #   shift_1 level_1 age_1 length_1 age_1_start age_1_end
    # 1       1       1   4.5      100        3.60      5.40
    # 2       1       3   3.2      120        2.56      3.84
    # 3       0       5   3.0        5        2.40      3.60
    # 4       2       4   2.5       70        2.00      3.00
    

    fuzzyjoin

    fuzzyjoin::fuzzy_left_join(
      df_1, df_2,
      by = c("shift_1" = "shift_2", "level_1" = "level_2",
             "age_1_start" = "age_2", "age_1_end" = "age_2"),
      match_fun = list(`==`, `==`, `<=`, `>=`))
    #   shift_1 level_1 age_1 length_1 age_1_start age_1_end shift_2 level_2 age_2 length_2
    # 1       1       1   4.5      100        3.60      5.40      NA      NA    NA       NA
    # 2       1       3   3.2      120        2.56      3.84       1       3   3.1      180
    # 3       0       5   3.0        5        2.40      3.60      NA      NA    NA       NA
    # 4       2       4   2.5       70        2.00      3.00       2       4   2.2       40
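
    The question also asks for the index of the matched df_2 row. One simple way to get it (my addition, not part of the original answer) is to join against a copy of df_2 that carries a row-number column; fuzzy_left_join then passes it through, with NA where nothing matched:

    df_2i <- transform(df_2, index = seq_len(nrow(df_2)))  # row number of each candidate row

    fuzzyjoin::fuzzy_left_join(
      df_1, df_2i,
      by = c("shift_1" = "shift_2", "level_1" = "level_2",
             "age_1_start" = "age_2", "age_1_end" = "age_2"),
      match_fun = list(`==`, `==`, `<=`, `>=`))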
    

    data.table

    library(data.table)
    DT_1 <- as.data.table(df_1) # must include age_1_start and age_1_end from above
    DT_2 <- as.data.table(df_2)
    
    DT_2[DT_1, on = .(shift_2 == shift_1, level_2 == level_1, age_2 >= age_1_start, age_2 <= age_1_end)]
    #    shift_2 level_2 age_2 length_2 age_2.1 age_1 length_1
    # 1:       1       1  3.60       NA    5.40   4.5      100
    # 2:       1       3  2.56      180    3.84   3.2      120
    # 3:       0       5  2.40       NA    3.60   3.0        5
    # 4:       2       4  2.00       40    3.00   2.5       70
    

    In this form, data.table labels the join columns with DT_2's names (shift_2, age_2, ...) even though they hold DT_1's values, which can be confusing; you will need to do some cleanup afterwards.
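
    One possible cleanup (my sketch, not part of the original answer) uses data.table's x. and i. prefixes in j to pull each value from the intended side, and a row-number column on a copy of DT_2 to recover the matched index:

    DT_2i <- copy(DT_2)[, index := .I]  # row number of each candidate row in df_2

    DT_2i[DT_1,
          on = .(shift_2 == shift_1, level_2 == level_1,
                 age_2 >= age_1_start, age_2 <= age_1_end),
          .(shift_r = i.shift_1, level_r = i.level_1,
            age_r = i.age_1, length_r = i.length_1,
            index = x.index,
            shift_match = x.shift_2, level_match = x.level_2,
            age_match = x.age_2, length_match = x.length_2)]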

    sqldf

    sqldf::sqldf(
      "select t1.*, t2.*
       from df_1 t1
         left join df_2 t2 on t1.shift_1 = t2.shift_2 and t1.level_1 = t2.level_2
           and t1.age_1_start <= t2.age_2 and t1.age_1_end >= t2.age_2")
    #   shift_1 level_1 age_1 length_1 age_1_start age_1_end shift_2 level_2 age_2 length_2
    # 1       1       1   4.5      100        3.60      5.40      NA      NA    NA       NA
    # 2       1       3   3.2      120        2.56      3.84       1       3   3.1      180
    # 3       0       5   3.0        5        2.40      3.60      NA      NA    NA       NA
    # 4       2       4   2.5       70        2.00      3.00       2       4   2.2       40
    

    dplyr

    library(dplyr)
    df_1 %>%
      mutate(
        age_1_start = age_1 * 0.8,
        age_1_end = age_1 * 1.2
      ) %>%
      left_join(df_2, join_by(shift_1 == shift_2, level_1 == level_2, 
                              age_1_start <= age_2, age_1_end >= age_2))
    #   shift_1 level_1 age_1 length_1 age_1_start age_1_end age_2 length_2
    # 1       1       1   4.5      100        3.60      5.40    NA       NA
    # 2       1       3   3.2      120        2.56      3.84   3.1      180
    # 3       0       5   3.0        5        2.40      3.60    NA       NA
    # 4       2       4   2.5       70        2.00      3.00   2.2       40
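
    To reach the exact shape the question expects (index plus the *_match columns), one possible finishing step (again my sketch, not part of the original answer) is to join against a copy of df_2 that carries a row number, keep both sides' join keys with keep = TRUE, and rename in a final select():

    df_2_idx <- mutate(df_2, index = row_number())  # row number of each candidate row

    df_1 %>%
      mutate(age_1_start = age_1 * 0.8, age_1_end = age_1 * 1.2) %>%
      left_join(df_2_idx,
                join_by(shift_1 == shift_2, level_1 == level_2,
                        age_1_start <= age_2, age_1_end >= age_2),
                keep = TRUE) %>%
      select(shift_r = shift_1, level_r = level_1, age_r = age_1, length_r = length_1,
             index,
             shift_match = shift_2, level_match = level_2,
             age_match = age_2, length_match = length_2)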
    

    If you know SQL, then the sqldf version might be the most intuitive and easiest to absorb. Keep in mind, though, that for larger frames it copies the entire frame into an in-memory SQLite instance ... which is not "free".

    The fuzzyjoin implementation gives you a lot of power, and its arguments seem (to me) easy to follow. The results are named as I would expect. However, it is the slowest (with this data) of the three benchmarked implementations. (This should only be a concern if your real data is "very" large.)

    Despite its blazing speed, data.table's dialect of R can be a bit obscure if you don't already know it. I believe it has as much power as fuzzyjoin, though I haven't tested all the corner cases to see if one supports something the other does not.

    bench::mark(
      fuzzyjoin = fuzzyjoin::fuzzy_left_join(
        df_1, df_2,
        by = c("shift_1" = "shift_2", "level_1" = "level_2",
               "age_1_start" = "age_2", "age_1_end" = "age_2"),
        match_fun = list(`==`, `==`, `<=`, `>=`)),
      data.table = DT_2[DT_1, on = .(shift_2 == shift_1, level_2 == level_1, age_2 >= age_1_start, age_2 <= age_1_end)],
      sqldf = sqldf::sqldf(
        "select t1.*, t2.*
         from df_1 t1
           left join df_2 t2 on t1.shift_1 = t2.shift_2 and t1.level_1 = t2.level_2
             and t1.age_1_start <= t2.age_2 and t1.age_1_end >= t2.age_2"),
      check = FALSE
    )
    # # A tibble: 3 x 13
    #   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory        time      gc       
    #   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>        <list>    <list>   
    # 1 fuzzyjoin  134.12ms 143.24ms      6.98     107KB     6.98     2     2      286ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
    # 2 data.table   2.14ms   2.63ms    335.       114KB     2.06   163     1      487ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~
    # 3 sqldf       21.14ms  22.72ms     42.9      184KB     4.52    19     2      442ms <NULL> <Rprofmem[,3~ <bch:tm ~ <tibble ~