Search code examples
rfuzzyjoin

Inexact joining data based on greater equal condition


I have some values in df:

# A tibble: 7 × 1
   var1
  <dbl>
1     0
2    10
3    20
4   210
5   230
6   266
7   267

that I would like to compare to a second dataframe called value_lookup

# A tibble: 4 × 2
   var1 value
  <dbl> <dbl>
1     0     0
2   200    10
3   230    20
4   260    30

In particual I would like to make a join based on >= meaning that a value that is greater or equal to the number in var1 gets a values of x. E.g. take the number 210 of the orginal dataframe. Since it is >= 200 and <230 it would get a value of 10.

Here is the expected output:

  var1 value
1    0     0
2   10     0
3   20     0
4  210    10
5  230    20
6  266    30
7  267    30

I thought it should be doable using {fuzzyjoin} but I cannot get it done.

value_lookup <- tibble(var1 = c(0, 200,230,260),
                       value = c(0,10,20,30))

df <- tibble(var1 = c(0,10,20,210,230,266,267))

library(fuzzyjoin)
fuzzyjoin::fuzzy_left_join(
  x = df, 
  y = value_lookup ,
  by = "var1",
  match_fun = list(`>=`)
) 

Solution

  • An option is also findInterval:

    df$value <- value_lookup$value[findInterval(df$var1, value_lookup$var1)]
    

    Output:

      var1 value
    1    0     0
    2   10     0
    3   20     0
    4  210    10
    5  230    20
    6  266    30
    7  267    30
    

    As you're mentioning joins, you could also do a rolling join via data.table with the argument roll = T which would look for same or closest value preceding var1 in your df:

    library(data.table)
    
    setDT(value_lookup)[setDT(df), on = 'var1', roll = T]