Search code examples
rdplyrfuzzyjoin

Join with closest value between two values in R


I was working in the following problem. I've got monthly data from a survey, let's call it df:

df1 = tibble(ID = c('1','2'), reported_value = c(1200, 31000), anchor_month = c(3,5))

ID     reported_value   anchor_month
1          1200             3
2          31000            5

So, the first row was reported in March, but there's no way to know if it's reporting March or February values and also can be an approximation to the real value. I've also got a table with actual values for each ID, let's call it df2:

df2 = tibble( ID = c('1', '2') %>% rep(4) %>% sort,
             real_value = c(1200,1230,11000,10,25000,3100,100,31030),
            month = c(1,2,3,4,2,3,4,5))


ID     real_value         month
1          1200             1
1          1230             2
1          11000            3
1          10               4
2          25000            2
2          3100             3
2          100              4
2          31030            5

So there's two challenges: first, I only care about the anchor month OR the previous month to the anchor month of each ID and then I want to match to the closest value (sounds like fuzzy join). So, my first challenge was to filter my second table so it only has the anchor month or the previous one, which I did doing the following:

filter_aux = df1 %>% 
 bind_rows(df1 %>% mutate(anchor_month = if_else(anchor_month == 1, 12, anchor_month- 1)))
df2 =  df2 %>% 
inner_join(filter_aux , by=c('ID', 'month' = 'anchor_month'))  %>% distinct(ID, month)

Reducing df2 to:

ID     real_value         month
1          1230             2
1          11000            3
2          100              4
2          31030            5

Now I tried to do a difference_inner_join by ID and reported_value = real_value, (df1 %>% difference_inner_join(df2, by= c('ID', 'reported_value' = 'real_value'))) but it's bringing a non-numeric argument to binary operator error I'm guessing because ID is a string in my actual data. What gives? I'm no expert in fuzzy joins, so I guess I'm missing something.

My final dataframe would look like this:

ID     reported_value   anchor_month closest_value month
1          1200             3            1230        2
2          31000            5            31030       5

Thanks!


Solution

  • It was easier without fuzzy_join:

    df3 = df1 %>% left_join(df2 , by='ID') %>%
      mutate(dif = abs(real_value - reported_value)) %>%
      group_by(ID) %>% filter(dif == min(dif))
    

    Output:

      ID    reported_value anchor_month real_value month   dif
      <chr>          <dbl>        <dbl>      <dbl> <dbl> <dbl>
    1 1               1200            3       1230     2    30
    2 2              31000            5      31030     5    30