Search code examples
rleft-joindata-manipulation

Left join dataframe in R based on the closest/nearest values in the right-hand-side dataframe


t1 <- data.frame(
    team = c('a', 'b', 'c', 'd', 'e'),
    value1 = c(0.285, 0.37, 0.45, 0.42, 0.385),
    value2 = c(41, 51, 55, 61, 64)
)
  
pctiles = data.frame(
    pctile = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
    value1 = c(0.253, 0.291, 0.325, 0.336, 0.345, 0.35, 0.367, 0.39, 0.4, 0.41, 0.435),
    value2 = c(35, 50, 54, 57, 59, 61, 62, 65, 71, 81, 95)
)

we need to join pctile values from the pctiles dataframe onto the t1 dataframe, based on which value in the pctiles dataframe that are closest to the values in the t1 dataframe. For example, in t1, team A has a value1 of 0.285, which is closest with 0.291 in pctiles$value1. As a result, a value1pctile of 1 should be joined onto t1. By this logic, joining the closest pctile for each value, the output we are going for would be:

output_df <- data.frame(
    team = c('a', 'b', 'c', 'd', 'e'),
    value1 = c(0.285, 0.37, 0.45, 0.43, 0.385),
    value2 = c(41, 51, 55, 61, 64),
    value1pctile = c(1, 6, 10, 9, 7),
    value2pctile = c(0, 1, 3, 5, 7)
)

We are not worried about how situations are handled when a value in t1 falls exactly between two values in pctiles. Either the higher or lower pctile value is just fine. How can we achieve this in R?

Edit: We are trying to use fuzzy_left_join but are hitting an error:

t1 <- fuzzyjoin::fuzzy_left_join(
  t1, pctiles, 
  by = c("value1" = "value1"), 
  match_fun = "min_diff",
  distance_col = "dist"
)

> Error in which(m) : argument to 'which' is not logical

it seems we are using match_fun and distance_col incorrectly.


Solution

  • With data.table:

    library(data.table)
    setDT(t1)
    setDT(pctiles)
    
    t1[, value1pctile := pctiles[.SD, on = "value1", roll = "nearest", pctile]]
    t1[, value2pctile := pctiles[.SD, on = "value2", roll = "nearest", pctile]]
    
    #      team value1 value2 value1pctile value2pctile
    #    <char>  <num>  <num>        <num>        <num>
    # 1:      a  0.285     41            1            0
    # 2:      b  0.370     51            6            1
    # 3:      c  0.450     55           10            2
    # 4:      d  0.420     61            9            5
    # 5:      e  0.385     64            7            7