t1 <- data.frame(
team = c('a', 'b', 'c', 'd', 'e'),
value1 = c(0.285, 0.37, 0.45, 0.42, 0.385),
value2 = c(41, 51, 55, 61, 64)
)
pctiles = data.frame(
pctile = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
value1 = c(0.253, 0.291, 0.325, 0.336, 0.345, 0.35, 0.367, 0.39, 0.4, 0.41, 0.435),
value2 = c(35, 50, 54, 57, 59, 61, 62, 65, 71, 81, 95)
)
we need to join pctile
values from the pctiles
dataframe onto the t1
dataframe, based on which value in the pctiles
dataframe that are closest to the values in the t1
dataframe. For example, in t1
, team A has a value1 of 0.285, which is closest with 0.291 in pctiles$value1
. As a result, a value1pctile
of 1 should be joined onto t1. By this logic, joining the closest pctile for each value, the output we are going for would be:
output_df <- data.frame(
team = c('a', 'b', 'c', 'd', 'e'),
value1 = c(0.285, 0.37, 0.45, 0.43, 0.385),
value2 = c(41, 51, 55, 61, 64),
value1pctile = c(1, 6, 10, 9, 7),
value2pctile = c(0, 1, 3, 5, 7)
)
We are not worried about how situations are handled when a value in t1
falls exactly between two values in pctiles
. Either the higher or lower pctile
value is just fine. How can we achieve this in R?
Edit: We are trying to use fuzzy_left_join
but are hitting an error:
t1 <- fuzzyjoin::fuzzy_left_join(
t1, pctiles,
by = c("value1" = "value1"),
match_fun = "min_diff",
distance_col = "dist"
)
> Error in which(m) : argument to 'which' is not logical
it seems we are using match_fun
and distance_col
incorrectly.
With data.table
:
library(data.table)
setDT(t1)
setDT(pctiles)
t1[, value1pctile := pctiles[.SD, on = "value1", roll = "nearest", pctile]]
t1[, value2pctile := pctiles[.SD, on = "value2", roll = "nearest", pctile]]
# team value1 value2 value1pctile value2pctile
# <char> <num> <num> <num> <num>
# 1: a 0.285 41 1 0
# 2: b 0.370 51 6 1
# 3: c 0.450 55 10 2
# 4: d 0.420 61 9 5
# 5: e 0.385 64 7 7