I have a data frame with a lot of data and I also have a lookup/reference data frame. I'd like to add the unique ID from the lookup/reference data frame based upon three criteria found in the lookup/reference data frame.
Data
df <- data.frame(fruit = c("mango", "strawberry", "strawberry", "apple", "mango"),
size = c(34, 17, 29, 10, 1),
var_x = c("X4", "X1", "X1a", "X1", "X1b")
)
fruit size var_x
1 mango 34 X4
2 strawberry 17 X1
3 strawberry 29 X1a
4 apple 10 X1
5 mango 1 X1b
Lookup
lookup <- data.frame(unique_id = LETTERS[1:7],
fruit = c("apple", "apple", "strawberry", "mango", "mango", "mango", "mango"),
size = c("1:9", "1:9", "10:40", "1:100", "1:9", "1:9", "10:40"),
var_x = c("c('X1', 'X1a', 'X1b')", "X3", "c('X1', 'X1a', 'X1b')", "c('X1', 'X1a', 'X1b')", "X2", "X3","c('X1', 'X4')")
)
unique_id fruit size var_x
1 A apple 1:9 c('X1', 'X1a', 'X1b')
2 B apple 1:9 X3
3 C strawberry 10:40 c('X1', 'X1a', 'X1b')
4 D mango 1:100 c('X1', 'X1a', 'X1b')
5 E mango 1:9 X2
6 F mango 1:9 X3
7 G mango 10:40 c('X1', 'X4')
Within the lookup data frame, size
is a range and var_x
can be one or multiple options.
Ideally, I'm trying to write a function to produce the following, where it iterates through each column. If all of the criteria are met, then it assigns the appropriate unique_id
from the lookup data frame. In my example, row 1 is matched with G
because fruit == "mango"
& size %in% 10:40
& var_x %in% c("X1", "X4")
:
unique_id fruit size var_x
1 G mango 34 X4
2 C strawberry 17 X1
3 C strawberry 29 X1a
4 <NA> apple 10 X1
5 B apple 9 X3
Basically, I need to match/merge based on the lookup table, which potentially contains multiple options per variable. I believe I need to parse the expressions in some way and then evaluate.
Here, I convert the lookup table into something a little easier to work with, separating the size range into two columns, and giving each var_x
value its own row.
library(tidyverse)
lookup2 <- lookup |>
separate(size, c("size_min", "size_max"), convert = TRUE) |>
mutate(var_x = var_x |>
str_remove_all("c\\('|'|\\)")) |> # to remove c(' and ' and )
separate_longer_delim(var_x, ", ")
Then it's just a join: (using dplyr 1.1.0+, for non-equi joins)
df |>
left_join(lookup2,
join_by(fruit, between(size, size_min, size_max), var_x))
Result (I think your example output has two typos)
fruit size var_x unique_id size_min size_max
1 mango 34 X4 G 10 40
2 strawberry 17 X1 C 10 40
3 strawberry 29 X1a C 10 40
4 apple 10 X1 <NA> NA NA
5 mango 1 X1b D 1 100