Search code examples
rmergematchlookup-tables

Merge and/or match based on lookup table in r


I have a data frame with a lot of data and I also have a lookup/reference data frame. I'd like to add the unique ID from the lookup/reference data frame based upon three criteria found in the lookup/reference data frame.

Data

df <- data.frame(fruit = c("mango", "strawberry", "strawberry", "apple", "mango"),
                 size = c(34, 17, 29, 10, 1),
                 var_x = c("X4", "X1", "X1a", "X1", "X1b")
)

       fruit size var_x
1      mango   34    X4
2 strawberry   17    X1
3 strawberry   29   X1a
4      apple   10    X1
5      mango    1   X1b

Lookup

lookup <- data.frame(unique_id = LETTERS[1:7],
                                         fruit = c("apple", "apple", "strawberry", "mango", "mango", "mango", "mango"),
                                         size = c("1:9", "1:9", "10:40", "1:100", "1:9", "1:9", "10:40"),
                                         var_x = c("c('X1', 'X1a', 'X1b')",  "X3", "c('X1', 'X1a', 'X1b')", "c('X1', 'X1a', 'X1b')", "X2", "X3","c('X1', 'X4')")
)

  unique_id      fruit  size                 var_x
1         A      apple   1:9 c('X1', 'X1a', 'X1b')
2         B      apple   1:9                    X3
3         C strawberry 10:40 c('X1', 'X1a', 'X1b')
4         D      mango 1:100 c('X1', 'X1a', 'X1b')
5         E      mango   1:9                    X2
6         F      mango   1:9                    X3
7         G      mango 10:40         c('X1', 'X4')

Within the lookup data frame, size is a range and var_x can be one or multiple options.

Ideally, I'm trying to write a function to produce the following, where it iterates through each column. If all of the criteria are met, then it assigns the appropriate unique_id from the lookup data frame. In my example, row 1 is matched with G because fruit == "mango" & size %in% 10:40 & var_x %in% c("X1", "X4") :

  unique_id      fruit size var_x
1         G      mango   34    X4
2         C strawberry   17    X1
3         C strawberry   29   X1a
4      <NA>      apple   10    X1
5         B      apple    9    X3

Basically, I need to match/merge based on the lookup table, which potentially contains multiple options per variable. I believe I need to parse the expressions in some way and then evaluate.


Solution

  • Here, I convert the lookup table into something a little easier to work with, separating the size range into two columns, and giving each var_x value its own row.

    library(tidyverse)
    
    lookup2 <- lookup |>
      separate(size, c("size_min", "size_max"), convert = TRUE) |>
      mutate(var_x = var_x |> 
               str_remove_all("c\\('|'|\\)")) |>  # to remove c(' and ' and )
      separate_longer_delim(var_x, ", ")
    

    Then it's just a join: (using dplyr 1.1.0+, for non-equi joins)

    df |>
      left_join(lookup2,
                join_by(fruit, between(size, size_min, size_max), var_x))
        
    

    Result (I think your example output has two typos)

           fruit size var_x unique_id size_min size_max
    1      mango   34    X4         G       10       40
    2 strawberry   17    X1         C       10       40
    3 strawberry   29   X1a         C       10       40
    4      apple   10    X1      <NA>       NA       NA
    5      mango    1   X1b         D        1      100