r hash append data-cleaning lookup-tables

Appending values from a look-up table to columns of another data frame based on trailing zero patterns

Data frame dat includes a set of numeric ids in a vector called code_num. Some of these ids end with one or more zeros. Others do not. Here are the first three lines:

code_num   X1   X2   X3   …   X50
251000     NA   NA   NA       NA        
112020     NA   NA   NA       NA        
537199     NA   NA   NA       NA

The full data of dat are in the first tab of this Google Sheet.

Another data frame lut includes another set of numeric ids called code_num_moredetail that need to be associated with the higher-level identifiers in dat. Here are seven example observations of lut:

code_num_moredetail
251000.99
251743.00
251222.02
112020.01
112029.01
537119.00
537119.99

The full data of lut are in the second tab of this Google Sheet.

The trailing zeros in dat$code_num are wild card digits. Any value of lut$code_num_moredetail that matches the digits preceding the trailing zeros of dat$code_num should be considered a matching value, and needs to be added to the ith value of dat$X1 through dat$X50 (or beyond - I'm not certain how many matches to expect).

Consider two example cases:

If dat$code_num = 999000, then every value of lut$code_num_moredetail that matches the pattern 999###.## would need to be inserted into the columns that begin with the letter X in dat.
If dat$code_num = 999090, then every value of lut$code_num_moredetail that matches the pattern 99909#.## would need to be inserted into the columns that begin with the letter X in dat.
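
To make the two cases concrete, here is a minimal sketch of how a code could be turned into a regular expression in which each trailing zero stands for any digit. The helper name wildcard_pattern is hypothetical (it is not part of any package), and the sketch assumes the detailed ids are compared as text with exactly two decimal places.

```r
# Hypothetical helper: 999090 -> "^99909[0-9]\\.[0-9]{2}$"
wildcard_pattern <- function(code_num) {
  # Render the code as plain digits (no scientific notation, no padding)
  code_chr <- format(code_num, scientific = FALSE, trim = TRUE)
  # The stem is everything before the trailing zeros
  stem <- sub("0+$", "", code_chr)
  # Each trailing zero becomes one "any digit" position
  n_wild <- nchar(code_chr) - nchar(stem)
  paste0("^", stem, strrep("[0-9]", n_wild), "\\.[0-9]{2}$")
}

# Assumed sample values, formatted with two decimals before matching
candidates <- sprintf("%.2f", c(999091.50, 999190.00))
hits_090 <- grepl(wildcard_pattern(999090), candidates)
hits_000 <- grepl(wildcard_pattern(999000), candidates)
```

With these assumed values, only the first candidate fits 99909#.##, while both fit 999###.##.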

Using only the values provided in the example data frames, the final solution would make dat look like this:

code_num              X1          X2          X3
251000                251000.99   251743.00   251222.02
112020                112020.01   112029.01   NA
537199                537119.00   537119.99   NA

I'm seeking an efficient way to augment dat with all wild-card-matched values of lut.

Note: some values of dat$code_num may not match any value of lut$code_num_moredetail - a proper solution must accommodate i matches, where i can range from 0 to 50.
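
The zero-match requirement can be illustrated with a small base R sketch that applies the wild card rule arithmetically (drop as many trailing digits as there are trailing zeros, then compare stems) and pads every row with NA up to the widest match. This is only one possible reading of the rule, not the method used in the answer below. Under this strict reading, 537199 has no wild card digits, so it matches nothing in the sample lut and shows the zero-match case.

```r
dat <- data.frame(code_num = c(251000L, 112020L, 537199L))
lut <- data.frame(code_num_moredetail = c(251000.99, 251743, 251222.02,
                                          112020.01, 112029.01, 537119, 537119.99))

# Count the trailing zeros (wild card digits) of each code
code_chr <- format(dat$code_num, scientific = FALSE, trim = TRUE)
n_wild <- nchar(code_chr) - nchar(sub("0+$", "", code_chr))

# A detailed id matches when removing the wild card digits leaves the same stem
matches <- lapply(seq_along(dat$code_num), function(i) {
  scale <- 10^n_wild[i]
  detail <- lut$code_num_moredetail
  detail[floor(detail / scale) == dat$code_num[i] %/% scale]
})

# Pad every match vector with NA so rows with 0, 1, ... matches line up
width <- max(lengths(matches), 1L)
wide <- do.call(rbind, lapply(matches, function(m) {
  c(m, rep(NA_real_, width - length(m)))
}))
colnames(wide) <- paste0("X", seq_len(width))

out <- cbind(dat["code_num"], wide)
```

The number of X columns is taken from the data, so nothing has to be assumed about the upper limit of 50.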

Solution

Try

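library(dplyr)
library(tidyr)
library(data.table)  # for rowid()
library(stringr)

out <- lut %>%
    # key each detailed id by its first three digits
    # (this assumes the first three digits identify the group, as in the sample data)
    mutate(new = substr(code_num_moredetail, 1, 3)) %>%
    # attach the higher-level code that shares the same key
    left_join(dat %>%
                  transmute(code_num, new = substr(code_num, 1, 3)),
              by = "new") %>%
    # number the matches within each key: X1, X2, ...
    mutate(rn = str_c("X", rowid(new))) %>%
    pivot_wider(names_from = rn, values_from = code_num_moredetail) %>%
    select(-new)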

-output

out
# A tibble: 3 x 4
  code_num      X1      X2      X3
     <int>   <dbl>   <dbl>   <dbl>
1   251000 251001. 251743  251222.
2   112020 112020. 112029.     NA 
3   537199 537119  537120.     NA

The decimal digits are still in the data. The rounding is only how the tibble prints:

print(out$X3, digits = 10)
[1] 251222.02        NA        NA
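
As a further check that nothing is lost, the values can be formatted explicitly with base R. This is a small sketch on an assumed vector that mimics out$X3 plus one extra value. If the tibble display itself should show more digits, the pillar package's pillar.sigfig option is, as far as I know, the setting to look at.

```r
# Assumed values standing in for a column such as out$X3
x <- c(251222.02, NA, 537119.99)

# Fixed two-decimal rendering; NA is kept as "NA"
shown <- sprintf("%.2f", x)

# The stored number is unchanged by how it is displayed
same_value <- x[1] == 251222.02
```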

Or maybe

library(fuzzyjoin)

# strip the trailing (wild card) zeros from each code
dat1 <- dat %>%
    transmute(code_num, new = sub("0+$", "", code_num))

# drop the decimals, then reduce each detailed id to whichever stem it contains
lut$new <- str_replace(sub("\\..*", "", sprintf("%.2f", lut[[1]])),
                       paste0(".*(", paste(dat1$new, collapse = "|"), ").*"), "\\1")

# fuzzy-join on the stems, then reshape to one row per code
stringdist_left_join(lut, dat1, by = "new") %>%
    select(code_num_moredetail, code_num, new = new.x) %>%
    mutate(rn = str_c("X", rowid(new))) %>%
    pivot_wider(names_from = rn, values_from = code_num_moredetail) %>%
    select(-new)

-output

# A tibble: 3 x 4
  code_num      X1      X2      X3
     <int>   <dbl>   <dbl>   <dbl>
1   251000 251001. 251743  251222.
2   112020 112020. 112029.     NA 
3   537199 537119  537120.     NA
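
Why 537199 picks up 537119.00 and 537119.99 here: the stems differ by one digit, and stringdist_left_join accepts pairs within its max_dist tolerance (2 by default, to my knowledge). The sketch below uses base R's adist (Levenshtein distance) on the stems from the sample data to show the idea. It is an illustration, not the exact distance method fuzzyjoin applies.

```r
# Stems as they end up in lut$new and dat1$new for the sample data
stems_lut <- c("251", "11202", "537119")
stems_dat <- c("251", "11202", "537199")

# Pairwise edit distances: rows are lut stems, columns are dat stems
d <- adist(stems_lut, stems_dat)
dimnames(d) <- list(stems_lut, stems_dat)

# Pairs that a tolerance of 2 would accept
accepted <- d <= 2
```

Only the diagonal pairs fall within the tolerance, and the last one ("537119" vs "537199") does so with a distance of 1 rather than 0. A fuzzy join of this kind can therefore match codes that the strict wild card rule would reject.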

data

lut <- structure(list(code_num_moredetail = c(251000.99, 251743, 251222.02,
    112020.01, 112029.01, 537119, 537119.99)),
    row.names = c(NA, -7L), class = "data.frame")

dat <- structure(list(code_num = c(251000L, 112020L, 537199L),
    X1 = c(NA, NA, NA), X2 = c(NA, NA, NA), X3 = c(NA, NA, NA)),
    class = "data.frame", row.names = c(NA, -3L))