Data frame dat includes a set of numeric ids in a vector called code_num. Some of these ids end with one or more zeros; others do not. Here are the first three lines:
code_num X1 X2 X3 … X50
251000 NA NA NA NA
112020 NA NA NA NA
537199 NA NA NA NA
The full data of dat are in the first tab of this google sheet.
Another data frame, lut, includes another set of numeric ids called code_num_moredetail that need to be associated with the higher-level identifiers in dat. Here are seven example observations of lut:
code_num_moredetail
251000.99
251743.00
251222.02
112020.01
112029.01
537119.00
537119.99
The full data of lut are in the second tab of this google sheet.
The trailing zeros in dat$code_num are wildcard digits. Any value of lut$code_num_moredetail that matches the digits preceding the trailing zeros of dat$code_num should be considered a matching value, and needs to be added to the ith row of dat in columns dat$X1 through dat$X50 (or beyond; I'm not certain how many matches to expect).
Consider two example cases:
If dat$code_num = 999000, then every value of lut$code_num_moredetail that matched the pattern 999###.## would need to be inserted into the columns that begin with the letter X in dat.
If dat$code_num = 999090, then every value of lut$code_num_moredetail that matched the pattern 99909#.## would need to be inserted into the columns that begin with the letter X in dat.
Using only the values provided in the example data frames, the final solution would make dat look like this:
code_num X1 X2 X3
251000 251000.99 251743.00 251222.02
112020 112020.01 112029.01 NA
537199 537119.00 537119.99 NA
I'm seeking an efficient way to augment dat with all wildcard-matched values of lut.
Note: some values of dat$code_num may not match any value of lut$code_num_moredetail; a proper solution must accommodate i matches, where i can range from 0 to 50.
Try
library(dplyr)
library(tidyr)
library(data.table)  # provides rowid()
library(stringr)
# Key both tables on the first three digits of the id,
# then spread the matches for each key into X1, X2, ...
out <- lut %>%
  mutate(new = substr(code_num_moredetail, 1, 3)) %>%
  left_join(dat %>%
              transmute(code_num, new = substr(code_num, 1, 3)),
            by = "new") %>%
  mutate(rn = str_c("X", rowid(new))) %>%
  pivot_wider(names_from = rn, values_from = code_num_moredetail) %>%
  select(-new)
-output
out
# A tibble: 3 x 4
code_num X1 X2 X3
<int> <dbl> <dbl> <dbl>
1 251000 251001. 251743 251222.
2 112020 112020. 112029. NA
3 537199 537119 537120. NA
The decimal digits are still in the data; the rounding shown above is only how the tibble is printed.
print(out$X3, digits = 10)
[1] 251222.02 NA NA
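As a small base R illustration of the same point (a sketch that does not use the tibble itself; the variable name x is just for this example), formatting a value with a fixed number of decimals shows that nothing was lost in storage:

```r
# One of the lut values from the example data
x <- 251222.02

# Fixed two-decimal text representation, independent of print settings
x_chr <- sprintf("%.2f", x)
x_chr

# The stored number still compares equal to the original literal
x == 251222.02
```

If you would rather see more digits directly in the tibble printout, setting options(pillar.sigfig = 8) before printing should widen the display.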
Or maybe
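library(fuzzyjoin)
# Key for dat: code_num with its trailing zeros removed
dat1 <- dat %>%
  transmute(code_num, new = sub("0+$", "", code_num))
# Key for lut: the integer part of each id, reduced to a dat key when it contains one
lut$new <- str_replace(sub("\\..*", "", sprintf('%.2f', lut[[1]])),
                       paste0(".*(", paste(dat1$new, collapse = "|"), ").*"), "\\1")
# Join on approximately equal keys, then spread the matches into X1, X2, ...
stringdist_left_join(lut, dat1) %>%
  select(code_num_moredetail, code_num, new = new.x) %>%
  mutate(rn = str_c("X", rowid(new))) %>%
  pivot_wider(names_from = rn, values_from = code_num_moredetail) %>%
  select(-new)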
-output
# A tibble: 3 x 4
code_num X1 X2 X3
<int> <dbl> <dbl> <dbl>
1 251000 251001. 251743 251222.
2 112020 112020. 112029. NA
3 537199 537119 537120. NA
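For comparison, here is a minimal base R sketch of the wildcard rule exactly as the question states it: strip the trailing zeros from code_num and keep every detailed id whose text form starts with what remains. It is an illustration rather than a replacement for the pipelines above, and it assumes that code_num is stored as an integer and that every id has the same number of digits before the decimal point. The names prefix, detail_chr, matches, padded and out_base are made up for this example.

```r
# Example data copied from the post
dat <- data.frame(code_num = c(251000L, 112020L, 537199L))
lut <- data.frame(code_num_moredetail = c(251000.99, 251743.00, 251222.02,
                                          112020.01, 112029.01,
                                          537119.00, 537119.99))

# Digits of each code_num that precede the trailing (wildcard) zeros.
# Assumption: code_num is integer, so as.character() never yields "1e+05".
prefix <- sub("0+$", "", as.character(dat$code_num))

# Fixed two-decimal text form of the detailed ids
detail_chr <- sprintf("%.2f", lut$code_num_moredetail)

# For each prefix, collect the detailed ids that start with it
matches <- lapply(prefix, function(p) {
  lut$code_num_moredetail[startsWith(detail_chr, p)]
})

# Pad every match vector with NA to a common width, then bind as X1, X2, ...
width <- max(1L, lengths(matches))
padded <- do.call(rbind, lapply(matches, function(m) {
  c(m, rep(NA_real_, width - length(m)))
}))
colnames(padded) <- paste0("X", seq_len(width))

out_base <- cbind(dat["code_num"], padded)
out_base
```

Under this strict reading, 537199 has no trailing zeros, so it can only match ids that begin with 537199. Neither 537119.00 nor 537119.99 does, so that row is all NA, which differs from the expected table in the question and from the outputs above. The first pipeline matches on the first three digits only, and the second uses an approximate string join, which is why both of them pick those values up.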
lut <- structure(list(code_num_moredetail = c(251000.99, 251743, 251222.02,
    112020.01, 112029.01, 537119, 537119.99)),
    row.names = c(NA, -7L), class = "data.frame")
dat <- structure(list(code_num = c(251000L, 112020L, 537199L),
    X1 = c(NA, NA, NA), X2 = c(NA, NA, NA), X3 = c(NA, NA, NA)),
    class = "data.frame", row.names = c(NA, -3L))