As input, I've got a large data frame in R with lists of strings of different lengths, referring to certain codes - like this:
glt.code glt.phylogeny
1 adha1238 adha1238
2 adiv1239 adiv1239
3 adiw1235 adiw1235
4 aerr1238 jikr1238;jikr1238;aerr1238
I would like to replace the codes in the column glt.phylogeny with their names from this look-up table:
  code     name                       level
1 adha1238 Adhari                     language
2 adiv1239 Kotia-Adivasi Oriya-Desiya language
3 adiw1235 Adiwasi Garasia            language
4 aerr1238 Aer                        language
5 jikr1238 Jikrio Aer                 dialect
My desired output looks like this:
glt.code glt.phylogeny.names
1 adha1238 Adhari
2 adiv1239 Kotia-Adivasi Oriya-Desiya
3 adiw1235 Adiwasi Garasia
4 aerr1238 Jikrio Aer;Jikrio Aer;Aer
I'm looking for a pipelined (dplyr) solution that replaces all substrings in a data frame column based on a look-up table. I've experimented with str_replace_all and stri_replace_all_fixed, based on other questions on Stack Overflow, but without a working result.
The actual data frame and look-up table are much larger than this, so a scalable solution would be appreciated.
I think this will get you what you want: extract a named vector from your look-up table and pass it to str_replace_all:
library(tidyverse)

# data
df <- tibble(
  glt.code = c("adha1238", "adiv1239", "adiw1235", "aerr1238"),
  glt.phylogeny = c("adha1238", "adiv1239", "adiw1235", "jikr1238;jikr1238;aerr1238")
)

df_look_up <- tibble(
  code = c("adha1238", "adiv1239", "adiw1235", "aerr1238", "jikr1238"),
  name = c("Adhari", "Kotia-Adivasi Oriya-Desiya", "Adiwasi Garasia", "Aer", "Jikrio Aer"),
  level = c("language", "language", "language", "language", "dialect")
)

# create named vector: names are the codes (patterns), values are the replacement names
named_look_up_vector <- df_look_up %>%
  pull(name, code)

# use str_replace_all: every code found in the string is swapped for its name
df %>%
  mutate(glt.phylogeny = str_replace_all(glt.phylogeny, named_look_up_vector))

# # A tibble: 4 × 2
#   glt.code glt.phylogeny
#   <chr>    <chr>
# 1 adha1238 Adhari
# 2 adiv1239 Kotia-Adivasi Oriya-Desiya
# 3 adiw1235 Adiwasi Garasia
# 4 aerr1238 Jikrio Aer;Jikrio Aer;Aer
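
Since the question asks for something scalable: with a named vector, str_replace_all treats each name as a regular expression and applies the replacements one after another, so a very large look-up table may get slow. Below is a minimal sketch of an alternative that splits each string on ";", looks every code up by exact name, and pastes the pieces back together. The helper replace_codes and the vector code_to_name are names made up for this illustration, and keeping unknown codes unchanged is an assumption about what you'd want; the result goes into a new column glt.phylogeny.names, as in the desired output.

```r
library(dplyr)
library(stringr)
library(purrr)

df <- tibble(
  glt.code = c("adha1238", "adiv1239", "adiw1235", "aerr1238"),
  glt.phylogeny = c("adha1238", "adiv1239", "adiw1235", "jikr1238;jikr1238;aerr1238")
)

df_look_up <- tibble(
  code = c("adha1238", "adiv1239", "adiw1235", "aerr1238", "jikr1238"),
  name = c("Adhari", "Kotia-Adivasi Oriya-Desiya", "Adiwasi Garasia", "Aer", "Jikrio Aer"),
  level = c("language", "language", "language", "language", "dialect")
)

# named character vector: names are the codes, values are the names
code_to_name <- setNames(df_look_up$name, df_look_up$code)

# hypothetical helper (not part of any package): exact-match replacement
# of every code in a separator-delimited string
replace_codes <- function(x, lookup, sep = ";") {
  parts <- str_split(x, fixed(sep))
  map_chr(parts, function(codes) {
    mapped <- unname(lookup[codes])
    # assumption: codes missing from the look-up table are left as they are
    missing <- is.na(mapped)
    mapped[missing] <- codes[missing]
    str_c(mapped, collapse = sep)
  })
}

result <- df %>%
  mutate(glt.phylogeny.names = replace_codes(glt.phylogeny, code_to_name)) %>%
  select(glt.code, glt.phylogeny.names)

result
```

Because the look-up is by exact name rather than by pattern, a code can never be matched inside a longer string, and each code costs one vector index instead of a pass over the whole column. I haven't benchmarked this against str_replace_all on your data, so treat the speed difference as something to check.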