Efficient Way to Replace / Lookup Tokens in a String

I have a large vector x (~200k elements) where each element is a comma-separated string. I also have a small lookup table lkp, which maps old strings to new strings. For the sake of this example, let's assume that it is simply a named vector.

What I want to do is:

Split each element of x into its tokens
Replace the tokens with the help of lkp
Remove duplicates from the replacements and sort them

A rather straightforward implementation looks like this:

library(stringr)
x <- c("a,b,c", "c", "b,c", "a,b")
lkp <- c(a = "A", b = "A", c = "B")

tokens <- str_split(x, fixed(","))
lapply(tokens, \(t) sort(unique(lkp[t])))

# [[1]]
# [1] "A" "B"

# [[2]]
# [1] "B"

# [[3]]
# [1] "A" "B"

# [[4]]
# [1] "A"

I observed that while str_split is super fast, lapply can take quite some time:

library(tictoc)
xbig <- x[sample(length(x), 2e6, TRUE)]
tic("str_split")
tokens <- str_split(xbig, fixed(","))
toc()
# str_split: 0.89 sec elapsed

tic("lapply")
res <- lapply(tokens, \(t) sort(unique(lkp[t])))
toc()
# lapply: 60.33 sec elapsed

So I was wondering whether there is a smarter way of doing this that takes advantage of vectorization.

If I drop the uniqueness and sorting requirements, a better way would be:

tic("split")
res <- split(lkp[unlist(tokens)], rep(seq_along(tokens), lengths(tokens)))
toc()
# split: 2.89 sec elapsed

I was also wondering (given the speed of str_split) whether there is a regex solution that could benefit from stringr's speed.

Solution

We can:

Use str_replace_all with the named vector lkp to replace the tokens directly in the strings (the names of lkp act as patterns, the values as replacements).
Split the result on commas using str_split.
Get the unique values.

library(stringr)
library(tictoc)


lapply(str_split(str_replace_all(x, lkp), fixed(",")), unique)

#[[1]]
#[1] "A" "B"

#[[2]]
#[1] "B"

#[[3]]
#[1] "A" "B"

#[[4]]
#[1] "A"
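One caveat worth checking before using this on real data: str_replace_all treats the names of lkp as regex patterns, so they also match inside longer tokens. The sketch below illustrates this with made-up data (x2, lkp2 and lkp_anchored are hypothetical names, not from the question) and shows one possible workaround, assuming the tokens consist only of word characters so that \b word boundaries can delimit them.

```r
library(stringr)

# Hypothetical data: "ab" is a token in its own right and has no entry in the lookup
x2   <- "ab,a"
lkp2 <- c(a = "A", b = "A")

# Plain names are matched anywhere in the string, so "ab" is rewritten too
naive <- str_replace_all(x2, lkp2)

# Assumption: tokens contain only word characters, so \b can mark token borders
lkp_anchored <- setNames(lkp2, paste0("\\b", names(lkp2), "\\b"))
anchored <- str_replace_all(x2, lkp_anchored)

print(naive)     # "AA,A"
print(anchored)  # "ab,A"
```

If the tokens can contain regex metacharacters, the names would additionally need escaping. If a replacement value can itself be a key of the lookup table, it may also be worth checking whether replacements chain.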

On the bigger vector:

tic("str_replace_all")
step1 <- str_replace_all(xbig, lkp)
toc()

#str_replace_all: 1.288 sec elapsed

tic("str_split")
step2 <- str_split(step1, fixed(","))
toc()

#str_split: 1.328 sec elapsed

tic("unique")
step3 <- lapply(step2, unique)
toc()
#unique: 5.955 sec elapsed

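The timing is better than what you currently have. Note that unique alone does not sort, so you would still need sort(unique(...)) if the order matters; the example output above is sorted only because of the order of the tokens in x. I think steps 2 and 3 could be done in one step with a regex, without splitting the string, but I can't come up with a suitable regex right now.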
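A possible alternative that is not regex-based: do the lookup on the flattened tokens, sort all (element index, value) pairs once, drop adjacent repeats, and only then split back into a list. This is a minimal sketch under two assumptions: every token has an entry in lkp (so there are no NAs), and base strsplit stands in for str_split so the block has no dependencies. The names grp, val, o and keep are illustrative, and I have not timed it on the 2e6-element vector.

```r
x   <- c("a,b,c", "c", "b,c", "a,b")
lkp <- c(a = "A", b = "A", c = "B")

# Base R stand-in for str_split(x, fixed(","))
tokens <- strsplit(x, ",", fixed = TRUE)

# Flatten: one entry per token, remembering which element of x it came from
grp <- rep(seq_along(tokens), lengths(tokens))
val <- unname(lkp[unlist(tokens)])

# A single global sort by (element index, replaced value)
o   <- order(grp, val)
grp <- grp[o]
val <- val[o]

# After sorting, duplicates within an element are adjacent:
# keep an entry if it differs from its predecessor in index or value
n    <- length(val)
keep <- c(TRUE, grp[-1] != grp[-n] | val[-1] != val[-n])

# Split back into one character vector per element of x
res <- unname(split(val[keep], grp[keep]))
print(res)
```

On the toy data this gives the same list as lapply(tokens, \(t) sort(unique(lkp[t]))). Passing method = "radix" to order might speed up the sort further, at the cost of C-locale ordering for the strings.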