
Efficient Way to Replace / Lookup Tokens in a String


I have a large vector x (~200k elements) in which each element is a comma-separated string. I also have a small lookup table lkp that maps old strings to new strings. For the sake of this example, let's assume that it is simply a named vector.

What I want to do is:

  1. Split each element of x into its tokens
  2. Replace the tokens with the help of lkp
  3. Remove duplicates from the replacements and sort the result

A rather straightforward implementation looks like this:

library(stringr)
x <- c("a,b,c", "c", "b,c", "a,b")
lkp <- c(a = "A", b = "A", c = "B")

tokens <- str_split(x, fixed(","))
lapply(tokens, \(t) sort(unique(lkp[t])))

# [[1]]
# [1] "A" "B"

# [[2]]
# [1] "B"

# [[3]]
# [1] "A" "B"

# [[4]]
# [1] "A"

I observed that while str_split is super fast, lapply can take quite some time:

library(tictoc)
xbig <- x[sample(length(x), 2e6, TRUE)]
tic("str_split")
tokens <- str_split(xbig, fixed(","))
toc()
# str_split: 0.89 sec elapsed

tic("lapply")
res <- lapply(tokens, \(t) sort(unique(lkp[t])))
toc()
# lapply: 60.33 sec elapsed

So I was wondering whether there is a smarter way of doing this that takes advantage of vectorization.

If I drop the uniqueness and sorting requirements, a better way would be:

tic("split")
res <- split(lkp[unlist(tokens)], rep(seq_along(tokens), lengths(tokens)))
toc()
# split: 2.89 sec elapsed
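
The flattened representation can also keep the uniqueness and sorting requirements. The following is a minimal sketch, not a benchmarked solution: it encodes each (element, replacement) pair as a single number, so that one call to unique and one call to sort handle all elements at once. The helper names id, lv, K, code and key are made up for this illustration, base R's strsplit stands in for str_split so the block runs without extra packages, and it assumes lkp has only a handful of distinct values.

```r
x   <- c("a,b,c", "c", "b,c", "a,b")
lkp <- c(a = "A", b = "A", c = "B")

# Base-R stand-in for str_split(x, fixed(","))
tokens <- strsplit(x, ",", fixed = TRUE)

id   <- rep(seq_along(tokens), lengths(tokens))  # element each token came from
lv   <- sort(unique(lkp))                        # distinct replacements, sorted
K    <- length(lv)
code <- match(lkp[unlist(tokens)], lv)           # position of each replacement in lv

# One number per (element, replacement) pair. Sorting the distinct keys
# orders them by element first and by replacement second.
key <- sort(unique((id - 1) * K + (code - 1)))

# Decode the keys and split back into one character vector per element
res <- unname(split(lv[key %% K + 1],
                    factor(key %/% K + 1, levels = seq_along(tokens))))
```

On the four-element example this gives the same list as the lapply version. Whether it is faster on the real data would need to be measured.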

I was also wondering (given how fast str_split is) whether there is a regex solution that could benefit from stringr's speed.


Solution

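  • We can:

    1. Use str_replace_all with the named vector lkp to do the replacement directly in each string. The names of lkp are treated as regular expressions and applied one after another, so this relies on the tokens not occurring inside each other or among the replacement values, as is the case in this example.
    2. Split the result on commas using str_split.
    3. Get the unique values. The result is not sorted here; wrap unique in sort() if the order matters.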
    library(stringr)
    library(tictoc)
    
    
    lapply(str_split(str_replace_all(x, lkp), fixed(",")), unique)
    
    #[[1]]
    #[1] "A" "B"
    
    #[[2]]
    #[1] "B"
    
    #[[3]]
    #[1] "A" "B"
    
    #[[4]]
    #[1] "A"
    

    On the bigger vector:

    tic("str_replace_all")
    step1 <- str_replace_all(xbig, lkp)
    toc()
    
    #str_replace_all: 1.288 sec elapsed
    
    tic("str_split")
    step2 <- str_split(step1, fixed(","))
    toc()
    
    #str_split: 1.328 sec elapsed
    
    tic("unique")
    step3 <- lapply(step2, unique)
    toc()
    #unique: 5.955 sec elapsed
    

    The timing is better than what you currently have. I think steps 2 and 3 could be combined into a single step using a regex, without splitting the string, but I can't come up with a suitable regex right now.
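
Another way to cut the per-element cost is to run the slow step only once per distinct input string and then map the results back. This is a hedged sketch rather than a measured improvement: it pays off only if many elements of x repeat, which is true for xbig (sampled from four strings) but is an assumption about the real data. The names ux and res_u are made up for this illustration, and base R's strsplit stands in for str_split so the block runs without extra packages.

```r
x   <- c("a,b,c", "c", "b,c", "a,b")
lkp <- c(a = "A", b = "A", c = "B")

set.seed(1)
xbig <- x[sample(length(x), 1e5, TRUE)]

# Do the expensive split / lookup / unique / sort on distinct strings only
ux    <- unique(xbig)
res_u <- lapply(strsplit(ux, ",", fixed = TRUE),
                function(t) sort(unique(lkp[t])))

# Map every element of xbig to the result computed for its distinct string
res <- res_u[match(xbig, ux)]
```

The final list has one entry per element of xbig, in the original order, and keeps the sorting that the question asked for.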