I have a large vector x (~200k elements) where each element is a comma-separated string. I also have a small lookup table lkp, which maps old strings to new strings. For the sake of this example, let's assume that this is simply a named vector.
What I want to do is: split each element of x into its tokens, map each token through lkp, and keep the sorted unique results.
A rather straightforward implementation looks like this:
library(stringr)
x <- c("a,b,c", "c", "b,c", "a,b")
lkp <- c(a = "A", b = "A", c = "B")
tokens <- str_split(x, fixed(","))
lapply(tokens, \(t) sort(unique(lkp[t])))
# [[1]]
# [1] "A" "B"
# [[2]]
# [1] "B"
# [[3]]
# [1] "A" "B"
# [[4]]
# [1] "A"
I observed that while str_split is super fast, lapply may take quite some time:
library(tictoc)
xbig <- x[sample(length(x), 2e6, TRUE)]
tic("str_split")
tokens <- str_split(xbig, fixed(","))
toc()
# str_split: 0.89 sec elapsed
tic("lapply")
res <- lapply(tokens, \(t) sort(unique(lkp[t])))
toc()
# lapply: 60.33 sec elapsed
So I was wondering whether there is a smarter way of doing this that takes advantage of vectorization.
If I drop the uniqueness and sort requirements, a better way would be:
tic("split")
res <- split(lkp[unlist(tokens)], rep(seq_along(tokens), lengths(tokens)))
toc()
# split: 2.89 sec elapsed
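The split idea above can probably be extended to keep uniqueness and sorting without a per-element lapply. The following is only a sketch, not a benchmarked solution: it uses base strsplit instead of str_split so that it runs without extra packages, and it assumes that x is non-empty and that every token has an entry in lkp (otherwise the NA values would need separate handling). The names grp, val, ord and keep are made up for this illustration.

```r
x   <- c("a,b,c", "c", "b,c", "a,b")
lkp <- c(a = "A", b = "A", c = "B")

# Flatten all tokens into one long vector and remember which element of x
# each token came from.
tokens <- strsplit(x, ",", fixed = TRUE)
grp <- rep(seq_along(tokens), lengths(tokens))
val <- unname(lkp[unlist(tokens)])

# Sort once by (group, value); duplicates within a group are then adjacent.
ord <- order(grp, val)
g <- grp[ord]
v <- val[ord]

# Keep an entry if its group or its value differs from the previous entry.
n <- length(v)
keep <- c(TRUE, g[-1] != g[-n] | v[-1] != v[-n])

# Split back into one vector per element of x; the factor levels make sure
# no element of x is silently dropped.
res <- unname(split(v[keep], factor(g[keep], levels = seq_along(tokens))))
```

For the toy x this gives the same list as the lapply version. All the work before the final split is done on flat vectors, so the cost should be dominated by one order call and one split call, but that would need to be timed on the real data.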
So I was wondering (given how fast str_split is) whether there is maybe a regex solution which could benefit from stringr's speed?
We can use str_replace_all to leverage the named vector lkp to do the replacement in place, and then split the result with str_split.
library(stringr)
library(tictoc)
lapply(str_split(str_replace_all(x, lkp), fixed(",")), unique)
#[[1]]
#[1] "A" "B"
#[[2]]
#[1] "B"
#[[3]]
#[1] "A" "B"
#[[4]]
#[1] "A"
On the bigger vector:
tic("str_replace_all")
step1 <- str_replace_all(xbig, lkp)
toc()
#str_replace_all: 1.288 sec elapsed
tic("str_split")
step2 <- str_split(step1, fixed(","))
toc()
#str_split: 1.328 sec elapsed
tic("unique")
step3 <- lapply(step2, unique)
toc()
#unique: 5.955 sec elapsed
The timing is better than what you currently have. I think steps 2 and 3 could be solved in one step with a regex, without splitting the string, but I can't come up with a suitable regex right now.
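One possible direction for the regex mentioned above is to drop duplicate tokens while the data is still a string, using a backreference inside a lookahead. This is an untimed sketch: it uses base gsub with perl = TRUE rather than stringr, the vector step1_small is typed in by hand to match what str_replace_all(x, lkp) gives for the toy x, and the names dedup_pattern and deduped are made up for this illustration. Note that it keeps the last occurrence of each token and does not sort.

```r
# What str_replace_all(x, lkp) returns for the toy example, typed in by hand.
step1_small <- c("A,A,B", "B", "A,B", "A,A")

# Remove a token (and its trailing comma) whenever the same whole token
# occurs again later in the string:
#   (?<![^,])   the match starts at the beginning of a token
#   ([^,]+),    capture the token and consume the comma after it
#   (?=...)     look ahead over zero or more whole tokens, then require
#               the captured token followed by a comma or the string end
dedup_pattern <- "(?<![^,])([^,]+),(?=(?:[^,]*,)*\\1(?:,|$))"
deduped <- gsub(dedup_pattern, "", step1_small, perl = TRUE)

# A single split then gives one vector of unique tokens per element.
res_regex <- strsplit(deduped, ",", fixed = TRUE)
```

This replaces the lapply(step2, unique) step with one vectorized gsub call, at the price of a lookahead that rescans the rest of each string. Whether that is actually faster for long strings would have to be measured.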