I have been using stringr
since it's supposed to be faster, but I found out today that it's much slower when dealing with factor terms. I didn't see any warning that this would be the case nor why it is.
For example:
string_options = c("OneWord", "TwoWords", "ThreeWords")
sample_chars = sample(string_options, 1e6, replace = TRUE)
sample_facts = as_factor(sample_chars)
When working with character
terms, base R is slower than stringr
, as expected. But when dealing with factor
terms, base R is like 30x faster.
bench::mark(
base_chars = grepl("Two", sample_chars),
stringr_chars = str_detect(sample_chars, "Two"),
base_facts = grepl("Two", sample_facts),
stringr_facts = str_detect(sample_facts, "Two")
)
# A tibble: 4 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
#1 base_chars 116.1ms 116.38ms 8.58 3.81MB 0 5 0 583ms <lgl [1,000,000]> <Rprofmem [1 × 3]> <bench_tm [5]> <tibble>
#2 stringr_chars 86.04ms 88.2ms 11.3 3.81MB 0 6 0 532ms <lgl [1,000,000]> <Rprofmem [2 × 3]> <bench_tm [6]> <tibble>
#3 base_facts 3.59ms 3.65ms 271. 11.44MB 0 136 0 501ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [136]> <tibble>
#4 stringr_facts 90.71ms 91.29ms 10.9 11.44MB 0 6 0 549ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [6]> <tibble>
It looks like stringr
isn't doing anything different with factor
terms but base R is significantly optimizing it. Is this expected behaviour? Should I report this as a stringr
issue? Is there some stringr
setting I'm completely missing? I'd like to not have to think about the format of the data to determine if I'm using stringr
or base R.
As MrFlick pointed out in the comments, this occurs because factors are a special case in grepl()
where it matches on the levels, rather than the entire vector. You have a vector with a million elements but only three unique values, so this is a lot less work.
I don't think stringr
has an argument to tell it that you are dealing with data with many repeated values. However, you can write a function that applies the grepl()
logic to factors with stringr
:
str_detect_factor <- function(fct, pattern, negate = FALSE) {
out <- stringr::str_detect(
c(levels(fct), NA_character_), pattern, negate
)
outna <- out[length(out)]
out <- out[fct]
out[is.na(fct)] <- outna
out
}
This is marginally faster than grepl()
:
bench::mark(
base_facts = grepl("Two", sample_facts),
stringr_facts = str_detect(sample_facts, "Two"),
stringr_facts2 = str_detect_factor(sample_facts, "Two")
)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
# 1 base_facts 3.27ms 4.15ms 220. 11.4MB 112. 49 25 222ms
# 2 stringr_facts 182.1ms 182.44ms 5.44 11.4MB 0 3 0 552ms
# 3 stringr_facts2 3.33ms 3.75ms 252. 11.4MB 96.6 73 28 290ms
Your character vector also has many repeated values, so you can apply the same logic. This will not be as fast as with factors, where distinct values are already stored within the object, so you do not have to call unique()
. However, it should be faster than the existing approaches:
str_detect_rep <- function(string, pattern, negate = FALSE) {
unique_string <- c(unique(string), NA_character_)
out <- stringr::str_detect(
unique_string, pattern, negate
) |> setNames(unique_string)
outna <- out[length(out)]
out <- out[string]
out[is.na(string)] <- outna
unname(out)
}
We can see it is substantially faster then the base R or stringr
versions:
bench::mark(
base_chars = grepl("Two", sample_chars),
stringr_chars = str_detect(sample_chars, "Two"),
stringr_chars2 = str_detect_rep(sample_chars, "Two")
)
# A tibble: 3 × 9
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
# 1 base_chars 131.2ms 173.1ms 5.80 3.81MB 0 4 0 689.6ms
# 2 stringr_chars 213ms 217.5ms 4.60 3.81MB 2.30 2 1 434.9ms
# 3 stringr_chars2 37.3ms 37.3ms 26.8 42.33MB 295. 1 11 37.3ms
As your issue was that you did not want to think about whether you were dealing with character or factor data, you can create a generic:
str_detect2 <- function(string, ...) UseMethod("str_detect2")
str_detect2.default <- function(string, ...) str_detect_rep(string, ...)
str_detect2.factor <- function(string, ...) str_detect_factor(string, ...)
Then to benchmark:
bench::mark(
base_facts = grepl("Two", sample_facts),
stringr_facts = str_detect(sample_facts, "Two"),
str_detect2_facts = str_detect2(sample_facts, "Two"),
base_chars = grepl("Two", sample_chars),
stringr_chars = str_detect(sample_chars, "Two"),
str_detect2_chars = str_detect2(sample_chars, "Two")
)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
# 1 base_facts 3.77ms 6.3ms 154. 11.44MB 40.3 42 11 273ms
# 2 stringr_facts 197.82ms 207.1ms 4.65 11.44MB 0 3 0 645ms
# 3 str_detect2_facts 3.55ms 4.8ms 190. 11.44MB 43.5 70 16 368ms
# 4 base_chars 121.17ms 125.8ms 7.47 3.81MB 0 4 0 536ms
# 5 stringr_chars 215.54ms 232.9ms 4.36 3.81MB 0 3 0 689ms
# 6 str_detect2_chars 27.5ms 29.2ms 31.2 42.33MB 85.8 4 11 128ms
Now you can just call str_detect2()
and not have to think about it - it is faster here character and factor data: