Search code examples
rstringr

Base R grep-family is much faster than `stringr` variants when dealing with factors


I have been using stringr since it's supposed to be faster, but I found out today that it's much slower when dealing with factor terms. I didn't see any warning that this would be the case nor why it is.

For example:

string_options = c("OneWord", "TwoWords", "ThreeWords")

sample_chars = sample(string_options, 1e6, replace = TRUE)
sample_facts = as_factor(sample_chars)

When working with character terms, base R is slower than stringr, as expected. But when dealing with factor terms, base R is like 30x faster.

bench::mark(
    base_chars = grepl("Two", sample_chars),
    stringr_chars = str_detect(sample_chars, "Two"),
    base_facts = grepl("Two", sample_facts),
    stringr_facts = str_detect(sample_facts, "Two")
)

# A tibble: 4 × 13
#  expression         min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result            memory             time             gc      
#  <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>            <list>             <list>           <list>  
#1 base_chars     116.1ms 116.38ms      8.58    3.81MB        0     5     0      583ms <lgl [1,000,000]> <Rprofmem [1 × 3]> <bench_tm [5]>   <tibble>
#2 stringr_chars  86.04ms   88.2ms     11.3     3.81MB        0     6     0      532ms <lgl [1,000,000]> <Rprofmem [2 × 3]> <bench_tm [6]>   <tibble>
#3 base_facts      3.59ms   3.65ms    271.     11.44MB        0   136     0      501ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [136]> <tibble>
#4 stringr_facts  90.71ms  91.29ms     10.9    11.44MB        0     6     0      549ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [6]>   <tibble>

It looks like stringr isn't doing anything different with factor terms but base R is significantly optimizing it. Is this expected behaviour? Should I report this as a stringr issue? Is there some stringr setting I'm completely missing? I'd like to not have to think about the format of the data to determine if I'm using stringr or base R.


Solution

  • It is quicker to search unique values than all elements

    As MrFlick pointed out in the comments, this occurs because factors are a special case in grepl() where it matches on the levels, rather than the entire vector. You have a vector with a million elements but only three unique values, so this is a lot less work.

    I don't think stringr has an argument to tell it that you are dealing with data with many repeated values. However, you can write a function that applies the grepl() logic to factors with stringr:

    str_detect_factor <- function(fct, pattern, negate = FALSE) {
        out <- stringr::str_detect(
            c(levels(fct), NA_character_), pattern, negate
        )
        outna <- out[length(out)]
        out <- out[fct]
        out[is.na(fct)] <- outna
        out
    }
    

    This is marginally faster than grepl():

    bench::mark(
        base_facts = grepl("Two", sample_facts),
        stringr_facts = str_detect(sample_facts, "Two"),
        stringr_facts2 = str_detect_factor(sample_facts, "Two")
    )
    
    #   expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
    #   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
    # 1 base_facts       3.27ms   4.15ms    220.      11.4MB    112.     49    25      222ms
    # 2 stringr_facts   182.1ms 182.44ms      5.44    11.4MB      0       3     0      552ms
    # 3 stringr_facts2   3.33ms   3.75ms    252.      11.4MB     96.6    73    28      290ms
    

    Applying this optimisation to character vectors

    Your character vector also has many repeated values, so you can apply the same logic. This will not be as fast as with factors, where distinct values are already stored within the object, so you do not have to call unique(). However, it should be faster than the existing approaches:

    str_detect_rep <- function(string, pattern, negate = FALSE) {
        unique_string <- c(unique(string), NA_character_)
        out <- stringr::str_detect(
            unique_string, pattern, negate
        ) |> setNames(unique_string)
        outna <- out[length(out)]
        out <- out[string]
        out[is.na(string)] <- outna
        unname(out)
    }
    

    We can see it is substantially faster then the base R or stringr versions:

    bench::mark(
        base_chars = grepl("Two", sample_chars),
        stringr_chars = str_detect(sample_chars, "Two"),
        stringr_chars2 = str_detect_rep(sample_chars, "Two")
    )
    
    # A tibble: 3 × 9
    #   expression          min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
    #   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
    # 1 base_chars      131.2ms  173.1ms      5.80    3.81MB     0        4     0    689.6ms
    # 2 stringr_chars     213ms  217.5ms      4.60    3.81MB     2.30     2     1    434.9ms
    # 3 stringr_chars2   37.3ms   37.3ms     26.8    42.33MB   295.       1    11     37.3ms
    

    Creating a generic

    As your issue was that you did not want to think about whether you were dealing with character or factor data, you can create a generic:

    str_detect2 <- function(string, ...) UseMethod("str_detect2")
    str_detect2.default <- function(string, ...) str_detect_rep(string, ...)
    str_detect2.factor <- function(string, ...) str_detect_factor(string, ...)
    

    Then to benchmark:

    bench::mark(
        base_facts = grepl("Two", sample_facts),
        stringr_facts = str_detect(sample_facts, "Two"),
        str_detect2_facts = str_detect2(sample_facts, "Two"),
        base_chars = grepl("Two", sample_chars),
        stringr_chars = str_detect(sample_chars, "Two"),
        str_detect2_chars = str_detect2(sample_chars, "Two")
    )
    
    #   expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
    #   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
    # 1 base_facts          3.77ms    6.3ms    154.     11.44MB     40.3    42    11      273ms
    # 2 stringr_facts     197.82ms  207.1ms      4.65   11.44MB      0       3     0      645ms
    # 3 str_detect2_facts   3.55ms    4.8ms    190.     11.44MB     43.5    70    16      368ms
    # 4 base_chars        121.17ms  125.8ms      7.47    3.81MB      0       4     0      536ms
    # 5 stringr_chars     215.54ms  232.9ms      4.36    3.81MB      0       3     0      689ms
    # 6 str_detect2_chars   27.5ms   29.2ms     31.2    42.33MB     85.8     4    11      128ms
    

    Now you can just call str_detect2() and not have to think about it - it is faster here character and factor data:

    enter image description here