Search code examples
rregexstrsplit

Greediness of strsplit and stri_extract_all_regex


I am trying to get the strsplit and stri_extract_all_regex to work consistently. Splitting by strsplit is the desired behavior (in practice I have a long string of alternative matches that is generated dynamically by the code).

For example

strsplit("1234567","(23)|(234)")
[[1]]
[1] "1"   "567"

vs

stri_extract_all_regex("1234567","(23)|(234)")
[[1]]
[1] "23"

The desired output from extraction is

[[1]]
[1] "234"

Solution

  • Regex engines: PCRE vs ERE vs ICU

    Base R uses Extended Regular Expressions (ERE) by default, or Perl Compatible Regular Expressions (PCRE) if specified with perl = TRUE. Conversely, the stringi docs state it:

    provides access to the regex engine implemented in ICU, which was inspired by Java’s util.regex in JDK 1.4

    As noted in Why the order matters in this regex with alternation, the order matters since that is the order which the Regex engine will try to match. This applies to the .NET regex engine in that question, and we can see below it also applies to PCRE and ICU, but not ERE, which is why you get a different result with strsplit() (unless you set perl = TRUE).

    library(stringi)
    x <- "1234567"
    pattern <- "(23)|(234)"
    
    # Base R (ERE) engine
    regmatches(x, gregexpr(pattern, x))  |> unlist()
    # [1] "234"
    
    # Base R (PCRE) engine
    regmatches(x, gregexpr(pattern, x, perl = TRUE))  |> unlist()
    # [1] "23"
    
    stri_extract_all_regex(x, pattern)  |> unlist()
    # [1] "23"
    

    As this answer to that question states:

    This is kind of like the || operator in C#. Once the left side matches the right side is ignored.

    So we need to switch the pattern so the preferred (i.e. longest) match is first: pattern <- "(234)|(23)".

    Arranging your pattern so the longest match is first

    As your pattern is dynamically generated, you can't just hardcode the order. However, you can write a function to sort it by the longest pattern first.

    switch_pattern_order  <- function(pattern){
        pattern_split  <- unlist(strsplit(pattern, "|", fixed = TRUE))
        pattern_sorted  <- pattern_split[order(nchar(pattern_split), decreasing = TRUE)]
        new_pattern  <- paste(pattern_sorted, collapse = "|")
        new_pattern
    }
    

    This assumes the only operator is "|" but it can be extended if there are others.

    This will now work with any of the regex engines:

    pattern  <- switch_pattern_order(pattern) # "(234)|(23)"
    regmatches(x, gregexpr(pattern, x))  |> unlist()
    # [1] "234"
    
    regmatches(x, gregexpr(pattern, x, perl = TRUE))  |> unlist()
    # [1] "234"
    
    stri_extract_all_regex(x, pattern)  |> unlist()
    # [1] "234"