I am trying to get the strsplit
and stri_extract_all_regex
to work consistently. Splitting by strsplit is the desired behavior (in practice I have a long string of alternative matches that is generated dynamically by the code).
For example
strsplit("1234567","(23)|(234)")
[[1]]
[1] "1" "567"
vs
stri_extract_all_regex("1234567","(23)|(234)")
[[1]]
[1] "23"
The desired output from extraction is
[[1]]
[1] "234"
Base R uses Extended Regular Expressions (ERE) by default, or Perl Compatible Regular Expressions (PCRE) if specified with perl = TRUE
. Conversely, the stringi
docs state it:
provides access to the regex engine implemented in ICU, which was inspired by Java’s
util.regex
in JDK 1.4
As noted in Why the order matters in this regex with alternation, the order matters since that is the order which the Regex engine will try to match. This applies to the .NET regex engine in that question, and we can see below it also applies to PCRE and ICU, but not ERE, which is why you get a different result with strsplit()
(unless you set perl = TRUE
).
library(stringi)
x <- "1234567"
pattern <- "(23)|(234)"
# Base R (ERE) engine
regmatches(x, gregexpr(pattern, x)) |> unlist()
# [1] "234"
# Base R (PCRE) engine
regmatches(x, gregexpr(pattern, x, perl = TRUE)) |> unlist()
# [1] "23"
stri_extract_all_regex(x, pattern) |> unlist()
# [1] "23"
As this answer to that question states:
This is kind of like the || operator in C#. Once the left side matches the right side is ignored.
So we need to switch the pattern so the preferred (i.e. longest) match is first: pattern <- "(234)|(23)"
.
As your pattern is dynamically generated, you can't just hardcode the order. However, you can write a function to sort it by the longest pattern first.
switch_pattern_order <- function(pattern){
pattern_split <- unlist(strsplit(pattern, "|", fixed = TRUE))
pattern_sorted <- pattern_split[order(nchar(pattern_split), decreasing = TRUE)]
new_pattern <- paste(pattern_sorted, collapse = "|")
new_pattern
}
This assumes the only operator is "|"
but it can be extended if there are others.
This will now work with any of the regex engines:
pattern <- switch_pattern_order(pattern) # "(234)|(23)"
regmatches(x, gregexpr(pattern, x)) |> unlist()
# [1] "234"
regmatches(x, gregexpr(pattern, x, perl = TRUE)) |> unlist()
# [1] "234"
stri_extract_all_regex(x, pattern) |> unlist()
# [1] "234"