
Does stringr's regex engine translate [a-z] into abcdefghijklmnopqrstuvwxyz?


Please correct me if I'm wrong, but the pattern [a-z] should match any lowercase character from a to z inclusive, i.e. [a-z] == [abcdefghijklmnopqrstuvwxyz].

pattern <- "[a-z]"

stringr::str_detect(c("word", "12345"), pattern)
[1] TRUE FALSE

Is it the case that somewhere 'under the hood' [a-z] gets translated to [abcdefghijklmnopqrstuvwxyz], or does the engine simply check each character against the range using some numeric ordering?


Solution

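  • tl;dr don't worry about this too much; use [:alpha:] instead, which is guaranteed to match all alphabetic characters and is considered best practice. (Note that [:alpha:] is broader than [a-z]: it also matches uppercase and non-ASCII letters; [:lower:] is the closer equivalent.)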
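
    A minimal sketch of how these classes differ in practice. The sample vector x is made up for illustration, and the double-bracket form [[:alpha:]] is used here because it is also valid in base R:

    ```r
    library(stringr)

    # Illustrative inputs: ASCII lowercase, ASCII uppercase,
    # e-acute (U+00E9, a non-ASCII lowercase letter), and digits
    x <- c("word", "WORD", "\u00e9", "12345")

    ascii_lower <- str_detect(x, "[a-z]")        # TRUE FALSE FALSE FALSE
    any_letter  <- str_detect(x, "[[:alpha:]]")  # TRUE TRUE TRUE FALSE
    any_lower   <- str_detect(x, "[[:lower:]]")  # TRUE FALSE TRUE FALSE
    ```

    Which class is appropriate depends on whether non-ASCII or uppercase letters should count as matches.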

    @benson23's answer is good, but note that stringr uses the ICU engine (via the stringi package), documented here, which is different from the implementation used by base R (which uses TRE, or PCRE if perl = TRUE): see e.g. this answer.

    The ICU documentation pointed to above says of ranges:

    The characters to include are determined by Unicode code point ordering

    So presumably, under the hood, it converts each character to its Unicode code point and tests whether that value falls within the range, rather than enumerating every character in the range.
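
    The comparison described above can be sketched in base R. This is only an illustration of the idea, not ICU's actual implementation; in_range is a hypothetical helper, not part of stringr or stringi:

    ```r
    # Hypothetical helper: is the code point of `ch` between those of `lo` and `hi`?
    # utf8ToInt() returns the Unicode code point(s) of a string as integers.
    in_range <- function(ch, lo, hi) {
      cp <- utf8ToInt(ch)
      cp >= utf8ToInt(lo) && cp <= utf8ToInt(hi)
    }

    utf8ToInt("a")                 # 97
    utf8ToInt("z")                 # 122

    in_range("w", "a", "z")        # TRUE  (code point 119)
    in_range("W", "a", "z")        # FALSE (code point 87)
    in_range("\u00e9", "a", "z")   # FALSE (e-acute, code point 233)
    ```

    Under this reading, [a-z] is just a pair of integer bounds, so no enumeration of the 26 letters is needed.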

    Since Unicode code points are independent of locale (I'm emphasizing this because I just figured it out myself), range definition, unlike sorting/collation, is locale-independent. (This is consistent with this answer about base-R regex range matching ...)

    Sys.setlocale(category = "LC_COLLATE", locale = "et_EE")
    [1] "et_EE"
    stringr::str_detect("T", "[A-Z]")
    [1] TRUE
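
    To see the contrast with collation without changing the system locale, here is a hedged sketch using the locale argument of stringr::str_sort(). The input vector is made up for illustration. In the Estonian alphabet Z comes before T, so the "et" sort would be expected to differ from the "en" one, while the range match does not depend on locale at all:

    ```r
    library(stringr)

    letters_tz <- c("Z", "T")

    # Collation is locale-dependent: compare the two orderings
    sorted_en <- str_sort(letters_tz, locale = "en")
    sorted_et <- str_sort(letters_tz, locale = "et")
    print(sorted_en)
    print(sorted_et)

    # Range matching uses code points, so no locale is involved
    in_az <- str_detect("T", "[A-Z]")
    print(in_az)
    ```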
    

    For what it's worth, this extensive answer points out that most built-in regex implementations are not locale-specific (i.e., they behave like R's regex).