Difference between [:alpha:] class and [a-zA-Z]; Is [:alpha:] OS independent?


I had always heard that the [:alpha:] character class was equivalent to [A-Za-z], but in the output below this does not seem to be the case, regardless of whether I use gsub (with or without perl = TRUE) or stringi. It seems [:alpha:] matches non-ASCII characters, but I may be misunderstanding. ?regex tells me:

Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.

But I still don't get the difference. To me [A-Za-z] matches exactly 52 characters, while [:alpha:] matches way more.

Questions

  1. What is the difference between the [:alpha:] class and [a-zA-Z]?
  2. What exactly will [:alpha:] match?
  3. Will [:alpha:] work the same across different operating systems and locations around the world?

x <- c(
    "danish characteøs  sentåment æcores words correctly 456",
    "It works with probleme but not with problème 234"
)

gsub("[[:alpha:] ]",  '', x)
## [1] "456" "234"

gsub("[a-zA-Z ]",  '', x)
## [1] "øåæ456" "è234"

stringi::stri_replace_all_regex(x, "[[:alpha:] ]",  '')
## [1] "456" "234"

stringi::stri_replace_all_regex(x, "[a-zA-Z ]",  '')
## [1] "øåæ456" "è234"

Solution

    1. [:alpha:] is the POSIX character class for alphabetic characters, in the same way that [:digit:] is the class for digits. It matches every character that the current locale classifies as a letter. [a-zA-Z], by contrast, is a pair of ranges: it matches any character that sorts between 'a' and 'z' or between 'A' and 'Z'. As @Charles Duffy noted, the ordering behind a range can differ between locales, so other characters can end up inside it. In a standard English UTF-8 locale, however, the ranges cover only the 52 unaccented English letters (26 lower case plus 26 upper case), so they do not include letters from other languages, e.g., é, ö, ï, etc.

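    2. [:alpha:] will match every character the locale treats as alphabetic. In a UTF-8 locale that includes accented Latin letters as well as letters from other scripts, such as the Greek and Chinese text in the output below.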

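    3. In practice, mostly yes: in a UTF-8 locale [:alpha:] matches alphabetic characters from any language, so the same pattern behaves consistently across languages, operating systems, and locations. It is not guaranteed, though. As the ?regex passage quoted in the question says, the interpretation of the named classes depends on the locale, so results may differ under a non-UTF-8 locale.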
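
The contrast in points 1 and 2 can be checked directly. The following is a minimal sketch, not part of the original answer: the names `printable`, `n_range`, `n_alpha`, and `accented` are made up for illustration, and `perl = TRUE` is used for the range only so that it is compared by code point regardless of locale.

```r
# Printable ASCII characters, code points 32..126, one string per character.
printable <- intToUtf8(32:126, multiple = TRUE)

# Count how many of those characters each pattern matches.
n_range <- sum(grepl("[a-zA-Z]", printable, perl = TRUE))
n_alpha <- sum(grepl("[[:alpha:]]", printable))

n_range  # 52
n_alpha  # 52: within ASCII the two patterns agree

# The difference only appears outside ASCII.
accented <- "\u00e8"  # "è", written as an escape so the string is marked UTF-8
grepl("[a-zA-Z]", accented, perl = TRUE)  # FALSE
grepl("[[:alpha:]]", accented)            # depends on the locale (assumption: TRUE in a UTF-8 locale, as in the question's output)
```

Running `Sys.getlocale("LC_CTYPE")` alongside this shows which locale the last call was evaluated under.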

    To give more context, the regex functions in R (grepl, regexpr, gregexpr, sub, and gsub, among others) follow the POSIX 1003.2 standard. Under that standard, matching is based on:

    the bit pattern used for encoding the character, not on the graphic representation of the character.
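
One way to see what "bit pattern, not graphic representation" means is to compare two encodings of a character that look identical on screen. This is an illustrative sketch only: `composed`, `decomposed`, and `strip_ascii_letters` are hypothetical names, and `perl = TRUE` is an assumption made so the range is compared by code point.

```r
# Two encodings of the same visible character "é".
composed   <- "\u00e9"   # one code point, U+00E9
decomposed <- "e\u0301"  # "e" followed by the combining acute accent U+0301

# Hypothetical helper: remove unaccented ASCII letters only.
strip_ascii_letters <- function(s) gsub("[a-zA-Z]", "", s, perl = TRUE)

utf8ToInt(strip_ascii_letters(composed))    # 233: U+00E9 is left untouched
utf8ToInt(strip_ascii_letters(decomposed))  # 769: the "e" is removed, the accent remains
```

Both inputs render as the same glyph, yet the pattern treats them differently because it sees different code points.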

    Below is how both patterns behave on text in several languages, in a session where Sys.getlocale(category = "LC_ALL") reports "en_GB.UTF-8":

    fr_chr <- "Voix ambiguë d’un cœur qui au zéphyr préfère les jattes de kiwi."
    ge_chr <- "Fix, Schwyz! quäkt Jürgen blöd vom Paß."
    gr_chr <- "Ταχίστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός."
    en_chr <- "Shaw, those twelve beige hooks are joined if I patch a young, gooey mouth."
    cn_chr <- "敏捷的棕色狐狸跨过懒狗"
    
    gsub("[[:alpha:]]", "", fr_chr)
    ## [1] "  ’         ."
    gsub("[[:alpha:]]", "", ge_chr)
    ## [1] ", !     ."
    gsub("[[:alpha:]]", "", gr_chr)
    ## [1] "    ,    ."
    gsub("[[:alpha:]]", "", en_chr)
    ## [1] ",           ,  ."
    gsub("[[:alpha:]]", "", cn_chr)
    ## [1] ""
    
    gsub("[A-Za-z]", "", fr_chr)
    ## [1] " ë ’ œ   é éè    ."
    gsub("[A-Za-z]", "", ge_chr)
    ## [1] ", ! ä ü ö  ß."
    gsub("[A-Za-z]", "", gr_chr)
    ## [1] "Ταχίστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός."
    gsub("[A-Za-z]", "", en_chr)
    ## [1] ",           ,  ."
    gsub("[A-Za-z]", "", cn_chr)
    ## [1] "敏捷的棕色狐狸跨过懒狗"