I had always heard that the [:alpha:] character class was equivalent to [A-Za-z], but in the output below this does not seem to be the case, regardless of whether I use gsub (with or without perl = TRUE) or stringi. It seems [:alpha:] matches non-ASCII characters, but I may be misunderstanding. Using ?regex tells me:
Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.
But I still don't get the difference. To me, [A-Za-z] matches exactly 52 characters, while [:alpha:] matches way more.
Questions
1. What is the difference between the [:alpha:] class and [a-zA-Z]?
2. What does [:alpha:] match?
3. Does [:alpha:] work the same across different operating systems and locations around the world?

x <- c(
"danish characteøs sentåment æcores words correctly 456",
"It works with probleme but not with problème 234"
)
gsub("[[:alpha:] ]", '', x)
## [1] "456" "234"
gsub("[a-zA-Z ]", '', x)
## [1] "øåæ456" "è234"
stringi::stri_replace_all_regex(x, "[[:alpha:] ]", '')
## [1] "456" "234"
stringi::stri_replace_all_regex(x, "[a-zA-Z ]", '')
## [1] "øåæ456" "è234"
[:alpha:] stands for "alphabetic characters", as opposed to, for example, [:digit:]. It matches every character that your locale and character encoding classify as a letter. [a-zA-Z], by contrast, is a pair of range expressions: it captures any character between 'a' and 'z', as well as between 'A' and 'Z'. As @Charles Duffy noted, the ordering of characters can differ between locales, so other characters can end up inside those ranges. In a standard English UTF-8 locale, however, the ranges include only the standard English letters (26 letters in lower and upper case, 52 in total), and thus exclude letters from other languages, e.g., é, ö, ï.
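As a rough sketch of the distinction above, the following checks use only ASCII input, so they should not depend on the locale. The variable names are purely illustrative.

```r
# The 52 ASCII letters: 26 lower case + 26 upper case
ascii_letters <- c(letters, LETTERS)
length(ascii_letters)

# Every ASCII letter is matched by the explicit ranges and by the named class
range_hits <- grepl("^[A-Za-z]$", ascii_letters)
class_hits <- grepl("^[[:alpha:]]$", ascii_letters)
all(range_hits)
all(class_hits)

# A digit belongs to [:digit:], not to [:alpha:]
digit_is_alpha <- grepl("[[:alpha:]]", "5")
digit_is_digit <- grepl("[[:digit:]]", "5")
```

The two patterns only start to disagree once the input contains letters outside the ASCII range, as in the question's examples.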
[:alpha:] will match all alphabetic characters.
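Mostly yes: since [:alpha:] matches alphabetic characters as a class rather than as a range, it should behave the same across different languages, operating systems and locations when working with UTF-8 text. Keep in mind, though, that ?regex says the interpretation of the named classes depends on the locale, so results may differ in a session that runs in a non-UTF-8 locale.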
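If you want to see which locale your own session uses for character classification, a minimal check could look like this. It only reports the settings and makes no assumption about what they are on your machine.

```r
# Locale category that governs character classification
ctype <- Sys.getlocale("LC_CTYPE")
ctype

# Summary of the current session: multi-byte? UTF-8? Latin-1?
info <- l10n_info()
info[["UTF-8"]]
```

Comparing this output between two machines is a quick way to rule the locale in or out when the same pattern gives different results.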
To give more context, the regular expression engine that R uses by default (in grepl, regexpr, gregexpr, sub and gsub, among others) follows the POSIX 1003.2 standard. This means matching is based on:
the bit pattern used for encoding the character, not on the graphic representation of the character.
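To illustrate what "bit pattern, not graphic representation" can mean in practice, here is a small sketch using two encodings of what is displayed as the same letter é. The strings are written with \u escapes so the example does not depend on how the script file is saved; the variable names are illustrative.

```r
# Two encodings of what is displayed as the same letter
composed   <- "\u00e9"    # single code point U+00E9
decomposed <- "e\u0301"   # "e" followed by combining acute accent U+0301

identical(composed, decomposed)          # FALSE: different code points
cp_composed   <- utf8ToInt(composed)     # 233
cp_decomposed <- utf8ToInt(decomposed)   # 101 769

# [A-Za-z] removes the plain "e" and leaves the combining accent behind
leftover <- gsub("[A-Za-z]", "", decomposed)
utf8ToInt(leftover)
```

The regex sees the underlying code points, so text that looks identical on screen can match differently.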
Below are examples with sentences in different languages, run in a session where Sys.getlocale(category = "LC_ALL") reports "en_GB.UTF-8":
fr_chr <- "Voix ambiguë d’un cœur qui au zéphyr préfère les jattes de kiwi."
ge_chr <- "Fix, Schwyz! quäkt Jürgen blöd vom Paß."
gr_chr <- "Ταχίστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός."
en_chr <- "Shaw, those twelve beige hooks are joined if I patch a young, gooey mouth."
cn_chr <- "敏捷的棕色狐狸跨过懒狗"
gsub("[[:alpha:]]","",fr_chr)
[1] " ’ ."
gsub("[[:alpha:]]","",ge_chr)
[1] ", ! ."
gsub("[[:alpha:]]","",gr_chr)
[1] " , ."
gsub("[[:alpha:]]","",en_chr)
[1] ", , ."
gsub("[[:alpha:]]","",cn_chr)
[1] ""
gsub("[A-Za-z]","",fr_chr)
[1] " ë ’ œ é éè ."
gsub("[A-Za-z]","",ge_chr)
[1] ", ! ä ü ö ß."
gsub("[A-Za-z]","",gr_chr)
[1] "Ταχίστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός."
gsub("[A-Za-z]","",en_chr)
[1] ", , ."
gsub("[A-Za-z]","",cn_chr)
[1] "敏捷的棕色狐狸跨过懒狗"
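If you need letter matching that does not lean on the locale at all, one option that is not part of the comparison above is the Unicode letter property \p{L} with perl = TRUE. This is only a sketch, and it assumes the PCRE library behind your R build was compiled with Unicode property support. The input is the vector from the question, written with \u escapes.

```r
x <- c(
  "danish characte\u00f8s sent\u00e5ment \u00e6cores words correctly 456",
  "It works with probleme but not with probl\u00e8me 234"
)

# \p{L} is the Unicode "letter" property in PCRE; it requires perl = TRUE
no_letters <- gsub("[\\p{L} ]", "", x, perl = TRUE)
no_letters
```

Under that assumption, only the digits should remain, matching the [[:alpha:] ] result shown in the question.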