Let's take the following string:
x <- " hello world"
I would like to extract the first word. To do so, I am using the following regex ^\\W*([a-zA-Z]+).*
with a back-reference to the first group.
> gsub("^\\W*([a-zA-Z]+).*", "\\1", x)
[1] "hello"
It works as expected.
Now, let's add a digit and underscore to our string:
x <- " 0_hello world"
I replace \\W with [\\W_0-9] to match the new characters.
> gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x)
[1] " 0_hello world"
Now it doesn't work, and I do not understand why. The problem seems to arise when \\W is placed inside [], but I am not sure why.
The regex does work in an online regex tester using PCRE, though.
What am I doing wrong?
The quick solution is to use Perl-like regular expressions by adding the argument perl = TRUE.
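Applied to the example from the question, this could look as follows. It is a minimal sketch using base R only; the variable name first_word is just for illustration.

```r
x <- " 0_hello world"

# perl = TRUE switches gsub() to the PCRE engine, where \W is
# recognised as a class escape even inside a bracket expression.
first_word <- gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x, perl = TRUE)
print(first_word)
# [1] "hello"
```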
By default, grep and related functions such as gsub use Extended Regular Expressions (see ?regex), where character classes are written in the form [:xxx:]. However, I could not find a character class that matches \W exactly.
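If you prefer to stay with the default engine, one possible workaround is a negated POSIX class. As far as I can tell from ?regex, \W is described there as the negation of [[:alnum:]_], so "non-word characters, underscore, or digits" amounts to "anything that is not a letter". The sketch below assumes that reading and that [[:alpha:]] is an acceptable substitute for [a-zA-Z] in your locale.

```r
x <- " 0_hello world"

# Default (TRE) engine, no perl = TRUE:
# skip every leading non-letter with a negated POSIX class,
# then capture the first run of letters.
first_word <- gsub("^[^[:alpha:]]*([[:alpha:]]+).*", "\\1", x)
print(first_word)
# [1] "hello"
```

Note that [[:alpha:]] is locale-dependent and may match accented letters, which [a-zA-Z] would not.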