R - regex: \W metacharacter not working when within square brackets


Let's take the following string:

x <- " hello world"

I would like to extract the first word. To do so, I am using the following regex ^\\W*([a-zA-Z]+).* with a back-reference to the first group.

> gsub("^\\W*([a-zA-Z]+).*", "\\1", x)
[1] "hello"

It works as expected.

Now, let's add a digit and underscore to our string:

x <- " 0_hello world"

I replace \\W with [\\W_0-9] to match the new characters.

> gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x)
[1] " 0_hello world"

Now it doesn't work, and I do not understand why. The problem seems to arise when \\W is placed within [], but I am not sure why. The regex does work in an online regex tester using PCRE, though.

What am I doing wrong?


Solution

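  • The quick solution is to switch to Perl-compatible regular expressions by adding the argument perl = TRUE.

    By default, grep and its relatives (including gsub) use Extended Regular Expressions (see ?regex), where character classes inside brackets are written in the form [:xxx:]. Inside a bracket expression, the default engine does not appear to treat \W as a class shorthand, so [\\W_0-9] fails to match the leading space and the string is returned unchanged. I could not find a [:xxx:] character class that matches \W exactly.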
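As a minimal sketch of the two options, the first call below only adds perl = TRUE to the pattern from the question. The second call is an assumption on my part rather than an exact equivalent of \W: it stays with the default engine and skips every leading non-letter using the POSIX class [:alpha:], which is locale-dependent and so may accept more than a-zA-Z. The variable names are only for illustration.

```r
x <- " 0_hello world"

# Option 1: PCRE engine, where \W is honoured inside square brackets.
first_word_perl <- gsub("^[\\W_0-9]*([a-zA-Z]+).*", "\\1", x, perl = TRUE)
first_word_perl
# [1] "hello"

# Option 2 (assumed alternative, default engine): skip any leading
# characters that are not letters, then capture the first run of letters.
first_word_posix <- gsub("^[^[:alpha:]]*([[:alpha:]]+).*", "\\1", x)
first_word_posix
# [1] "hello"
```

Option 2 is not a general replacement for [\\W_0-9]; it works here because everything before the first word is a non-letter.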