
Named capture in regexp


I need to capture groups by name in regular expressions in R. I tested the code from the post "[Rd] Named capture in regexp", and the example works without problems. I am now trying to adapt that code to a simple regular expression:

(xxxx)(?<id>\w{4})(?<number>\d{5})

For more details see the code here

I tried to do it in R:

regex =  "(xxxx) (?<id>[0-9A-Za-z]{4}) (?<number>[0-9]{5})"
notable = "xxxxcn0700814"
regexpr(regex,notable,perl = TRUE)

and this was the output:

[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
attr(,"capture.start")
        id number
[1,] -1      -1   -1
attr(,"capture.length")
        id number
[1,] -1      -1   -1
attr(,"capture.names")
[1] ""        "id" "number"  

I can't see what the problem is, because this code is similar to the code on the web page.

Thanks in advance


Solution

  • If you want the whitespace in the PCRE pattern to serve only as formatting, use the (?x) inline modifier:

    regex =  "(?x)(xxxx) (?<id>[0-9A-Za-z]{4}) (?<number>[0-9]{5})"
              ^^^^
    

    See the R online demo

    If you want to match a literal space with this modifier, you will have to escape it or put it inside a character class. If you need to match any whitespace, use the \s shorthand.
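
    A minimal sketch of these three options, assuming a made-up test vector x and toy patterns that are not part of the original question:

    ```r
    # Hypothetical inputs: no space, one space, one tab between "a" and "b".
    x <- c("ab", "a b", "a\tb")

    # Under (?x), unescaped spaces in the pattern are ignored.
    ignored   <- grepl("(?x) ^ a b $", x, perl = TRUE)      # TRUE FALSE FALSE

    # An escaped space ("\\ " in an R string) matches a literal space.
    escaped   <- grepl("(?x) ^ a \\  b $", x, perl = TRUE)  # FALSE TRUE FALSE

    # A space inside a character class is also literal.
    in_class  <- grepl("(?x) ^ a [ ] b $", x, perl = TRUE)  # FALSE TRUE FALSE

    # \s matches any whitespace character, including the tab.
    any_space <- grepl("(?x) ^ a \\s b $", x, perl = TRUE)  # FALSE TRUE TRUE
    ```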

    If you do not need this "prettifying", just remove the spaces from your pattern. Without (?x) they are meaningful: they match literal spaces, which your input does not contain, hence the -1 result:

    regex =  "(xxxx)(?<id>[0-9A-Za-z]{4})(?<number>[0-9]{5})"
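
    Once the pattern matches, the named groups can be read from the capture.start and capture.length attributes shown in the question's output. This is only a sketch: get_group is a hypothetical helper name, not a base R function.

    ```r
    regex   <- "(xxxx)(?<id>[0-9A-Za-z]{4})(?<number>[0-9]{5})"
    notable <- "xxxxcn0700814"

    m <- regexpr(regex, notable, perl = TRUE)

    # Both attributes are matrices with one column per capture group;
    # named groups use their name as the column name.
    starts  <- attr(m, "capture.start")
    lengths <- attr(m, "capture.length")

    # Hypothetical helper: cut one named group out of the input string.
    get_group <- function(name) {
      first <- starts[, name]
      last  <- first + lengths[, name] - 1
      substring(notable, first, last)
    }

    id     <- get_group("id")      # "cn07"
    number <- get_group("number")  # "00814"
    ```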
    

    Note that with the (?x) modifier a literal # symbol must also be escaped, because an unescaped # starts a comment that runs to the end of the line. Also, whitespace inside character classes ([...]) is still treated as literal whitespace, and you can use (?#...) inline comments inside the PCRE pattern.
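
To illustrate the comment rules, here is a sketch of the same pattern spread over several lines with # comments, followed by a few toy checks. The toy patterns and inputs are assumptions made up for the example.

```r
# Multi-line pattern: under (?x), newlines and spaces are ignored and
# each # starts a comment that ends at the line break.
regex <- "(?x)
  (xxxx)                        # fixed prefix
  (?<id>     [0-9A-Za-z]{4} )   # four alphanumeric characters
  (?<number> [0-9]{5} )         # five digits
"
m       <- regexpr(regex, "xxxxcn0700814", perl = TRUE)
matched <- regmatches("xxxxcn0700814", m)   # "xxxxcn0700814"

toy <- c("a#b", "axyz")

# Escaped #: matched literally, so only "a#b" fits.
hash_escaped   <- grepl("(?x) ^ a \\# b $", toy, perl = TRUE)  # TRUE FALSE

# Unescaped #: the rest of the pattern is a comment, leaving just ^a.
hash_unescaped <- grepl("(?x) ^ a # b $", toy, perl = TRUE)    # TRUE TRUE

# (?#...) is an inline comment and needs no modifier.
inline_comment <- grepl("a(?#inline comment)b", "ab", perl = TRUE)  # TRUE
```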