
Is there any named group capture mechanism in R while dealing with regular expressions?


As a basic example consider the following data.frame:

df <- data.frame(
    colval = c(
        "line-01_tel=0000000001",
        "line-01_tel=0000000002",
        "line-01_tel=0000000003"
    )
)

Let's imagine that "0000000001", "0000000002", "0000000003" are telephone numbers that we want to extract using named group capture. In Python, here is how I would proceed:

import re


def main():
    test_lst = [
        "line-01_tel=0000000001",
        "line-01_tel=0000000002",
        "line-01_tel=0000000003"
    ]
    regexp = r"=(?P<telnum>\d+)$"
    prog = re.compile(regexp, re.IGNORECASE)
    for item in test_lst:
        result = prog.search(item)
        if result:
            print("telnum = {}".format(result.group("telnum")))


if __name__ == "__main__":
    main()

Is it possible to have the equivalent of r"=(?P<telnum>\d+)$" and result.group("telnum") indicated in the above code in R? In other words, is there any named group capture mechanism in R while dealing with regular expressions?

I checked the Strings chapter of the online book "R for Data Science". It covers functions such as str_match, str_sub, etc. for working with regular expressions, but I didn't see any example of named group capture.


Solution

  • The namedCapture package has that capability.

    library(namedCapture)
    str_match_named(df$colval, "(?P<telnum>\\d+)$")
    ##      telnum      
    ## [1,] "0000000001"
    ## [2,] "0000000002"
    ## [3,] "0000000003"
    

    Even without that package, this works in base R:

    m <- regexec("(?P<telnum>\\d+)$", df$colval, perl = TRUE)
    regmatches(df$colval, m)
    ## [[1]]
    ##                    telnum 
    ## "0000000001" "0000000001" 
    ##
    ## [[2]]
    ##                    telnum 
    ## "0000000002" "0000000002" 
    ##
    ## [[3]]
    ##                    telnum 
    ## "0000000003" "0000000003"
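
    To get the named group back as a plain character vector, one option is to read the capture attributes that regexpr attaches to its result when perl = TRUE. This is a minimal sketch rather than the only way to do it; the variable names x, idx, start, len, and telnum are illustrative and not part of the original answer.

```r
# Same strings as df$colval in the question
x <- c(
    "line-01_tel=0000000001",
    "line-01_tel=0000000002",
    "line-01_tel=0000000003"
)

# With perl = TRUE, regexpr records capture groups in the
# "capture.start", "capture.length" and "capture.names" attributes
m <- regexpr("=(?P<telnum>\\d+)$", x, perl = TRUE)

# Look up the column that belongs to the group named "telnum"
idx   <- match("telnum", attr(m, "capture.names"))
start <- attr(m, "capture.start")[, idx]
len   <- attr(m, "capture.length")[, idx]

# Cut the group out of each string; strings with no match become NA
telnum <- substring(x, start, start + len - 1L)
telnum[as.integer(m) == -1L] <- NA_character_

print(telnum)
## [1] "0000000001" "0000000002" "0000000003"
```

    Looking the group up through "capture.names" keeps the code tied to the group's name rather than its position, which is the closest base R analogue to result.group("telnum") in the Python version.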