Tags: r, regex, gsub, string-substitution, gsubfn

Which regular expression is more appropriate?


I am trying to make model output prettier by using pre-defined labels for my variables. I have a vector of variable names (a), a vector of labels (b), and a vector of model terms (c).

I need to find the names from (a) inside the terms in (c) and replace each one with the corresponding label from (b). I found this question, which introduced me to the function gsubfn from the gsubfn package. The function matches and replaces multiple strings in one call. However, when I followed their example, it did not work in my case:

library(gsubfn)

a <- c("ecog.ps", "resid.ds", "rx")
b <- c("ECOG-PS", "Residual Disease", "Treatment")
c <- c("ecog.psII", "rxt2", "ecog.psII:rxt2")

gsubfn("\\S+", setNames(as.list(b), a), c)
[1] "ecog.psII"      "rxt2"           "ecog.psII:rxt2"

If I use a specific pattern, then it works:

gsubfn("ecog.ps", setNames(as.list(b), a), c)
[1] "ECOG-PSII"      "rxt2"           "ECOG-PSII:rxt2"

So I guess my problem is the regular expression passed as the pattern argument of gsubfn. I checked this R-pub and Hadley's book on regular expressions, and \S+ seemed adequate. I also tried other regular expressions without success:

gsubfn("[:graph:]", setNames(as.list(b), a), c)
[1] "ecog.psII"      "rxt2"           "ecog.psII:rxt2"

gsubfn("[:print:]", setNames(as.list(b), a), c)
[1] "ecog.psII"      "rxt2"           "ecog.psII:rxt2"

Which pattern should be used in the function gsubfn to match the vectors (a) and (c) and replace (a) by (b)?


Solution

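  • When the replacement is a list, gsubfn looks up each matched string among the list names. The \S+ pattern matches the whole of ecog.psII, rxt2 and ecog.psII:rxt2, and the list has no items with those names, so nothing is replaced. Instead, you can build the pattern dynamically from the a vector so that each match is exactly one of the names in the list.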

    Use

    pat <- paste(a, collapse="|")
    ## Or, if the names can contain regex metacharacters, escape them first (the . in these names is one)
    pat <- paste(gsub("([][/\\\\^$*+?.()|{}-])", "\\\\\\1", a), collapse="|")
    ## => ecog\.ps|resid\.ds|rx
    

    and then use

    gsubfn(pat, setNames(as.list(b), a), c)
    

    If you do not escape special characters, three things can go wrong: the pattern may overmatch (an unescaped . matches any character), it may match the wrong strings (if a name contains quantifiers or other regex operators), or it may raise an error (if a name contains characters such as (, ) or an unpaired [).
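
The points above can be checked with a small base-R sketch. It reuses a, b and c from the question; escape_regex and relabel are hypothetical helper names introduced only for this illustration, and relabel is a base-R stand-in for the list lookup that gsubfn performs, not gsubfn itself.

```r
a <- c("ecog.ps", "resid.ds", "rx")
b <- c("ECOG-PS", "Residual Disease", "Treatment")
c <- c("ecog.psII", "rxt2", "ecog.psII:rxt2")

# Hypothetical helper: backslash-escape regex metacharacters in each name.
escape_regex <- function(x) gsub("([][/\\\\^$*+?.()|{}-])", "\\\\\\1", x)

# Alternation of the escaped names: ecog\.ps|resid\.ds|rx
pat <- paste(escape_regex(a), collapse = "|")

# Why \S+ fails: each match is a whole term, and no term is a name in a.
whole_matches <- unlist(regmatches(c, gregexpr("\\S+", c)))
any_known <- any(whole_matches %in% a)          # FALSE

# Why escaping matters: an unescaped . also matches other characters.
loose  <- grepl("ecog.ps", "ecogXps")           # TRUE (overmatch)
strict <- grepl(escape_regex("ecog.ps"), "ecogXps")  # FALSE

# Hypothetical base-R stand-in for the lookup: replace each match
# with the label stored under the matched name.
relabel <- function(x, pat, labels) {
  m <- gregexpr(pat, x)
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(hit) unname(labels[hit]))
  x
}

relabelled <- relabel(c, pat, setNames(b, a))
print(relabelled)
# [1] "ECOG-PSII"             "Treatmentt2"           "ECOG-PSII:Treatmentt2"
```

Only the variable name is replaced, so the level suffixes that the model appended to each term (II, t2) remain attached to the label. Removing or separating them would need an additional step that is not covered here.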