Tags: r, regex, string-substitution

Extract text from inner-most nested parentheses of string


From each element of the character vector below, I am trying to extract a specific substring.

string <- c("(Intercept)", "scale(AspectCos_30)", "scale(CanCov_500)", 
            "scale(DST50_30)", "scale(Ele_30)", "scale(NDVI_Tin_250)", "scale(Slope_500)", 
            "I(scale(Slope_500)^2)", "scale(SlopeVar_30)", "scale(CanCov_1000)", 
            "scale(NDVI_Tin_1000)", "scale(Slope_1000)", "I(scale(Slope_1000)^2)", 
            "scale(log(SlopeVar_30 + 0.001))", "scale(CanCov_30)", "scale(Slope_30)", 
            "I(scale(Slope_30)^2)")

A good result would return the innermost variable name, without any special characters or the underscore suffix, as shown below.

Good <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "Slope",
          "SlopeVar", "CanCov", "NDVI", "Slope", "Slope", "SlopeVar", "CanCov", "Slope", "Slope")

Preferably, however, the result would account for the ^2 and log associated with 'Slope' and 'SlopeVar', respectively. Specifically, every string containing ^2 would be converted to 'SlopeSq' and every string containing log would be converted to 'SlopeVarPs', as shown below.

Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq",
          "SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov", "Slope", "SlopeSq")

I have a long, ugly, and inefficient code sequence that gets me nearly halfway to the Good result and would appreciate any suggestions.


Solution

  • I am not the most efficient coder, so I like to chain several regex replacements to reach the result (the comment on each line says what that regex does):

    library(stringr)
    library(dplyr)  # supplies the %>% pipe

    outcome <- string %>%
      str_replace_all(".*log\\((.*?)(_.+?)?\\).*", "\\1Ps") %>%  # "log" entry: keep the variable name, append "Ps"
      str_replace_all(".*\\((.*?\\))", "\\1") %>%                # drop everything up to and including the last "("
      str_replace_all("(_\\d+)?\\)\\^2", "Sq") %>%               # turn "_<digits>)^2" into "Sq"
      str_replace_all("(_.+)?\\)?", "")                          # strip what is left at the end (e.g. "_30" and ")")
    
    
    Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq",
              "SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov","Slope", "SlopeSq")
    all(outcome == Best)
    ## [1] TRUE
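
To make the chain easier to follow, here is a minimal sketch that applies the same four replacements one at a time to three representative inputs (a plain term, a squared term, and the logged term). The names `examples` and `step1` to `step4` are introduced here for illustration only and are not part of the original answer.

```r
library(stringr)

# Illustrative inputs: one plain, one squared, one logged term
examples <- c("(Intercept)",
              "I(scale(Slope_500)^2)",
              "scale(log(SlopeVar_30 + 0.001))")

# Step 1: the "log" entry collapses to its variable name plus "Ps"
step1 <- str_replace_all(examples, ".*log\\((.*?)(_.+?)?\\).*", "\\1Ps")
# "(Intercept)"  "I(scale(Slope_500)^2)"  "SlopeVarPs"

# Step 2: everything up to and including the last "(" is dropped
step2 <- str_replace_all(step1, ".*\\((.*?\\))", "\\1")
# "Intercept)"  "Slope_500)^2)"  "SlopeVarPs"

# Step 3: "_<digits>)^2" becomes "Sq"
step3 <- str_replace_all(step2, "(_\\d+)?\\)\\^2", "Sq")
# "Intercept)"  "SlopeSq)"  "SlopeVarPs"

# Step 4: any remaining "_..." suffix and closing ")" are removed
step4 <- str_replace_all(step3, "(_.+)?\\)?", "")
# "Intercept"  "SlopeSq"  "SlopeVarPs"
```

The order matters: the logged term has to be handled first, because its inner expression contains spaces and a `+` that the later, simpler patterns are not written to deal with.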