R: Regex madness (stringi)


I have a vector of strings that look like this:

G30(H).G3(M).G0(L).Replicate(1)

Iterating over c("H", "M", "L"), I would like to extract G30 (for "H"), G3 (for "M") and G0 (for "L").

My various attempts have left me confused: the regex101.com debugger, for example, indicates that (\w*)\(M\) works just fine, but transferring that to R fails ...


Solution

  • Using the stringi package and the outer() function:

    library(stringi)
    
    strings <- c(
      "G30(H).G3(M).G0(L).Replicate(1)",
      "G5(M).G11(L).G6(H).Replicate(9)",
      "G10(M).G6(H).G8(M).Replicate(200)"  # No "L", repeated "M"
    )
    targets  <- c("H", "M", "L")
    patterns <- paste0("\\w+(?=\\(", targets, "\\))")
    matches  <- outer(strings, patterns, FUN = stri_extract_first_regex)
    colnames(matches) <- targets
    matches
    #      H     M     L
    # [1,] "G30" "G3"  "G0"
    # [2,] "G6"  "G5"  "G11"
    # [3,] "G6"  "G10" NA
    

    This ignores any occurrence of a target letter after the first, gives NA when the target isn't found, and returns everything in a simple character matrix with one row per string and one column per target. Each regular expression in patterns matches the XX part of a substring like XX(Y), where Y is the target letter and XX is one or more word characters; the lookahead (?=...) requires (Y) to follow but keeps it out of the match.
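
As for why the pattern that works on regex101.com fails in R: the failing code isn't shown, so this is an assumption, but a likely cause is that the regex was pasted unchanged into an R string. R treats the backslash as an escape character inside string literals, so every regex backslash has to be doubled. A minimal sketch using the capture-group pattern from the question (the names x and m are just for illustration):

```r
library(stringi)

x <- "G30(H).G3(M).G0(L).Replicate(1)"

# The regex (\w*)\(M\) has to be written with doubled backslashes in R.
# stri_match_first_regex() returns a matrix: column 1 is the whole match,
# column 2 is the first capture group.
m <- stri_match_first_regex(x, "(\\w*)\\(M\\)")
m[1, 1]  # "G3(M)"
m[1, 2]  # "G3"

# The lookahead version from the answer gives the prefix directly,
# without needing a capture group.
stri_extract_first_regex(x, "\\w+(?=\\(M\\))")  # "G3"
```

The capture-group form is handy if you also want the full match; the lookahead form is shorter when only the prefix is needed.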