R: Regex madness (stringi)

I have a vector of strings that look like this:


Iterating over c("H", "M", "L"), I would like to extract G30 (for "H"), G3 (for "M") and G0 (for "L").

My various attempts have me confused: the debugger, for example, indicates that (\w*)\(M\) works just fine, but transferring that pattern to R fails ...


  • Using the stringi package and the outer() function:

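    library(stringi)

    strings <- c(
      # (the first two example strings are not shown in this copy)
      "G10(M).G6(H).G8(M).Replicate(200)"  # No "L", repeated "M"
    )
    targets  <- c("H", "M", "L")
    # One lookahead pattern per target letter, e.g. \w+(?=\(H\))
    patterns <- paste0("\\w+(?=\\(", targets, "\\))")
    # outer() pairs every string with every pattern
    matches  <- outer(strings, patterns, FUN = stri_extract_first_regex)
    colnames(matches) <- targets
    matches
    # With all three example strings:
    #      H     M     L
    # [1,] "G30" "G3"  "G0"
    # [2,] "G6"  "G5"  "G11"
    # [3,] "G6"  "G10" NA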

    This keeps only the first occurrence of each target letter in a string, gives you an NA when the target is not found, and returns everything in a simple matrix with one row per string and one column per target. The regular expressions stored in patterns match substrings of the form XX(Y), where Y is the target letter and XX is one or more word characters. Because (Y) sits inside a lookahead, only the XX part is returned.
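
The same lookahead idea can be sketched in base R, without stringi. This is only an illustration under a few assumptions: extract_first is a hypothetical helper name, not part of any package, and only the one example string shown above is used. It also shows the escaping involved: inside an R string literal every regex backslash has to be doubled, which is one plausible reason a pattern that works in an external regex debugger fails when pasted into R unchanged.

```r
# Base-R sketch of the lookahead extraction (assumption: one input string).
x       <- "G10(M).G6(H).G8(M).Replicate(200)"
targets <- c("H", "M", "L")

# "\\w" in R source is the regex \w; "\\(" is a literal "(".
patterns <- paste0("\\w+(?=\\(", targets, "\\))")

# Hypothetical helper: first match of `pattern` in each element of `x`, else NA.
extract_first <- function(x, pattern) {
  m   <- regexpr(pattern, x, perl = TRUE)  # lookahead needs perl = TRUE
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)         # fill in only the elements that matched
  out
}

result <- vapply(patterns, function(p) extract_first(x, p), character(1))
names(result) <- targets
print(result)
# H = "G6", M = "G10", L = NA
```

Printing patterns[2] with cat() shows the regex as the engine sees it, \w+(?=\(M\)), which is the form to compare against whatever an external debugger accepted.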