I have a vector of strings that look like this:
G30(H).G3(M).G0(L).Replicate(1)
Iterating over c("H", "M", "L"), I would like to extract G30 (for "H"), G3 (for "M") and G0 (for "L").
My various attempts have me confused: the regex101.com debugger, e.g., indicates that (\w*)\(M\) works just fine, but transferring that to R fails ...
Using the stringi package and the outer() function:
library(stringi)
strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G5(M).G11(L).G6(H).Replicate(9)",
  "G10(M).G6(H).G8(M).Replicate(200)" # No "L", repeated "M"
)
targets <- c("H", "M", "L")
# One pattern per target: word characters followed by "(<target>)" (lookahead)
patterns <- paste0("\\w+(?=\\(", targets, "\\))")
# outer() pairs every string with every pattern; the extractor is vectorized
matches <- outer(strings, patterns, FUN = stri_extract_first_regex)
colnames(matches) <- targets
matches
#      H     M     L
# [1,] "G30" "G3"  "G0"
# [2,] "G6"  "G5"  "G11"
# [3,] "G6"  "G10" NA
This ignores any instances of a target letter past the first, gives you an NA when the target's not found, and returns everything in a simple matrix. The regular expressions stored in patterns match substrings like XX(Y), where Y is the target letter and XX is one or more word characters. Because (Y) sits inside a lookahead, (?=...), it has to be present but is not part of the returned match, so only XX comes back.
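If you would rather stay in base R, the capture-group pattern from the question can be used as well. The sketch below assumes the failure came from backslash escaping (the question doesn't show the failing call, so that is a guess): inside an R string literal, every regex backslash has to be doubled. The helper name extract_group is made up for illustration, and the pattern uses \w+ rather than the question's \w* so that an empty prefix is not accepted.

```r
strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G5(M).G11(L).G6(H).Replicate(9)",
  "G10(M).G6(H).G8(M).Replicate(200)"
)
targets <- c("H", "M", "L")

# Hypothetical helper (not from any package): returns the first capture
# group for each string, or NA when the target letter is absent.
extract_group <- function(x, target) {
  # Regex (\w+)\(M\) becomes "(\\w+)\\(M\\)" as an R string literal
  pattern <- paste0("(\\w+)\\(", target, "\\)")
  hits <- regmatches(x, regexec(pattern, x))
  # Each element is c(full_match, group_1), or character(0) if no match
  vapply(
    hits,
    function(h) if (length(h) >= 2) h[2] else NA_character_,
    character(1)
  )
}

# One column per target letter, one row per input string
result <- sapply(targets, extract_group, x = strings)
result
```

This should give the same values as the stringi version; regexec() only reports the first match in each string, so the repeated "M" in the third string is ignored here too.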