Tags: r, string, special-characters, alphanumeric

extracting alphanumeric patterns from a character string whose values vary in R


My gratitude in advance for any help, and apologies for not being able to figure this out from other examples.

I have a vector containing file names such as:

    vec = c("Img_1_(set1)_2L4_s.ext", "Img_37_(set19)_2R4_s.ext",
            "Img_187_(set94)_4L4_s.ext", "Img_77_(set39)_4R2_s.ext")

I want to create two separate additional vectors by extracting:

1. The key letter (either L or R) between the numbers that go side-by-side, which vary from case to case. e.g., result: L,R,L,R

2. The "set" string, plus the number--which varies across cases--attached to it between brackets, with and without the brackets. e.g., result1: (set1), (set19), (set94), (set39); result2: set1, set19, set94, set39

Ideally using stringr, but I'm open to other (simpler?) libraries/functions.

For case 1, I tried str_extract(vec, "(?<= \\)_)[0-9]*") as a way to match the ")_" pattern followed by a number [0-9], but all I get in return are NAs (I think I'm not passing the ")" character correctly).

For case 2, I had to make do by simply extracting the set numbers with str_extract(vec, "(?<=set)[0-9]*") and then creating another variable by pasting the word "set" back on; obviously not ideal with large data frames.


Solution

  • The set pattern is nice and easy: the letters "set" followed by one or more digits, "[0-9]+".

    At least for your examples, it seems like the letters L and R don't show up anywhere else, so we can do a very simple pattern for them too, just look for an L or an R: "L|R".

    set = str_extract(vec, pattern = "set[0-9]+")
    main = str_extract(vec, pattern = "L|R")
    set
    # [1] "set1"  "set19" "set94" "set39"
    main
    # [1] "L" "R" "L" "R"
    

    If you're worried about potentially getting false hits on the L or R because they might show up elsewhere in the input, you could make the pattern more specific, for example looking behind for a number "(?<=[0-9])" and looking ahead for a number "(?=[0-9])":

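    # [LR] keeps both lookarounds attached to either letter; a bare L|R would
    # split the pattern into "(?<=[0-9])L" or "R(?=[0-9])"
    main2 = str_extract(vec, pattern = "(?<=[0-9])[LR](?=[0-9])")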
    main2
    # [1] "L" "R" "L" "R"
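
    One caveat when combining lookarounds with alternation: "|" has the lowest precedence in a regex, so an ungrouped "(?<=[0-9])L|R(?=[0-9])" reads as "an L preceded by a digit, or an R followed by a digit". A character class "[LR]" (or a group "(L|R)") applies both lookarounds to either letter. Here is a minimal sketch of the difference; the file name tricky and the variable names are made up for illustration and are not from the question:

    ```r
    library(stringr)

    # Hypothetical file name (an assumption, not from the question): it starts
    # with an "R" that is followed by a digit but not preceded by one.
    tricky <- "R2_(set5)_3L1.ext"

    # Ungrouped alternation: the second branch, R(?=[0-9]), matches the leading "R".
    ungrouped <- str_extract(tricky, "(?<=[0-9])L|R(?=[0-9])")
    ungrouped
    # [1] "R"

    # Character class: the letter must have a digit on both sides.
    grouped <- str_extract(tricky, "(?<=[0-9])[LR](?=[0-9])")
    grouped
    # [1] "L"
    ```

    On the four file names in the question both patterns return the same result; the difference only shows up when a stray L or R sits next to a digit on one side only.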
    

    And if you do want the parens with the set, you escape parens to include them in the pattern:

    set_with_paren = str_extract(vec, pattern = "\\(set[0-9]+\\)")
    set_with_paren
    # [1] "(set1)"  "(set19)" "(set94)" "(set39)"
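
    As for the NAs from the attempt in the question, the likely cause is the space inside the lookbehind rather than the escaping of ")": the pattern "(?<= \\)_)" requires a literal space before ")_", which never occurs in these file names. The sketch below shows this, and then one possible way to anchor on ")_" and still get the letter, using str_match() with a capture group. The variable names (attempt, no_space, side) are made up for illustration.

    ```r
    library(stringr)

    vec <- c("Img_1_(set1)_2L4_s.ext", "Img_37_(set19)_2R4_s.ext",
             "Img_187_(set94)_4L4_s.ext", "Img_77_(set39)_4R2_s.ext")

    # Original attempt: the space in the lookbehind is a literal character,
    # so the pattern looks for " )_", which is not present in any name.
    attempt <- str_extract(vec, "(?<= \\)_)[0-9]*")
    attempt
    # [1] NA NA NA NA

    # Without the space the lookbehind succeeds, but [0-9]* returns the digits
    # that follow ")_", not the letter.
    no_space <- str_extract(vec, "(?<=\\)_)[0-9]*")
    no_space
    # [1] "2" "2" "4" "4"

    # Anchor on ")_" and capture the letter instead. str_match() returns a
    # matrix: column 1 is the full match, column 2 the first capture group.
    side <- str_match(vec, "\\)_[0-9]+([LR])[0-9]+")[, 2]
    side
    # [1] "L" "R" "L" "R"
    ```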