
Regex: get all characters before the second letter


I have a character vector whose values are made up of digits and letters. I want to get all the characters before the LAST letter of each value (which is, I guess, always the 2nd letter of the value), preferably using stringr.

Example:

x = c("1H23456789H10", "97845784584H2", "0H987654321H0", "0P45454545A3", "63A00000000000A91")
str_extract_all(string = x, pattern = ????????)

I tried some tricks from this reference sheet: https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf

The result I want is:

"1H23456789"     instead of "1H23456789H10"
"97845784584"    instead of "97845784584H2"
"0H987654321"    instead of "0H987654321H0"
"0P45454545"     instead of "0P45454545A3"
"63A00000000000" instead of "63A00000000000A91"

Solution

  • str_extract(string = x, pattern = "[^A-Z]*[A-Z][^A-Z]*")
    # [1] "1H23456789"     "97845784584H2"  "0H987654321"    "0P45454545"
    # [5] "63A00000000000"
    

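    Explanation: we want to extract one match per input, so we use str_extract, not str_extract_all. The pattern reads as [^A-Z]* (any number of non-letters), followed by [A-Z] (exactly one letter), followed by [^A-Z]* (any number of non-letters), so the match stops just before the second letter. I only used capital letters because that is what your input contains, but you can change A-Z to A-Za-z inside the brackets to include lower-case letters. Note that a value with only one letter, such as "97845784584H2", has no second letter to stop at, so this pattern returns the whole string for it rather than "97845784584".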
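If what you need is literally "everything before the last letter" (which also covers values with a single letter), a possible alternative is sketched below. This is not part of the answer above, just one way to do it under the assumption that both upper- and lower-case letters count; the variable names before_last, before_last_removed and before_last_base are made up for illustration.

```r
library(stringr)

x <- c("1H23456789H10", "97845784584H2", "0H987654321H0",
       "0P45454545A3", "63A00000000000A91")

# Greedy .* backtracks to the last position that is followed by a letter,
# so the match ends just before the final letter of each string.
before_last <- str_extract(x, ".*(?=[A-Za-z])")

# Same idea from the other side: delete the last letter and everything
# after it (a letter followed only by non-letters up to the end).
before_last_removed <- str_remove(x, "[A-Za-z][^A-Za-z]*$")

# Base R version of the same removal, using sub().
before_last_base <- sub("[A-Za-z][^A-Za-z]*$", "", x)

print(before_last)
```

The variants differ when a value contains no letter at all: str_extract returns NA for it, while str_remove and sub leave the value unchanged. Pick whichever behaviour suits your data.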