r regex text-extraction stringr stringi

Extract only the characters between the opening and closing parentheses at the start and end of a string in R


I have many strings that all have the following format:

mystrings <- c(
  "(ABFUHIASH)THISISAVERYLONGSTRINGWITHOUTANYSPACES(ENDING)",
  "(SECONDSTR)YETANOTHERBORINGSTRINGWITHOUTSPACES(RANDOMENDING)", 
  "(JOWERIC)THISPARTSHOULDNOTBEEXTRACTED(GETTHIS)", 
  "(CAPTURETHIS)IOJSDOIOIADSNCXZZCX(IJFAI)"
)

I need to capture the strings that are inside parentheses both at the start and the end of the original mystrings.

The variable start should hold the text from the leading parentheses of each string, at the same index as in mystrings. The expected result is:

start[1]
ABFUHIASH

start[2]
SECONDSTR

start[3]
JOWERIC

start[4]
CAPTURETHIS

And similarly, the ending for each string in mystrings will be saved into end:

end[1]
ENDING

end[2]
RANDOMENDING

end[3]
GETTHIS

end[4]
IJFAI

Parentheses themselves should NOT be captured.

Is there a way/function to do this quickly in R?

I have tried stringr::word and stringi::stri_extract, but I am getting very strange results.


Solution

  • We can use the stringr package for this. For example:

    library(stringr)
    mm <- str_match(mystrings, "^\\(([^)]+)\\).*\\(([^)]+)\\)$")
    mm
    

    The pattern matches the text between the parentheses at the beginning and at the end of the string, and puts each in a capture group so it can be easily extracted.

    str_match returns a character matrix whose first column is the full match, and you seem to just want the 2nd and 3rd columns:

    mm[,2:3]

         [,1]          [,2]          
    [1,] "ABFUHIASH"   "ENDING"      
    [2,] "SECONDSTR"   "RANDOMENDING"
    [3,] "JOWERIC"     "GETTHIS"     
    [4,] "CAPTURETHIS" "IJFAI"
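To get the two separate vectors the question asks for, the columns of that matrix can be assigned directly. The following is a minimal sketch that repeats the question's data and the pattern from the answer so it runs on its own; the names start and end come from the question, and the sketch assumes every string really has a parenthesized part at both ends.

```r
library(stringr)

mystrings <- c(
  "(ABFUHIASH)THISISAVERYLONGSTRINGWITHOUTANYSPACES(ENDING)",
  "(SECONDSTR)YETANOTHERBORINGSTRINGWITHOUTSPACES(RANDOMENDING)",
  "(JOWERIC)THISPARTSHOULDNOTBEEXTRACTED(GETTHIS)",
  "(CAPTURETHIS)IOJSDOIOIADSNCXZZCX(IJFAI)"
)

# Column 1 is the full match; columns 2 and 3 are the two capture groups.
mm <- str_match(mystrings, "^\\(([^)]+)\\).*\\(([^)]+)\\)$")

# Text inside the leading parentheses, one element per input string.
start <- mm[, 2]

# Text inside the trailing parentheses.
# Note: this name shadows stats::end() in the current environment.
end <- mm[, 3]

start[3]
end[3]
```

A string that does not fit the pattern produces a row of NA values in mm, so checking anyNA(start) is a cheap way to spot malformed input.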