r, string, find, extract, capture

R - Extract String between two strings


I want to take a string variable containing a lot of text, search it for a match on an upper boundary ("UpperBoundary"), keep reading past that boundary until the next match on a lower boundary ("LowerBoundary"), and then return the text between those two boundaries.

For example, the upper boundary would be "Country":" and the lower boundary would be ", (a double quote followed by a comma).

This is a snip of what the text I'm dealing with looks like:

> }],"Country":"United States",
> }],"Country":"China",

So I want the results to come back:

> United States
> China

What code or function can do this? I've been searching for a long time and have tried numerous things (stri, grep, find, etc.), but I can't get anything to do what I'm looking for. Thank you for your help!


Solution

  • Here's a regex method, though as I mentioned in comments, since this text looks like JSON I'd strongly recommend parsing it with, e.g., the jsonlite package instead.

    # input:
    x = c('> }],"Country":"United States",', 
    '> }],"Country":"China",')
    
    library(stringr)
    result = str_extract(x, pattern = '(?<=Country":")[^"]+(?=",)')
    result
    # [1] "United States" "China" 
    

    Explanation:

    • (?<=...) is the look-behind pattern. It requires that Country":" appears immediately before the match, without including it in the result.
    • [^"]+ is our main pattern. ^ inside brackets means "not", so [^"] matches any character that is not a ". And + is the quantifier, so one or more non-" characters.
    • (?=...) is the look-ahead pattern. It requires that ", (a double quote followed by a comma) appears immediately after the match, again without including it in the result.
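
If you'd rather not pull in stringr, or your boundaries contain characters that are special in a regex, here is a minimal base R sketch of the same "text between two boundaries" idea. Note that it swaps the look-behind/look-ahead regex for plain fixed-string searches with regexpr(..., fixed = TRUE) and substring(). The helper name extract_between and its arguments upper and lower are my own, purely for illustration, and it only returns the first match in each element.

```r
# Hypothetical helper (not part of any package): return the text found
# between the first `upper` boundary and the next `lower` boundary in each
# element of `x`, or NA when either boundary is missing.
extract_between <- function(x, upper, lower) {
  # Position of the first occurrence of the upper boundary (-1 if absent).
  start <- as.integer(regexpr(upper, x, fixed = TRUE))
  # Keep only the text that follows the upper boundary.
  rest <- substring(x, start + nchar(upper))
  # Position of the lower boundary within that remaining text (-1 if absent).
  end <- as.integer(regexpr(lower, rest, fixed = TRUE))
  # Everything before the lower boundary is the captured text.
  out <- substring(rest, 1, end - 1)
  # Mark elements where a boundary was not found.
  out[start == -1 | end == -1] <- NA_character_
  out
}

x <- c('> }],"Country":"United States",',
       '> }],"Country":"China",')
extract_between(x, upper = '"Country":"', lower = '",')
# [1] "United States" "China"
```

Because both boundaries are matched literally, nothing needs escaping. If a string can contain several Country fields, this sketch will only see the first one, so something like stringr's str_extract_all with the regex above is probably a better fit there.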