Tags: r, regex, pattern-matching, parentheses

Regex: Extracting numbers from parentheses with multiple matches


How do I match the year in a way that works for both of the following examples?

a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b <- 'Þegar það gerist (1998/I) (TV)'

I have tried the following, without much success.

gsub('.+\\(([0-9]+.+\\)).?$', '\\1', a)

What I thought it did was scan until it finds a (, then capture a group of digits, then match any characters until it meets a ). If there are several matches, I want to extract the first group.

Any suggestions as to where I went wrong? I have been doing this in R.


Solution

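  • Your pattern contains greedy .+ parts, each of which matches one or more chars, as many as possible. The leading .+ therefore runs up to the last ( that is followed by a digit, and the .+ inside the group runs up to the last ) in the string, so at best the pattern captures the last parenthesized number together with the text after it, not the first year.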
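To make the greedy behaviour concrete, here is a minimal sketch that runs the attempted pattern from the question on both sample strings. `perl = TRUE` is an addition that is not in the original call; it is assumed here so that the backtracking follows PCRE rules, and the variable names are illustrative only.

```r
# Sample strings from the question
a <- '"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}'
b <- 'Þegar það gerist (1998/I) (TV)'

# The pattern from the question, unchanged
attempt <- '.+\\(([0-9]+.+\\)).?$'

# perl = TRUE is an assumption added for predictable (PCRE) backtracking
res_a <- gsub(attempt, '\\1', a, perl = TRUE)
res_b <- gsub(attempt, '\\1', b, perl = TRUE)

print(res_a)  # "399 B.C.) (#1.14)"  (the last "(digits", not the first year)
print(res_b)  # "1998/I) (TV)"       (the inner .+ runs to the last ")")
```

In both cases the year is either skipped or dragged along with trailing text, which is why the suggested pattern uses a lazy .*? and a fixed \d{4}.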

    You may use

    ^.*?\((\d{4})(?:/[^)]*)?\).*
    

    Replace with \1 to only keep the 4 digit number. See the regex demo.

    Details

    • ^ - start of string
    • .*? - any 0+ chars as few as possible
    • \( - a (
    • (\d{4}) - Group 1: four digits
    • (?: - start of an optional non-capturing group
      • / - a /
      • [^)]* - any 0+ chars other than )
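    • )? - end of the group; the ? makes the whole group optional
    • \) - a ) (it may be omitted only when a year is never followed by other text inside the same parentheses; the third string in the R demo, with "(1725 Version)", needs it)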
    • .* - the rest of the string.
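The role of the closing \) can be checked with a short sketch that compares the pattern with and without it on the third demo string. The names `with_close` and `without_close` are illustrative, and `perl = TRUE` is an assumption added so that \d and the lazy quantifier are handled by PCRE.

```r
x <- 'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)'

# Pattern as given in the answer, and a variant with the closing \) removed
with_close    <- "^.*?\\((\\d{4})(?:/[^)]*)?\\).*"
without_close <- "^.*?\\((\\d{4})(?:/[^)]*)?.*"

year_strict <- sub(with_close, "\\1", x, perl = TRUE)
year_loose  <- sub(without_close, "\\1", x, perl = TRUE)

print(year_strict)  # "1996": "(1725 Version)" is rejected because no ) follows the digits
print(year_loose)   # "1725": any "(" followed by four digits is accepted
```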

    See the R demo:

    a <- c('"You Are There" (1953) {The Death of Socrates (399 B.C.) (#1.14)}', 'Þegar það gerist (1998/I) (TV)', 'Johannes Passion, BWV. 245 (1725 Version) (1996) (V)')
    sub("^.*?\\((\\d{4})(?:/[^)]*)?\\).*", "\\1", a) 
    # => [1] "1953" "1998" "1996"
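One caveat with the sub approach is that a string without any match is returned unchanged. A possible guard, sketched below, checks for a match with grepl first and returns NA otherwise. The second input string is hypothetical, and `perl = TRUE` is an assumption rather than part of the original call.

```r
pattern <- "^.*?\\((\\d{4})(?:/[^)]*)?\\).*"

# Second element is a hypothetical title with no year in parentheses
x <- c('Þegar það gerist (1998/I) (TV)', 'No year here (TV)')

# Without a guard, the non-matching string comes back as-is
unguarded <- sub(pattern, "\\1", x, perl = TRUE)

# With a guard, non-matching strings become NA instead
years <- ifelse(grepl(pattern, x, perl = TRUE),
                sub(pattern, "\\1", x, perl = TRUE),
                NA_character_)

print(unguarded)  # "1998" "No year here (TV)"
print(years)      # "1998" NA
```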
    

    Another base R solution is to match the 4 digits after (:

    regmatches(a, regexpr("\\(\\K\\d{4}(?=(?:/[^)]*)?\\))", a, perl=TRUE))
    # => [1] "1953" "1998" "1996"
    

    The \(\K\d{4} part matches a ( and then drops it from the match with the \K match reset operator, so only the four digits are returned. The (?=(?:/[^)]*)?\)) lookahead then requires an optional / + 0+ chars other than ), followed by a ), without consuming them. Note that regexpr extracts only the first match in each string.
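If every year in a string is wanted rather than only the first, gregexpr can be swapped in for regexpr. The sketch below uses two hypothetical input strings, and it also shows that regmatches combined with regexpr silently drops elements that have no match.

```r
pattern <- "\\(\\K\\d{4}(?=(?:/[^)]*)?\\))"

# Hypothetical inputs: one with two years, one with none
x <- c('Remake (1953) (1996) (V)', 'No year here (TV)')

# regexpr: first match per string; strings without a match are dropped
first_only <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# gregexpr: all matches, returned as a list with one element per input string
all_years <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

print(first_only)      # "1953"  (length 1, although x has length 2)
print(all_years[[1]])  # "1953" "1996"
print(all_years[[2]])  # character(0)
```

Because the regexpr result can be shorter than the input, the list returned by the gregexpr variant is easier to line up with the original vector.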