Regular expression in R: Extracting the year from a string


I have a character vector with each element denoting the time of data collection. Unfortunately, the elements do not follow the same pattern:

"05.1990 - 06.1990, Poland"
"11.05.1990 - 13.07.1990, Portugal"
"1993 - 1993, Romania"

Is there a neat way, using regular expressions, to extract:

  1. The year when the data collection started (the four digits immediately before the dash)
  2. The year when the data collection ended (the four digits immediately before the comma)

If possible, I'd like to have two different regular expressions for (1) and (2).


Solution

  • You can do this with positive lookaheads, which assert that a pattern follows the match without including it in the result. Here's an example using {stringr}:

    x <- c(
      "05.1990 - 06.1990, Poland",
      "11.05.1990 - 13.07.1990, Portugal",
      "1993 - 1993, Romania"
    )
    
    # Start year: four digits followed by optional whitespace and a dash
    stringr::str_extract(x, "\\d{4}(?=\\s*-)")
    #> [1] "1990" "1990" "1993"
    
    # End year: four digits followed directly by a comma
    stringr::str_extract(x, "\\d{4}(?=,)")
    #> [1] "1990" "1990" "1993"
    

    Created on 2022-10-14 with reprex v2.0.2
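
If you'd rather not depend on {stringr}, the same two lookaheads should also work in base R, as long as you pass perl = TRUE (the default regex engine does not support lookaheads). This is only a minimal sketch; the names start_year and end_year are illustrative and not part of the original answer:

```r
x <- c(
  "05.1990 - 06.1990, Poland",
  "11.05.1990 - 13.07.1990, Portugal",
  "1993 - 1993, Romania"
)

# Start year: four digits followed by optional whitespace and a dash.
# perl = TRUE switches to the PCRE engine, which supports (?=...).
start_year <- regmatches(x, regexpr("\\d{4}(?=\\s*-)", x, perl = TRUE))

# End year: four digits followed directly by a comma.
end_year <- regmatches(x, regexpr("\\d{4}(?=,)", x, perl = TRUE))

print(start_year)
print(end_year)
```

One difference to keep in mind: regmatches() combined with regexpr() drops elements that have no match, so the result can be shorter than x, whereas str_extract() returns NA in those positions.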