
Parse a string, extract the two-digit year, and expand it to four-digit format


I have strings like

y1 <- "AB99"
y2 <- "04CD"
y3 <- "X90Z"
y4 <- "EF09"
y5 <- "12GH"

where I need to extract the two-digit year and expand it to a four-digit format. The years range from 1990 to 2020.

The output should be:

"1999"
"2004"
"1990"
"2009"
"2012"

I tried:

fun <- function(x) {
  year <- readr::parse_number(x)
  if(year < 50) year <- paste0("20", year) else year <- paste0("19", year)
  return(year)
}

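This works fine except for the years 2000 to 2009 (test cases y2 and y4), where the leading zero is lost: "04CD" is parsed as the number 4, so the result is "204" instead of "2004".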

Which functions would handle those years correctly as well?


Solution

  • Using some basic regex, you can remove everything that is not a digit and then use ifelse() to prefix "19" or "20" as appropriate. Because the digits stay a string, the leading zero in "04" and "09" is preserved:

    # Example data
    y <- c(
      y1 = "AB99",
      y2 = "04CD",
      y3 = "X90Z",
      y4 = "EF09",
      y5 = "12GH"
    )
    
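    # Strip every non-digit character; the result stays a string, so "04" keeps its zero
    num <- gsub("\\D", "", y)
    # Comparing as strings works here because every value has exactly two digits
    paste0(ifelse(num >= "90", "19", "20"), num)
    # [1] "1999" "2004" "1990" "2009" "2012"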

    Alternatively, working with integers (note that the result is then an integer vector rather than character strings):

    num <- as.integer(gsub("\\D", "", y)) # or as.integer(readr::parse_number(y))
    num + ifelse(num >= 90L, 1900L, 2000L)
    # [1] 1999 2004 1990 2009 2012
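
If you would rather keep a single function, as in the question, the integer approach can be wrapped up as in the minimal sketch below. The name `expand_year` and the `pivot` argument are hypothetical and chosen only for illustration; the default of 90 assumes the 1990 to 2020 range stated in the question. The first two lines show why the original attempt produced "204", and how zero-padding with `sprintf()` would also have fixed it.

```r
# Why the original attempt failed: the parsed number 4 has no leading zero
paste0("20", 4)                    # "204"
paste0("20", sprintf("%02d", 4L))  # "2004" (zero-padded to two digits)

# Hypothetical helper: `expand_year` and `pivot` are illustrative names,
# not part of base R or readr
expand_year <- function(x, pivot = 90L) {
  # Keep only the digits and convert to integer ("04" becomes 4L)
  yy <- as.integer(gsub("\\D", "", x))
  # Add the century numerically, so no leading zero can be lost
  yyyy <- yy + ifelse(yy >= pivot, 1900L, 2000L)
  # Return character strings, matching the output requested in the question
  as.character(yyyy)
}

expand_year(c("AB99", "04CD", "X90Z", "EF09", "12GH"))
# [1] "1999" "2004" "1990" "2009" "2012"
```

This is vectorized, so it can be applied to a whole column at once. It assumes each string contains exactly one two-digit number; inputs with additional digits would need a stricter pattern.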