Differing entries within a date column, objective being to preserve columns before removing. How to best clean a "date" column like this?

structure(list(year = c("Mar-10", "2014", "May-August", 
"2009/2010", "2015", NA_character_), date = c("August 31st, 2010", "March 13th, 2015", 
"May 31st, 2010", "June 16th, 2010", "May 18th, 2010", "April 7th, 2010")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

# # A tibble: 6 × 2
#   year       date             
#   <chr>      <chr>            
# 1 Mar-10     August 31st, 2010
# 2 2014       March 13th, 2015 
# 3 May-August May 31st, 2010   
# 4 2009/2010  June 16th, 2010  
# 5 2015       May 18th, 2010   
# 6 NA         April 7th, 2010

My goal is to preserve as many columns as possible before I start removing erroneous entries relating to column 1, hopefully by simplifying entries to a simple year value, as found in row 2 of this example set.

In the case of NA values, rather than removing, I would like to paste data from the next column.

Expected output:

# # A tibble: 6 × 2
#   year  date             
#   <chr> <chr>            
# 1 2010  August 31st, 2010
# 2 2014  March 13th, 2015 
# 3 2010  May 31st, 2010   
# 4 2010  June 16th, 2010  
# 5 2015  May 18th, 2010   
# 6 2010  April 7th, 2010

In plain english, if the field contains an acceptable value, such as "2014", leave as is. If it contains a value that still confirms a year, such as "Mar-10", use 2010. If the year is impossible to determine, such as the case for "May-August", "2009/2010", or NA value, use year from column Date instead.

Solution

You can use coalesce + str_extract:

library(dplyr)
library(stringr)

df %>%
  mutate(year = coalesce(str_extract(year, "^\\d{4}$"), str_extract(date, "\\d{4}")))

# # A tibble: 6 × 2
#   year  date             
#   <chr> <chr>            
# 1 2010  August 31st, 2010
# 2 2014  March 13th, 2015 
# 3 2010  May 31st, 2010   
# 4 2010  June 16th, 2010  
# 5 2015  May 18th, 2010   
# 6 2010  April 7th, 2010