Search code examples
rdplyrtidyverselubridate

Differing entries within a date column, objective being to preserve columns before removing. How to best clean a "date" column like this?


structure(list(year = c("Mar-10", "2014", "May-August", 
"2009/2010", "2015", NA_character_), date = c("August 31st, 2010", "March 13th, 2015", 
"May 31st, 2010", "June 16th, 2010", "May 18th, 2010", "April 7th, 2010")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

# # A tibble: 6 × 2
#   year       date             
#   <chr>      <chr>            
# 1 Mar-10     August 31st, 2010
# 2 2014       March 13th, 2015 
# 3 May-August May 31st, 2010   
# 4 2009/2010  June 16th, 2010  
# 5 2015       May 18th, 2010   
# 6 NA         April 7th, 2010 

My goal is to preserve as many columns as possible before I start removing erroneous entries relating to column 1, hopefully by simplifying entries to a simple year value, as found in row 2 of this example set.

In the case of NA values, rather than removing, I would like to paste data from the next column.

Expected output:
# # A tibble: 6 × 2
#   year  date             
#   <chr> <chr>            
# 1 2010  August 31st, 2010
# 2 2014  March 13th, 2015 
# 3 2010  May 31st, 2010   
# 4 2010  June 16th, 2010  
# 5 2015  May 18th, 2010   
# 6 2010  April 7th, 2010

In plain english, if the field contains an acceptable value, such as "2014", leave as is. If it contains a value that still confirms a year, such as "Mar-10", use 2010. If the year is impossible to determine, such as the case for "May-August", "2009/2010", or NA value, use year from column Date instead.


Solution

  • You can use coalesce + str_extract:

    library(dplyr)
    library(stringr)
    
    df %>%
      mutate(year = coalesce(str_extract(year, "^\\d{4}$"), str_extract(date, "\\d{4}")))
    
    # # A tibble: 6 × 2
    #   year  date             
    #   <chr> <chr>            
    # 1 2010  August 31st, 2010
    # 2 2014  March 13th, 2015 
    # 3 2010  May 31st, 2010   
    # 4 2010  June 16th, 2010  
    # 5 2015  May 18th, 2010   
    # 6 2010  April 7th, 2010