structure(list(year = c("Mar-10", "2014", "May-August",
"2009/2010", "2015", NA_character_), date = c("August 31st, 2010", "March 13th, 2015",
"May 31st, 2010", "June 16th, 2010", "May 18th, 2010", "April 7th, 2010")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
# # A tibble: 6 × 2
# year date
# <chr> <chr>
# 1 Mar-10 August 31st, 2010
# 2 2014 March 13th, 2015
# 3 May-August May 31st, 2010
# 4 2009/2010 June 16th, 2010
# 5 2015 May 18th, 2010
# 6 NA April 7th, 2010
My goal is to preserve as many columns as possible before I start removing erroneous entries relating to column 1, hopefully by simplifying entries to a simple year value, as found in row 2 of this example set.
In the case of NA values, rather than removing, I would like to paste data from the next column.
# # A tibble: 6 × 2
# year date
# <chr> <chr>
# 1 2010 August 31st, 2010
# 2 2014 March 13th, 2015
# 3 2010 May 31st, 2010
# 4 2010 June 16th, 2010
# 5 2015 May 18th, 2010
# 6 2010 April 7th, 2010
In plain english, if the field contains an acceptable value, such as "2014", leave as is. If it contains a value that still confirms a year, such as "Mar-10", use 2010. If the year is impossible to determine, such as the case for "May-August", "2009/2010", or NA value, use year from column Date instead.
You can use coalesce
+ str_extract
:
library(dplyr)
library(stringr)
df %>%
mutate(year = coalesce(str_extract(year, "^\\d{4}$"), str_extract(date, "\\d{4}")))
# # A tibble: 6 × 2
# year date
# <chr> <chr>
# 1 2010 August 31st, 2010
# 2 2014 March 13th, 2015
# 3 2010 May 31st, 2010
# 4 2010 June 16th, 2010
# 5 2015 May 18th, 2010
# 6 2010 April 7th, 2010