Formatting date column with different formats (including missing day information) - lubridate

I'm relatively new to R. I downloaded a clinical trial dataset and noticed that the dates in the relevant column come in mixed formats: most look like "September 1, 2012", but some are missing the day (e.g. "October 2015").

I want to express them all in the same format (e.g. yyyy-mm-dd) so that I can work with them. That part went fine; the only remaining problem is the name of the output column. In the last function (date_correction) I planned to include an argument "output_col" to which I can pass the intended name of the created (formatted) column, but the column is always named output_col instead.

Do you know how I could handle this, i.e. how to pass the intended name of the output column into the function?
Is there a better way to solve my problem? I even tried passing a more complex orders argument to lubridate::parse_date_time, like

parse_date_time(input_col, orders="mdy", "my")

but this didn't work.

Here's the code:

library("tidyverse")
library("lubridate")

Observation <- c(seq(1:5))
Date_original <- c("October 2014","August 2014","June 2013",
                   "June 24, 2010","January 2005")

df_dates <- data.frame(Observation, Date_original)

# looking for a comma in the cell
comma_detect <- function(a_string){
  str_detect(a_string, ",")
}

# if comma: assume "mdy", if not apply "my" -> return formatted value
date_correction_row <- function(input_col){
  if_else(comma_detect(input_col),
          parse_date_time(input_col, orders="mdy"),
          parse_date_time(input_col, orders="my"))
}

# prepare function for dataframe:
date_correction <- function(df, input_col, output_col){
  mutate(df, output_col = date_correction_row(input_col))
}

df_dates %>% date_correction(df_dates$Date_original, date_formatted) %>% view()

OUTPUT

  Observation Date_original output_col
1           1  October 2014 2014-10-01
2           2   August 2014 2014-08-01
3           3     June 2013 2013-06-01
4           4 June 24, 2010 2010-06-24
5           5  January 2005 2005-01-01

Solution

The code below assumes that output_col equals "Date". All three approaches set the column name from that variable, give no warnings and return Date class values.

1) Try each format and take the one that does not give NA. This uses only base R.

output_col <- "Date"

within(df_dates, assign(output_col, pmin(na.rm = TRUE,
 as.Date(Date_original, "%B %d, %Y"), 
 as.Date(paste(Date_original, 1), "%B %Y %d"))))
##   Observation Date_original       Date
## 1           1  October 2014 2014-10-01
## 2           2   August 2014 2014-08-01
## 3           3     June 2013 2013-06-01
## 4           4 June 24, 2010 2010-06-24
## 5           5  January 2005 2005-01-01
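If the base R approach should live in a function that takes the column names as arguments, one possible sketch is below. The function name date_correction_base and its arguments are made up for illustration, and it assumes an English locale, since %B matches month names in the current locale.

```r
# Hypothetical helper: input_col and output_col are column names given as strings.
date_correction_base <- function(df, input_col, output_col) {
  x <- as.character(df[[input_col]])
  # "June 24, 2010" style; month-year strings give NA here
  full <- as.Date(x, "%B %d, %Y")
  # "October 2014" style; append a day of 1 so that as.Date can parse it
  month_only <- as.Date(paste(x, 1), "%B %Y %d")
  # exactly one of the two is non-NA per row, so pmin picks that one
  df[[output_col]] <- pmin(full, month_only, na.rm = TRUE)
  df
}

df_dates <- data.frame(
  Observation = 1:5,
  Date_original = c("October 2014", "August 2014", "June 2013",
                    "June 24, 2010", "January 2005"))

date_correction_base(df_dates, "Date_original", "date_formatted")
```

Using df[[output_col]] <- ... sidesteps the naming problem entirely, because the string held in output_col is used as the column name.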

2) This can also be done in lubridate. It is important that my is the first argument to coalesce rather than the second: my returns NA for values that do not match its format, whereas mdy returns a wrong date for them, so if mdy came first, coalesce would never fall through to my. This approach is shorter than (3), but you might prefer the robustness of (3), since it does not depend on what is returned for non-matching dates.

library(dplyr)
library(lubridate)

output_col <- "Date"

df_dates %>% 
  mutate(!!output_col := coalesce(my(Date_original, quiet = TRUE), 
    mdy(Date_original)))
##   Observation Date_original       Date
## 1           1  October 2014 2014-10-01
## 2           2   August 2014 2014-08-01
## 3           3     June 2013 2013-06-01
## 4           4 June 24, 2010 2010-06-24
## 5           5  January 2005 2005-01-01
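To see the behaviour that the argument order relies on, it can help to print what each parser returns on its own for one value of each kind. This is only a sketch; x, as_my and as_mdy are names made up for illustration, and the particular wrong date that mdy produces for a month-year string may depend on the lubridate version.

```r
library(lubridate)

x <- c("October 2014", "June 24, 2010")

# quiet = TRUE suppresses the "failed to parse" warnings
as_my  <- my(x, quiet = TRUE)   # expected to be NA for the full date
as_mdy <- mdy(x, quiet = TRUE)  # inspect what it returns for "October 2014"

print(data.frame(x, as_my, as_mdy))
```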

3) If you prefer your own method of first checking for a comma, here is a more compact variation of it. It uses my and mdy instead of parse_date_time because my and mdy return Date class results, which are more appropriate here than the POSIXct returned by parse_date_time, given that there are no times.

library(dplyr)
library(lubridate)

output_col <- "Date"

df_dates %>%
  mutate(!!output_col := if_else(grepl(",", Date_original),
    mdy(Date_original), my(Date_original, quiet = TRUE)))
##   Observation Date_original       Date
## 1           1  October 2014 2014-10-01
## 2           2   August 2014 2014-08-01
## 3           3     June 2013 2013-06-01
## 4           4 June 24, 2010 2010-06-24
## 5           5  January 2005 2005-01-01
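As for why the original function always produced a column called output_col: in mutate(df, output_col = ...), the name on the left of = is taken literally and is not looked up as a variable. A possible way to wrap approach (3) into a function is sketched below, assuming both column names are passed as strings; the signature is an illustrative choice, not the only option.

```r
library(dplyr)
library(lubridate)

# Sketch: input_col and output_col are column names given as strings.
date_correction <- function(df, input_col, output_col) {
  df %>%
    mutate(!!output_col := if_else(
      grepl(",", .data[[input_col]]),
      mdy(.data[[input_col]], quiet = TRUE),  # "June 24, 2010" style
      my(.data[[input_col]], quiet = TRUE)))  # "October 2014" style
}

df_dates <- data.frame(
  Observation = 1:5,
  Date_original = c("October 2014", "August 2014", "June 2013",
                    "June 24, 2010", "January 2005"))

df_dates %>% date_correction("Date_original", "date_formatted")
```

Here .data[[input_col]] looks up the input column by its string name, and !!output_col := uses the string's value as the new column name. Both parsers are called with quiet = TRUE because if_else evaluates both branches on every row.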