
Messy date formats in data frame


I set myself a task that I cannot solve: I have a data frame with the start dates and end dates of some projects. Some end_date entries are wrong and contain the duration of the project instead of its end date.

start_date <- c("2017-05-04", "2016-04-01", "2013-12-12", "2011-05-11", "2010-04-10", "2009-01-01")
end_date <- c("2020-01-01", "2020-01-06", "3 years", "36 months", "2020-01-01", "2020-01-01")
df <- data.frame(start_date, end_date)

  start_date   end_date
1 2017-05-04 2020-01-01
2 2016-04-01 2020-01-06
3 2013-12-12    3 years
4 2011-05-11  36 months
5 2010-04-10 2020-01-01
6 2009-01-01 2020-01-01


How can I calculate the missing end dates from the durations and convert the column to Date format? Note also that start_date and end_date are stored as factors.


Solution

  • You can call as.Date on end_date, then apply lubridate::as.duration to the values that fail to parse (i.e. come back as NA) and add the result to start_date:

    library(lubridate)
    #> 
    #> Attaching package: 'lubridate'
    #> The following object is masked from 'package:base':
    #> 
    #>     date
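    start_date <- c("2017-05-04", "2016-04-01", "2013-12-12", "2011-05-11", "2010-04-10", "2009-01-01")
    end_date <- c("2020-01-01", "2020-01-06", "3 years", "36 months", "2020-01-01", "2020-01-01")
    # stringsAsFactors = FALSE keeps end_date as character (before R 4.0 the default was factor)
    df <- data.frame(start_date = as.Date(start_date), end_date, stringsAsFactors = FALSE)
    # With an explicit format, entries that are not dates (e.g. "3 years") become NA
    df$EndDate <- as.Date(df$end_date, format = "%Y-%m-%d")
    
    # For each row that failed to parse, read end_date as a duration and add it to start_date
    for (i in which(is.na(df$EndDate))) {
      df$EndDate[i] <- as.Date(df$start_date[i] + as.duration(df$end_date[i]))
    }
    df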
    #>   start_date   end_date    EndDate
    #> 1 2017-05-04 2020-01-01 2020-01-01
    #> 2 2016-04-01 2020-01-06 2020-01-06
    #> 3 2013-12-12    3 years 2016-12-11
    #> 4 2011-05-11  36 months 2014-05-10
    #> 5 2010-04-10 2020-01-01 2020-01-01
    #> 6 2009-01-01 2020-01-01 2020-01-01
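
Note that rows 3 and 4 above land one day before the calendar anniversary, because a lubridate duration is a fixed length of time rather than a number of calendar years or months. If calendar arithmetic is what you want, a vectorised variant with periods is one possible sketch. It is not part of the original answer: it assumes that `period()` can parse strings such as "3 years" and "36 months", and the helper names `start`, `raw_end`, `end` and `is_span` are made up for illustration. It also starts from factor columns, as described in the question.

```r
library(lubridate)

# Same sample data as in the question, deliberately built as factors.
df <- data.frame(
  start_date = c("2017-05-04", "2016-04-01", "2013-12-12",
                 "2011-05-11", "2010-04-10", "2009-01-01"),
  end_date   = c("2020-01-01", "2020-01-06", "3 years",
                 "36 months", "2020-01-01", "2020-01-01"),
  stringsAsFactors = TRUE
)

# Convert factors to character before parsing (hypothetical helper names).
start   <- as.Date(as.character(df$start_date), format = "%Y-%m-%d")
raw_end <- as.character(df$end_date)

# Entries that are not ISO dates become NA instead of raising an error.
end <- as.Date(raw_end, format = "%Y-%m-%d")

# Rows whose end_date held a time span rather than a date.
is_span <- is.na(end)

# period() parses the span text; %m+% adds calendar years/months and rolls
# back to the last day of the month if the target day does not exist.
end[is_span] <- start[is_span] %m+% period(raw_end[is_span])

df$EndDate <- end
df
```

With this approach row 3 should come out as 2016-12-12 and row 4 as 2014-05-11, and no loop is needed because both the parsing and the addition work on whole vectors.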