Conversion error using the parse_date_time() function of lubridate in R


I am using lubridate's parse_date_time() function in R to convert character strings of dates to a date class for the vector

x <- c("12 April, 1971", "2015-12-20", "21/08/2021", "06/23/97", "Oct 10, 2010")

I get the first element as "1971-12-19 UTC" instead of "1971-04-12 UTC". Here is the code used with the parse_date_time function:

dates <- parse_date_time(x, orders = c("d B, Y", "Y-m-d", "d/m/Y", "m/d/y", "b d, Y"))


> dates
[1] "1971-12-19 UTC" "2015-12-20 UTC" "2021-08-21 UTC" "1997-06-23 UTC" "2010-10-10 UTC"

I also tried the base R function as.Date() and got the same error.


Solution

  • You can turn off training, the step in which lubridate analyses the vector to decide which of the orders to try first. Training does not always lead to the right result, and its outcome can change once the data changes.

    library(lubridate)
    
    parse_date_time(x,
      orders = c("d b, Y", "Y-m-d", "d/m/Y", "m/d/y", "b d, Y"), train = FALSE)
    [1] "1971-04-12 UTC" "2015-12-20 UTC" "2021-08-21 UTC" "1997-06-23 UTC"
    [5] "2010-10-10 UTC"
    

    According to the docs, this still does some guessing. To turn off training and guessing completely, set exact = TRUE and write the orders as full strptime-style formats (notice the added %); this should work, too:

    parse_date_time(x,
      orders = c("%d %B, %Y", "%Y-%m-%d", "%d/%m/%Y", "%m/%d/%y", "%b %d, %Y"), exact = TRUE)
    

    When to use training (and guessing) depends, as the wording suggests, on your data set. If you know that only a few odd formats are sparsely spread throughout the data, training may be able to detect them even if the orders are not listed in the right sequence. Either way, you will have to run tests to find out how well training performs on your data.

    Why can training be wrong?

    If you have a small data set, the priorities that training initially assigns to the orders can be a wrong guess, which leads to wrongly parsed dates. As said at the beginning, this will change once the data grows. Unfortunately, there is no definitive threshold above which training gives 100% correct results.
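
One way to run such tests without lubridate is to count, for each input string, how many strict strptime() formats match it. The sketch below is only an illustration and not part of the original answer: it uses base R, the names formats, matches and n_matches are made up here, and the last input value is a hypothetical ambiguous date added for demonstration. Month names are assumed to be English in the current locale.

```r
# Inputs from the question, plus one made-up ambiguous value (assumption, for illustration only)
x <- c("12 April, 1971", "2015-12-20", "21/08/2021", "06/23/97",
       "Oct 10, 2010", "03/04/2005")

# The strptime-style formats used in the answer
formats <- c("%d %B, %Y", "%Y-%m-%d", "%d/%m/%Y", "%m/%d/%y", "%b %d, %Y")

# Logical matrix: one row per input, one column per format,
# TRUE where as.Date() can parse the input with that format
matches <- vapply(formats,
                  function(f) !is.na(as.Date(x, format = f)),
                  logical(length(x)))
rownames(matches) <- x

# Number of formats that parse each input
n_matches <- rowSums(matches)

x[n_matches == 0]  # inputs that no format can parse
x[n_matches > 1]   # inputs that more than one format can parse
```

Strings matched by no format, or by more than one, are candidates for wrong results when formats are guessed, so they are worth checking by hand.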


    A base R as.Date approach

    lapply(x, as.Date, 
      tryFormats = c("%d %B, %Y", "%Y-%m-%d", "%d/%m/%Y", "%m/%d/%y", "%b %d, %Y"))
    [[1]]
    [1] "1971-04-12"
    
    [[2]]
    [1] "2015-12-20"
    
    [[3]]
    [1] "2021-08-21"
    
    [[4]]
    [1] "1997-06-23"
    
    [[5]]
    [1] "2010-10-10"
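
The lapply() call above returns a list. If a single Date vector is preferred, a minimal sketch along the same lines is to try the formats one after another and fill in only the elements that are still NA. The helper name parse_mixed_dates is hypothetical, and month names are assumed to be English in the current locale.

```r
# Hypothetical helper: try each strptime format in turn, element by element,
# and return one Date vector instead of a list
parse_mixed_dates <- function(x, formats) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)          # elements not parsed by an earlier format
    if (!any(todo)) break       # stop early once everything is parsed
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

x <- c("12 April, 1971", "2015-12-20", "21/08/2021", "06/23/97", "Oct 10, 2010")
formats <- c("%d %B, %Y", "%Y-%m-%d", "%d/%m/%Y", "%m/%d/%y", "%b %d, %Y")

dates <- parse_mixed_dates(x, formats)
dates
```

As with tryFormats, the first format that parses an element wins, so the order of the formats still matters for ambiguous strings. Elements that no format can parse stay NA.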