Search code examples
rdatedatetimedata-munging

Munging dates in R


I'm working on a data-munging pipeline with a lot of date columns in the data. Many R functions (e.g., set operations, sapply, etc.) do no preserve date class, converting the dates to integers.

The strategies I see to deal with this are:

  1. Making sure that each function in the data-munging pipeline accepts and returns dates formatted as dates. Disadvantage: figuring out all the places to stick as.Date() is often tedious.
  2. Living with dates as integers in all the munging steps, converting them to dates only at the end. This has the disadvantage of making date manipulation (e.g., sequencing with by = "month") in the intermediate munging steps impossible without first converting to date.

Any other options I'm missing? Is there a way to make R play nice with dates? To clarify, the data I'm dealing with is not just a time series: multiple columns contain dates. So, as far as I can tell, xts is of limited usefulness.


Solution

  • The does not preserve Date class misfeature is an artefact of R itself, and how some base R functions are implemented. See e.g.

    R> dates <- Sys.Date() + 0:2
    R> for (d in dates) cat(d, "\n")
    17532 
    17533 
    17534 
    R> 
    

    Essentially, the S3 class attributes gets dropped when you do certain vector operations:

    R> as.vector(dates)
    [1] 17532 17533 17534
    R> 
    

    So my recommendation is to pick a good container type you like and stick with it to do the operations there. I like data.table a lot for this. A quick example:

    R> suppressMessages(library(data.table))
    R> dt <- data.table(date=Sys.Date()+0:2, other=Sys.Date() + cumsum(runif(3)*100))
    R> dt[, diff:=other-date][]
             date      other           diff
    1: 2018-01-01 2018-03-30  88.88445 days
    2: 2018-01-02 2018-06-09 158.23913 days
    3: 2018-01-03 2018-07-30 208.62187 days
    R> dt[, month:=month(other)][]
             date      other           diff month
    1: 2018-01-01 2018-03-30  88.88445 days     3
    2: 2018-01-02 2018-06-09 158.23913 days     6
    3: 2018-01-03 2018-07-30 208.62187 days     7
    R> 
    

    Not only does the Date type persist (as evidenced by the difference operation returning a difftime object), but you also gets lots of helper functions (like month()) here. Grouping by date is also natural.