Why does lubridate's parse_date_time work with lapply, but fail with sapply?


Given: the following 4x2 dataframe

df <- as.data.frame(
  stringsAsFactors = FALSE,
  matrix(
    c("2014-01-13 12:08:02", "2014-01-13 12:19:46",
      "2014-01-14 09:59:09", "2014-01-14 10:05:09",
      "6-18-2016 17:43:42",  "6-18-2016 18:06:59",
      "6-27-2016 12:16:47",  "6-27-2016 12:29:05"),
    nrow = 4, ncol = 2, byrow = TRUE
  )
)
colnames(df) <- c("starttime", "stoptime")

Goal: the same dataframe but with all the values replaced by the return value of the following lubridate function call:

library(lubridate)

f <- function(column) {
  parse_date_time(column, orders = c("ymd_hms", "mdy_hms"), tz = "ETZ")
}

Here's the sapply call, whose result contains strange integers:

df2 <- sapply(df, FUN = f) # has values like `1467030545`

And here's the lapply call, which works as expected:

df2 <- lapply(df, FUN = f) # has values like `2016-06-27 12:29:05`

I understand sapply returns the simplest data structure it can while lapply returns a list. I was prepared to follow up the sapply call with df2 <- data.frame(df2) to end up with a data frame as desired. My question is:

Why does the parse_date_time function behave as expected in the lapply but not in the sapply?


Solution

  • The reason is that sapply has simplify = TRUE by default, so when the list elements all have the same length or dimensions, it simplifies the result to a vector or matrix. Internally, date-time classes are stored as numeric,

    typeof(parse_date_time(df$starttime, orders = c("ymd_hms", "mdy_hms"), tz = "ETZ"))
    #[1] "double"
    

    while the class is 'POSIXct'

    class(parse_date_time(df$starttime, orders = c("ymd_hms", "mdy_hms"), tz = "ETZ"))
    #[1] "POSIXct" "POSIXt"  
    

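    so sapply falls back to the underlying numeric values (seconds since 1970-01-01 UTC) when it builds the matrix, and the class attribute is dropped. lapply returns a list, in which each element keeps its class.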

    If we want a data.frame, we can create a copy of 'df' and assign into it with [] so the result keeps the same structure as 'df':

    df2 <- df
    df2[] <-  lapply(df, FUN = function(column) {
         parse_date_time(column, orders = c("ymd_hms", "mdy_hms"), tz = "ETZ")
       })
    
    df2
    #           starttime            stoptime
    #1 2014-01-13 12:08:02 2014-01-13 12:19:46
    #2 2014-01-14 09:59:09 2014-01-14 10:05:09
    #3 2016-06-18 17:43:42 2016-06-18 18:06:59
    #4 2016-06-27 12:16:47 2016-06-27 12:29:05
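
The class-dropping behaviour can also be reproduced without lubridate. The following is a minimal sketch under a few assumptions: base R's as.POSIXct stands in for parse_date_time, the time zone is set to "UTC", and the names x, parse_utc, as_list, as_matrix and restored are hypothetical and used only for illustration. It also shows one way to turn the numbers from sapply back into date-times.

```r
# Base R only; as.POSIXct stands in for lubridate::parse_date_time.
x <- data.frame(
  starttime = c("2014-01-13 12:08:02", "2014-01-14 09:59:09"),
  stoptime  = c("2014-01-13 12:19:46", "2014-01-14 10:05:09"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse one character column as POSIXct in UTC.
parse_utc <- function(column) as.POSIXct(column, tz = "UTC")

as_list   <- lapply(x, parse_utc)  # list of POSIXct vectors, class kept
as_matrix <- sapply(x, parse_utc)  # simplified to a matrix, class dropped

inherits(as_list$starttime, "POSIXct")  # TRUE
is.matrix(as_matrix)                    # TRUE
typeof(as_matrix)                       # "double"

# The matrix holds seconds since 1970-01-01 UTC, so a column can be
# converted back by supplying the origin explicitly.
restored <- as.POSIXct(as_matrix[, "starttime"],
                       origin = "1970-01-01", tz = "UTC")
```

Converting back works, but it has to be repeated for every column, which is why the lapply with df2[] assignment shown above is usually the simpler route when a data frame is the goal.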