Search code examples
rdplyrtimezoneaggregateposixct

Aggregate function and timezone


I have two sections of code that theoretically do the same thing:

Mn_min_max_D <- with(Mn, aggregate(Depth ~ as.Date(Date_time), FUN =  function(x) c(Min = min(x), Max = max(x))))
Mn_min_max_D <- do.call(data.frame, Mn_min_max_D)
names(Mn_min_max_D)[names(Mn_min_max_D) == "as.Date.Date_time."] <- "Date"

min_max_D <- with(Mn, aggregate(Depth ~ as.Date(Date), FUN =  function(x) c(Min = min(x), Max = max(x))))
min_max_D <- do.call(data.frame, min_max_D)
names(Mn_min_max_D)[names(min_max_D) == "as.Date.Date_time."] <- "Date"

However the output values are different. On inspecting the max depths, I can see that for some reason the timezone is being ignored on the first piece of code. For example the max depth happens at '2013-10-26 22:33:00', but with the time zone correction this is actually '2013-10-27 07:33:00'.

The $Date value comes from this code:

Mn$Date_time <- as.POSIXct(Mn$Date_time, format="%Y-%m-%d %H:%M:%S", tz = "Asia/Tokyo")
    Mn$Date <- format(as.POSIXct(Mn$Date_time, format="%YYYY/%m/%d %H:%M:%S"), format = "%Y/%m/%d")
    Mn$Date <- as.Date(Mn$Date, "%Y/%m/%d")

It seems that maybe the process of removing the time fixes the date. I need to understand where the issue stems from to make sure i don't make a mistake in the future.

I think I may need to do a %>% mutate with a tz but don't understand how at the moment. or maybe use dplyr to aggregate instead as below, but I've tried and the result is the same.

test <- Mn %>% group_by(as.Date(Date_time))%>% dplyr::summarise(min = min(Depth), max = max(Depth))

Example data:

Date_time Depth
2013-10-14 12:30:00 64.45
2013-10-14 12:30:05 65.95
2013-10-14 12:30:10 65.95
2013-10-14 12:30:15 66.45
2013-10-14 12:30:20 67.95
2013-10-14 12:30:25 66.95

Solution

  • In the present format the data does not carry the time zone so the default time zone is being used. If you are aware of the time zone for those timestamps it's better to control for it explicitly.

    dta <- with(
        asNamespace("readr"),
        read_table(
            file = "
    Date_time Depth
    2013-10-14-12:30:00 64.45
    2013-10-14-12:30:05 65.95
    2013-10-14-12:30:10 65.95
    2013-10-14-12:30:15 66.45
    2013-10-14-12:30:20 67.95
    2013-10-14-12:30:25 66.95",
    col_types = cols(
        Date_time = col_datetime(format = "%Y-%m-%d-%H:%M:%S"),
        Depth = col_double()
    )
        )
    )
    
    library("lubridate")
    library("tidyverse")
    dta %>%
        mutate(DT_tz = force_tz(Date_time, tzone = "GMT"),
               DT_tz_NYC = with_tz(Date_time, tzone = "America/New_York"))
    

    Explanation

    Consider the following:

    • tz(now()) returns an empty string
    • Sys.timezone() returns local time zone, "Europe/London" in my case
    • tz(as.Date(now())) returns "UTC"

    Without specifying time zones R falls on your local settings

    as.POSIXlt(Sys.time(), "America/New_York")
    # "2022-03-18 12:43:10 EDT"
    as.POSIXlt(Sys.time())
    # "2022-03-18 16:43:16 GMT"
    

    This can get a little fiddly.

    tz(as.POSIXlt(Sys.time()))
    # [1] "Europe/London"
    tz(as.Date(as.POSIXlt(Sys.time())))
    # "UTC"
    

    In particular, it's worth showing that using as.Date will strip out the time zone information.

    tz(as.Date(as.POSIXlt(Sys.time())))
    "UTC"
    tz(as.Date(as.POSIXlt(Sys.time()), tz = "Africa/Abidjan"))
    "UTC"
    

    Solution

    If dealing with timestamps it's always advisable to ensure that the timezone information is recoded within that data, or as an alternative, less robust option, stated explicitly within the script. Personally, I'm of a view that a time zone component is integral part of the timestamp and should reside with the data. Stripping time zone information from time stamp leads to confusion when localised timestamps differ. Significant differences may result in different dates (consider 2hr time zone difference and events taking place close to midnight, etc.).