Search code examples
rcsvdatetimeposixcttimestamp-with-timezone

Timezone in timestamps imported from csv


I have a dataframe imported with fread from a csv file (converted from .xlsx). After import, class(inputData$timestamp) is Posixct and attr(inputData$timestamp, "tzone") is "".

As explained in this thread, I'm trying to split the dataframes by day, but they get split at 2am:

byDay <- split(inputData, as.Date(inputData$timestamp))
> byDay[[1]]
              timestamp value
1   2016-09-05 00:01:00     0
2   2016-09-05 00:02:00     0
3   2016-09-05 00:03:00     0
[...]
118 2016-09-05 01:58:00     0
119 2016-09-05 01:59:00     0

>byDay[[2]]
              timestamp value
120 2016-09-05 02:00:00     0
121 2016-09-05 02:01:00     0
122 2016-09-05 02:02:00     0
[...]
1558 2016-09-06 01:58:00    0
1559 2016-09-06 01:59:00    0

and so on.

I understand the date is implicitly different from what's shown on the timestamp:

> byDay[[1]]$timestamp[1]
"2016-09-05 00:01:00"
> as.Date(byDay[[1]]$timestamp[1])
"2016-09-04"

I'm not interested in timezones so I tried just setting all dates to UTC:

> attr(inputData$timestamp, "tzone") <- "UTC"
> byDay <- split(inputData, as.Date(inputData$timestamp))
> byDay[[1]]
              timestamp apAvg
1   2016-09-04 22:01:00     0
2   2016-09-04 22:02:00     0
3   2016-09-04 22:03:00     0
[...]
117 2016-09-04 23:57:00     0
118 2016-09-04 23:58:00     0
119 2016-09-04 23:59:00     0

This implies the timestamps actually start at 22:00. However in the csv file, the first timestamp is 05.09.2016 00:00:00 and I can't see a sign of timezone formatting. Is there still a possibility that the timezone problems come from the csv file? Or do I handle the timestamps wrong in R? How to deal with this accurately?


Solution

  • I would follow @alistaire's advice in setting the time zone when parsing from string. In fact, this appears to have happened since

    attr(inputData$timestamp, "tzone") is ""

    afterwards and "" is the default to set the data to your local timezone.

    The issue is that in using as.Date with a class POSIXct input, the default for tz is "UTC" and not your local timezone. To get what you want:

    1. Do not change the data's timezone to "UTC"
    2. Use as.Date(inputData$timestamp, tz="") in your split.

    To illustrate, let's say the data is:

    inputData<- structure(list(timestamp = structure(c(1473048000, 1473051600, 
    1473055200, 1473058800, 1473062400, 1473066000, 1473069600, 1473073200, 
    1473076800, 1473080400, 1473084000, 1473087600, 1473091200, 1473094800, 
    1473098400, 1473102000, 1473105600, 1473109200, 1473112800, 1473116400, 
    1473120000, 1473123600, 1473127200, 1473130800, 1473134400, 1473134460, 
    1473134520, 1473134580), class = c("POSIXct", "POSIXt"), tzone = ""), 
        value = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
        0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
        0L)), .Names = c("timestamp", "value"), row.names = c(NA, 
    -28L), class = "data.frame")
    ##             timestamp value
    ##1  2016-09-05 00:00:00     0
    ##2  2016-09-05 01:00:00     0
    ##3  2016-09-05 02:00:00     0
    ##4  2016-09-05 03:00:00     0
    ##5  2016-09-05 04:00:00     0
    ##6  2016-09-05 05:00:00     0
    ##7  2016-09-05 06:00:00     0
    ##8  2016-09-05 07:00:00     0
    ##9  2016-09-05 08:00:00     0
    ##10 2016-09-05 09:00:00     0
    ##11 2016-09-05 10:00:00     0
    ##12 2016-09-05 11:00:00     0
    ##13 2016-09-05 12:00:00     0
    ##14 2016-09-05 13:00:00     0
    ##15 2016-09-05 14:00:00     0
    ##16 2016-09-05 15:00:00     0
    ##17 2016-09-05 16:00:00     0
    ##18 2016-09-05 17:00:00     0
    ##19 2016-09-05 18:00:00     0
    ##20 2016-09-05 19:00:00     0
    ##21 2016-09-05 20:00:00     0
    ##22 2016-09-05 21:00:00     0
    ##23 2016-09-05 22:00:00     0
    ##24 2016-09-05 23:00:00     0
    ##25 2016-09-06 00:00:00     0
    ##26 2016-09-06 00:01:00     0
    ##27 2016-09-06 00:02:00     0
    ##28 2016-09-06 00:03:00     0
    

    with:

    attr(inputData$timestamp, "tzone")
    ##[1] ""
    

    Doing your split using as.Date with the default tz parameter:

    byDay <- split(inputData, as.Date(inputData$timestamp))
    byDay[[1]]
    ##             timestamp value
    ##1  2016-09-05 00:00:00     0
    ##2  2016-09-05 01:00:00     0
    ##3  2016-09-05 02:00:00     0
    ##4  2016-09-05 03:00:00     0
    ## ...
    ##17 2016-09-05 16:00:00     0
    ##18 2016-09-05 17:00:00     0
    ##19 2016-09-05 18:00:00     0
    ##20 2016-09-05 19:00:00     0
    byDay[[2]]
    ##             timestamp value
    ##21 2016-09-05 20:00:00     0
    ##22 2016-09-05 21:00:00     0
    ##23 2016-09-05 22:00:00     0
    ##24 2016-09-05 23:00:00     0
    ##25 2016-09-06 00:00:00     0
    ##26 2016-09-06 00:01:00     0
    ##27 2016-09-06 00:02:00     0
    ##28 2016-09-06 00:03:00     0
    

    Now, I'm on the east coast of the United States so I'm on EDT now. This is why 22:00:00 here is split to the next day in UTC. Do the same split with as.Date(inputData$timestamp,tz=""):

    byDay <- split(inputData, as.Date(inputData$timestamp,tz=""))
    byDay[[1]]
    ##             timestamp value
    ##1  2016-09-05 00:00:00     0
    ##2  2016-09-05 01:00:00     0
    ##3  2016-09-05 02:00:00     0
    ##4  2016-09-05 03:00:00     0
    ## ...
    ##21 2016-09-05 20:00:00     0
    ##22 2016-09-05 21:00:00     0
    ##23 2016-09-05 22:00:00     0
    ##24 2016-09-05 23:00:00     0
    byDay[[2]]
    ##             timestamp value
    ##25 2016-09-06 00:00:00     0
    ##26 2016-09-06 00:01:00     0
    ##27 2016-09-06 00:02:00     0
    ##28 2016-09-06 00:03:00     0