Search code examples
rtimezonelubridateposixctreadr

Read character datetimes without timezones


I am trying to import in R a text file including datetimes. Times are stored in character format, without timezone information, but we know it is French time (Europe/Paris).

An issue arise for the days of timezone change: e.g. there is a time change from 2018-10-28 03:00:00 CEST to 2018-10-28 02:00:00 CET, thus we have duplicates in our character format, and R cannot tell wether it is CEST or CET.

Consider the following example:

data_in <- "date,val
2018-10-28 01:30:00,25
2018-10-28 02:00:00,26
2018-10-28 02:30:00,27
2018-10-28 02:00:00,28
2018-10-28 02:30:00,29
2018-10-28 03:00:00,30"

library(readr)
data <- read_delim(data_in, ",", locale = locale(tz = "Europe/Paris"))

We end up having duplicates in our dates:

data$date
[1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CET"  "2018-10-28 02:00:00 CEST"
[5] "2018-10-28 02:30:00 CET"  "2018-10-28 03:00:00 CET" 

Expected output would be:

data$date
[1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CEST"  "2018-10-28 02:00:00 CET"
[5] "2018-10-28 02:30:00 CET"  "2018-10-28 03:00:00 CET" 

Any idea how to solve the issue (besides telling people to use UTC or ISO formats). I guess the only way is to suppose the dates are sorted, so we can tell the first ones are CEST.


Solution

  • If you are certain that your time is always-increasing, then you can look for an apparent decrease (of time-of-day) and manually insert the TZ offset to the string, then parse as usual. I added some logic to look for this decrease only around 2-3am so that if you have multiple days of data spanning midnight, you would not get a false-alarm.

    data <- read.csv(text = data_in)
    fakedate <- as.POSIXct(gsub("^[-0-9]+ ", "2000-01-01 ", data$date))
    decreases <- cumany(grepl(" 0[23]:", data$date) & c(FALSE, diff(fakedate) < 0))
    data$date <- paste(data$date, ifelse(decreases, "+0100", "+0200"))
    data
    #                        date val
    # 1 2018-10-28 01:30:00 +0200  25
    # 2 2018-10-28 02:00:00 +0200  26
    # 3 2018-10-28 02:30:00 +0200  27
    # 4 2018-10-28 02:00:00 +0100  28
    # 5 2018-10-28 02:30:00 +0100  29
    # 6 2018-10-28 03:00:00 +0100  30
    
    as.POSIXct(data$date, format="%Y-%m-%d %H:%M:%S %z", tz="Europe/Paris")
    # [1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CEST"
    # [4] "2018-10-28 02:00:00 CET"  "2018-10-28 02:30:00 CET"  "2018-10-28 03:00:00 CET" 
    

    My use of "2000-01-01" was just some non-DST day so that we can parse the timestamp into POSIXt and calculate a diff on it. (If we didn't insert a date, we could still use as.POSIXct with a format, but if you ever ran this on one of the two DST days, you might get different results since as.POSIXct("01:02:03", format="%H:%M:%S") always assumes "today".

    This is obviously a bit fragile with its assumptions, but perhaps it'll be good enough for what you need.