I am trying to import in R a text file including datetimes. Times are stored in character format, without timezone information, but we know it is French time (Europe/Paris).
An issue arise for the days of timezone change: e.g. there is a time change from 2018-10-28 03:00:00 CEST
to 2018-10-28 02:00:00 CET
, thus we have duplicates in our character format, and R cannot tell wether it is CEST
or CET
.
Consider the following example:
data_in <- "date,val
2018-10-28 01:30:00,25
2018-10-28 02:00:00,26
2018-10-28 02:30:00,27
2018-10-28 02:00:00,28
2018-10-28 02:30:00,29
2018-10-28 03:00:00,30"
library(readr)
data <- read_delim(data_in, ",", locale = locale(tz = "Europe/Paris"))
We end up having duplicates in our dates:
data$date
[1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CET" "2018-10-28 02:00:00 CEST"
[5] "2018-10-28 02:30:00 CET" "2018-10-28 03:00:00 CET"
Expected output would be:
data$date
[1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CEST" "2018-10-28 02:00:00 CET"
[5] "2018-10-28 02:30:00 CET" "2018-10-28 03:00:00 CET"
Any idea how to solve the issue (besides telling people to use UTC or ISO formats). I guess the only way is to suppose the dates are sorted, so we can tell the first ones are CEST
.
If you are certain that your time is always-increasing, then you can look for an apparent decrease (of time-of-day) and manually insert the TZ offset to the string, then parse as usual. I added some logic to look for this decrease only around 2-3am so that if you have multiple days of data spanning midnight, you would not get a false-alarm.
data <- read.csv(text = data_in)
fakedate <- as.POSIXct(gsub("^[-0-9]+ ", "2000-01-01 ", data$date))
decreases <- cumany(grepl(" 0[23]:", data$date) & c(FALSE, diff(fakedate) < 0))
data$date <- paste(data$date, ifelse(decreases, "+0100", "+0200"))
data
# date val
# 1 2018-10-28 01:30:00 +0200 25
# 2 2018-10-28 02:00:00 +0200 26
# 3 2018-10-28 02:30:00 +0200 27
# 4 2018-10-28 02:00:00 +0100 28
# 5 2018-10-28 02:30:00 +0100 29
# 6 2018-10-28 03:00:00 +0100 30
as.POSIXct(data$date, format="%Y-%m-%d %H:%M:%S %z", tz="Europe/Paris")
# [1] "2018-10-28 01:30:00 CEST" "2018-10-28 02:00:00 CEST" "2018-10-28 02:30:00 CEST"
# [4] "2018-10-28 02:00:00 CET" "2018-10-28 02:30:00 CET" "2018-10-28 03:00:00 CET"
My use of "2000-01-01"
was just some non-DST day so that we can parse the timestamp into POSIXt
and calculate a diff on it. (If we didn't insert a date, we could still use as.POSIXct
with a format, but if you ever ran this on one of the two DST days, you might get different results since as.POSIXct("01:02:03", format="%H:%M:%S")
always assumes "today".
This is obviously a bit fragile with its assumptions, but perhaps it'll be good enough for what you need.