I'm working with a dataset that has a lot of timestamps. Some of them are invalid, and I try to identify those and set them to NA. Because if_else() forces both arms to have the same data type, I'm using as.POSIXct(NA) to encode such missing values.
Interestingly, the results differ when I invert the test (and swap the true and false arguments) in if_else().
Here is some code to illustrate my problem:
library(dplyr)  # tibble(), if_else()
library(readr)  # parse_datetime()

x <- tibble(
  A = parse_datetime("2020-08-18 19:00"),
  B = if_else(TRUE, A, as.POSIXct(NA)),
  C = if_else(FALSE, as.POSIXct(NA), A)
)
> x
# A tibble: 1 x 3
A B C
<dttm> <dttm> <dttm>
1 2020-08-18 19:00:00 2020-08-18 19:00:00 2020-08-18 21:00:00
Any idea why C is two hours later?
Based on the great answers below, I think a more readable solution is to generate a missing datetime object with parse_datetime(NA_character_) and use it in the code instead of as.POSIXct().
R> NA_datetime_ <- parse_datetime(NA_character_)
R> x <- tibble(
     A = parse_datetime("2020-08-18 19:00"),
     B = if_else(TRUE, A, NA_datetime_),
     C = if_else(FALSE, NA_datetime_, A)
   )
R> map(x, lubridate::tz)
$A
[1] "UTC"
$B
[1] "UTC"
$C
[1] "UTC"
First, you need to know that parse_datetime() returns a date-time object whose tzone attribute defaults to UTC. You can use lubridate::tz(x$A) and attributes(x$A) to check it.
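As a rough illustration of what the tzone attribute does, here is a small base R sketch (no tidyverse needed). The zone name "Europe/Berlin" is only an assumption, standing in for a local zone that is two hours ahead of UTC in August; it is not taken from the question.

```r
# A date-time stored with an explicit UTC time zone attribute.
a <- as.POSIXct("2020-08-18 19:00", tz = "UTC")

# Copy it and change only the tzone attribute (assumed local zone).
b <- a
attr(b, "tzone") <- "Europe/Berlin"

# The underlying number of seconds since the epoch is unchanged ...
same_instant <- as.numeric(a) == as.numeric(b)

# ... but the clock time that gets displayed follows the attribute.
shown_a <- format(a, "%Y-%m-%d %H:%M")
shown_b <- format(b, "%Y-%m-%d %H:%M")

print(same_instant)
print(shown_a)
print(shown_b)
```

Both values refer to the same instant; only the displayed clock time differs (19:00 versus 21:00 under the assumed zone).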
The documentation of if_else() says that the true and false arguments must be the same type, and that all other attributes are taken from true. Hence, in part C of your tibble:
C = if_else(FALSE, as.POSIXct(NA), A)
as.POSIXct(NA) is passed as true and doesn't have a tzone attribute, so A's tzone is dropped and the result falls back to the time zone of your locale. C is therefore not actually two hours later: the three columns hold the same instant but are displayed in different time zones. To fix it, give as.POSIXct(NA) a tzone attribute, i.e. replace it with
as.POSIXct(NA_character_, tz = "UTC")
Note: You must use NA_character_ instead of NA because as.POSIXct() applies the tz argument when it converts character input, whereas here it has no effect on a plain logical NA.
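A quick way to convince yourself that this replacement value is what you want is to inspect it directly. The sketch below uses base R only; the variable names (na_utc, is_missing, is_datetime, zone) are just illustrative.

```r
# Typed missing date-time that carries an explicit UTC time zone.
na_utc <- as.POSIXct(NA_character_, tz = "UTC")

# It is still a missing value of class POSIXct ...
is_missing  <- is.na(na_utc)
is_datetime <- inherits(na_utc, "POSIXct")

# ... and it has the tzone attribute that if_else() can pick up.
zone <- attr(na_utc, "tzone")

print(is_missing)
print(is_datetime)
print(zone)
```

Because the value now has the same class and the same tzone as A, it should not matter which of the two ends up as the true argument.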
Finally, revise your code as follows:
x <- tibble(
  A = parse_datetime("2020-08-18 19:00"),
  B = if_else(TRUE, A, as.POSIXct(NA_character_, tz = "UTC")),
  C = if_else(FALSE, as.POSIXct(NA_character_, tz = "UTC"), A)
)
# # A tibble: 1 x 3
# A B C
# <dttm> <dttm> <dttm>
# 1 2020-08-18 19:00:00 2020-08-18 19:00:00 2020-08-18 19:00:00
Remember to check their time zones.
R > lubridate::tz(x$A)
[1] "UTC"
R > lubridate::tz(x$B)
[1] "UTC"
R > lubridate::tz(x$C)
[1] "UTC"
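If you have many date-time columns, checking them one by one gets tedious. A possible shortcut, sketched here in base R on a small stand-in data frame (df and its columns are hypothetical, not your actual data), is to collect the tzone attribute of every column at once:

```r
# Stand-in data frame with two POSIXct columns, both tagged as UTC.
df <- data.frame(
  A = as.POSIXct("2020-08-18 19:00", tz = "UTC"),
  B = as.POSIXct(NA_character_, tz = "UTC")
)

# Read the tzone attribute of each column into a named character vector.
zones <- vapply(df, function(col) attr(col, "tzone"), character(1))

# TRUE only if every column is tagged with UTC.
all_utc <- all(zones == "UTC")

print(zones)
print(all_utc)
```

This assumes every column is a POSIXct vector with a tzone attribute set; a column without one would make vapply() fail, which is itself a useful signal.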