
dplyr::if_else changes datetime (POSIXct) values


I'm working with a dataset that has a lot of timestamps. There are some invalid timestamps which I try to identify and set to NA. Because if_else() forces me to have the same data type in both arms, I'm using as.POSIXct(NA) to encode such missing values.

Interestingly, the results differ when I invert the test (and swap the true and false arguments) in if_else().

Here is some code to illustrate my problem:

library(dplyr)  # if_else(), tibble()
library(readr)  # parse_datetime()

x <- tibble(
  A = parse_datetime("2020-08-18 19:00"),
  B = if_else(TRUE,               A, as.POSIXct(NA)),
  C = if_else(FALSE, as.POSIXct(NA),              A)
)

> x
# A tibble: 1 x 3
  A                   B                   C                  
  <dttm>              <dttm>              <dttm>             
1 2020-08-18 19:00:00 2020-08-18 19:00:00 2020-08-18 21:00:00

Any idea why C is two hours later?

Follow-up:

Based on the great answers below, I think a more readable solution is to create a missing datetime value with parse_datetime(NA_character_) and use it in the code instead of as.POSIXct(NA).

R> NA_datetime_ <- parse_datetime(NA_character_)

R> x <- tibble(
  A = parse_datetime("2020-08-18 19:00"),
  B = if_else(TRUE,             A, NA_datetime_),
  C = if_else(FALSE, NA_datetime_,            A)
)

R> map(x, lubridate::tz)
$A
[1] "UTC"

$B
[1] "UTC"

$C
[1] "UTC"
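For readers who prefer not to depend on readr for this, the same idea can be sketched in base R. This is only an illustration: the constant name NA_datetime_utc is made up for this example, and the construction is the as.POSIXct(NA_character_, tz = "UTC") call suggested in the answer below.

```r
# Hypothetical constant name (not part of any package), analogous to NA_character_.
NA_datetime_utc <- as.POSIXct(NA_character_, tz = "UTC")

# It is a missing value of class POSIXct ...
class(NA_datetime_utc)
is.na(NA_datetime_utc)

# ... and it carries an explicit time zone, which is the property that matters here.
attr(NA_datetime_utc, "tzone")

# For comparison: inspect what a bare as.POSIXct(NA) carries in your R version.
attr(as.POSIXct(NA), "tzone")
```

Defining the constant once keeps the if_else() calls short and makes the intended time zone explicit in a single place.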

Solution

  • First, you need to know that parse_datetime() returns a date-time object whose tzone attribute defaults to UTC. You can check this with lubridate::tz(x$A) and attributes(x$A).

    According to the documentation of if_else(), the true and false arguments must be the same type, and all other attributes are taken from true. Hence, in column C of your tibble:

    C = if_else(FALSE, as.POSIXct(NA), A)
    

    as.POSIXct(NA) has no tzone attribute, and because it is passed as true, the result takes its attributes from it: A's tzone is dropped and the values are displayed in your local time zone. C is therefore not actually two hours later. All three columns hold the same instant; they are only displayed in different time zones. To fix this, give the missing value a tzone attribute, i.e. replace as.POSIXct(NA) with

    as.POSIXct(NA_character_, tz = "UTC")
    

    Note: Use NA_character_ instead of NA here. In the R version used for this answer, as.POSIXct() ignores the tz argument for a bare logical NA, whereas character input is parsed with the given tz.


    Finally, revise your code as follows:

    x <- tibble(
      A = parse_datetime("2020-08-18 19:00"),
      B = if_else(TRUE, A, as.POSIXct(NA_character_, tz = "UTC")),
      C = if_else(FALSE, as.POSIXct(NA_character_, tz = "UTC"), A)
    )
    
    # # A tibble: 1 x 3
    #   A                   B                   C                  
    #   <dttm>              <dttm>              <dttm>             
    # 1 2020-08-18 19:00:00 2020-08-18 19:00:00 2020-08-18 19:00:00
    

    You can then confirm that all three columns share the same time zone:

    R> lubridate::tz(x$A)
    [1] "UTC"
    R> lubridate::tz(x$B)
    [1] "UTC"
    R> lubridate::tz(x$C)
    [1] "UTC"
    [1] "UTC"
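The claim that the columns hold the same instant and differ only in how they are displayed can be checked with a small base R sketch. It does not call if_else() at all; it simply simulates column C by stripping the tzone attribute from a UTC value. The zone "Europe/Berlin" is an assumption standing in for the asker's local time zone, and the variable names are invented for this example.

```r
# The value from column A: 19:00 UTC.
a <- as.POSIXct("2020-08-18 19:00:00", tz = "UTC")

# Simulate column C: same stored number of seconds, but no tzone attribute.
c_col <- a
attr(c_col, "tzone") <- NULL

# The underlying instants are identical.
same_instant <- as.numeric(a) == as.numeric(c_col)
same_instant
# [1] TRUE

# Only the rendering depends on the time zone that is requested.
fmt <- "%Y-%m-%d %H:%M:%S"
utc_text <- format(a, fmt, tz = "UTC", usetz = TRUE)
utc_text
# [1] "2020-08-18 19:00:00 UTC"

# Assumed local zone; this relies on the system's time zone database.
format(a, fmt, tz = "Europe/Berlin", usetz = TRUE)

# An already-computed column can be repaired by restoring the attribute;
# the stored instant is left unchanged.
attr(c_col, "tzone") <- "UTC"
fixed_text <- format(c_col, fmt, usetz = TRUE)
fixed_text
# [1] "2020-08-18 19:00:00 UTC"
```

Restoring the attribute after the fact is an alternative to changing the NA value passed to if_else(); lubridate::with_tz() offers the same kind of display-only change for readers already using lubridate.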