Search code examples
rapache-arrow

Converting characters to timestamp in an arrow table in R


I want to convert a character string to a timestamp in an arrow table. I am using arrow because I am handling a large number of sizeable csvs.

I succeed in converting the string to a datetime object in a data frame but the same operation produces NAs for the arrow table.

How can I convert the string to a timestamp in an arrow table?

Please see the example below.

library(arrow)
library(dplyr)

# Set up data
df <- data.frame(
  timestamp_origin = c("1/9/2023 5:52:34 PM", "1/9/2023 5:52:40 PM"),
  value = c(128.094, 128.094)
)

# Working example
df |> 
  mutate(
    timestamp = strptime(
      timestamp_origin, 
      format = "%m/%d/%Y %I:%M:%S %p", 
      tz = "CET"
    )
  ) 
#>      timestamp_origin   value           timestamp
#> 1 1/9/2023 5:52:34 PM 128.094 2023-01-09 17:52:34
#> 2 1/9/2023 5:52:40 PM 128.094 2023-01-09 17:52:40

# Arrow table produces NAs
df |> 
  arrow_table() |> 
  mutate(
    timestamp = strptime(
      timestamp_origin, 
      format = "%m/%d/%Y %I:%M:%S %p", 
      tz = "CET"
    )
  ) |> 
  collect()
#>      timestamp_origin   value timestamp
#> 1 1/9/2023 5:52:34 PM 128.094      <NA>
#> 2 1/9/2023 5:52:40 PM 128.094      <NA>

Solution

  • While PR13506 is pending, you can work around this (do as much lazily as possible) with this hack.

    Steps: determine if it is "PM"; extract the hour; add 12 to hour if PM; recompose timestamp_origin; use strptime (or whichever) as desired. This all works lazily as it succeeds before the collect() call.

    df <- data.frame(
      timestamp_origin = c("1/9/2023 12:52:34 PM", "1/9/2023 5:52:40 PM"),
      value = c(128.094, 128.094)
    )
    df |>
      arrow_table() |> 
      mutate(
        isPM = grepl("PM", timestamp_origin),
        isPM = isPM & !grepl(" 12:", timestamp_origin),
        timestamp2 = sub("\\s*[AP]M$", "", timestamp_origin),
        hour = as.integer(sub(".* (\\d{1,2}):.*", "\\1", timestamp2)),
        hour = hour + if_else(isPM, 12, 0),
        timestamp2 = paste0(sub(" .*", " ", timestamp2), hour, sub(".* \\d{1,2}:", ":", timestamp2)),
        timestamp = strptime(
          timestamp2, 
          format = "%m/%d/%Y %H:%M:%S", 
          tz = "CET"
        )
      ) |> 
      select(-isPM, -hour, -timestamp2) |>
      collect()
    #       timestamp_origin   value           timestamp
    # 1 1/9/2023 12:52:34 PM 128.094 2023-01-09 12:52:34
    # 2  1/9/2023 5:52:40 PM 128.094 2023-01-09 17:52:40
    

    (The biggest reason I can consider to go through this level or work is if you need to filter your data lazily based on ranges of timestamp. This code enables this need.)

    The second isPM is to guard against 12: being mis-converted to 24:; sample data has been adjusted to test/confirm this.