I want to convert a character string to a timestamp in an arrow table. I am using arrow because I am handling a large number of sizeable csvs.
I succeed in converting the string to a datetime object in a data frame but the same operation produces NA
s for the arrow table.
How can I convert the string to a timestamp in an arrow table?
Please see the example below.
library(arrow)
library(dplyr)
# Set up data
df <- data.frame(
timestamp_origin = c("1/9/2023 5:52:34 PM", "1/9/2023 5:52:40 PM"),
value = c(128.094, 128.094)
)
# Working example
df |>
mutate(
timestamp = strptime(
timestamp_origin,
format = "%m/%d/%Y %I:%M:%S %p",
tz = "CET"
)
)
#> timestamp_origin value timestamp
#> 1 1/9/2023 5:52:34 PM 128.094 2023-01-09 17:52:34
#> 2 1/9/2023 5:52:40 PM 128.094 2023-01-09 17:52:40
# Arrow table produces NAs
df |>
arrow_table() |>
mutate(
timestamp = strptime(
timestamp_origin,
format = "%m/%d/%Y %I:%M:%S %p",
tz = "CET"
)
) |>
collect()
#> timestamp_origin value timestamp
#> 1 1/9/2023 5:52:34 PM 128.094 <NA>
#> 2 1/9/2023 5:52:40 PM 128.094 <NA>
While PR13506 is pending, you can work around this (do as much lazily as possible) with this hack.
Steps: determine if it is "PM"; extract the hour; add 12 to hour if PM; recompose timestamp_origin
; use strptime
(or whichever) as desired. This all works lazily as it succeeds before the collect()
call.
df <- data.frame(
timestamp_origin = c("1/9/2023 12:52:34 PM", "1/9/2023 5:52:40 PM"),
value = c(128.094, 128.094)
)
df |>
arrow_table() |>
mutate(
isPM = grepl("PM", timestamp_origin),
isPM = isPM & !grepl(" 12:", timestamp_origin),
timestamp2 = sub("\\s*[AP]M$", "", timestamp_origin),
hour = as.integer(sub(".* (\\d{1,2}):.*", "\\1", timestamp2)),
hour = hour + if_else(isPM, 12, 0),
timestamp2 = paste0(sub(" .*", " ", timestamp2), hour, sub(".* \\d{1,2}:", ":", timestamp2)),
timestamp = strptime(
timestamp2,
format = "%m/%d/%Y %H:%M:%S",
tz = "CET"
)
) |>
select(-isPM, -hour, -timestamp2) |>
collect()
# timestamp_origin value timestamp
# 1 1/9/2023 12:52:34 PM 128.094 2023-01-09 12:52:34
# 2 1/9/2023 5:52:40 PM 128.094 2023-01-09 17:52:40
(The biggest reason I can consider to go through this level or work is if you need to filter
your data lazily based on ranges of timestamp
. This code enables this need.)
The second isPM
is to guard against 12:
being mis-converted to 24:
; sample data has been adjusted to test/confirm this.