I have a data set with messy date/time formatting in some hh:mm formats and Excel serial numbers. So I've coerced everything into a string and I'm using stringr
and readr
within a large case_when
block to identify different formats and process them properly. I think I'm misunderstanding either my stringr
functions or case_when
because I'm getting the output I expect, but it's throwing warnings of parsing failures and NA
coercion that aren't in the final product.
Here are some dummy data with an example of each of the formats in my data set:
dummy <- tibble(x = c("13:15:21", "02:03:17+01:00", "12:03", "0.1234"))
I've made a function to identify and parse each of these formats. It calls on another function to change excel serial codes into times. It uses regular expressions which I think are correct, but to summarise:
^0\\.
should identify Excel serial numbers by the fact they are all
decimals <1\\+
is identifying the times with the DST timezone
indicator by searching for the plus sign then.+(?=\\+)
is
extracting everything before the plus sign to parse:
is testing
for a colon to make sure the result is some kind of time. This is a
broader test so it's coming last after the pluses have already been
matchedconvert_times <- function(x){
case_when(str_detect(x, "^0\\.") ~ convert_excel_time(x),
str_detect(x, "\\+") ~ parse_time(str_extract(x, ".+(?=\\+)")),
str_detect(x, ":") ~ parse_time(x),
.default = NA)
}
convert_excel_time <- function(x){
as.numeric(x) * 24 * 60 * 60 %>%
as_datetime() %>%
hms::as_hms()
}
When I run it, I get the expected output, but the warnings that come along with it suggest to me I'm not understanding what's happening under the hood.
> dummy %>%
+ mutate(new = convert_time(x))
# A tibble: 4 × 2
x new
<chr> <time>
1 13:15:21 13:15:21.00
2 02:03:17+01:00 02:03:17.00
3 12:03 12:03:00.00
4 0.1234 02:57:41.76
These are my errors
[[1]]
<warning/rlang_warning>
Warning in `mutate()`:
ℹ In argument: `new = convert_time(x)`.
Caused by warning in `convert_excel_time()`:
! NAs introduced by coercion
---
Backtrace:
▆
1. ├─dummy %>% mutate(new = convert_time(x))
2. ├─dplyr::mutate(., new = convert_time(x))
3. └─dplyr:::mutate.data.frame(., new = convert_time(x))
[[2]]
<warning/rlang_warning>
Warning in `mutate()`:
ℹ In argument: `new = convert_time(x)`.
Caused by warning:
! 2 parsing failures.
row col expected actual
2 -- time like 02:03:17+01:00
4 -- time like 0.1234
---
Backtrace:
▆
1. ├─dummy %>% mutate(new = convert_time(x))
2. ├─dplyr::mutate(., new = convert_time(x))
3. └─dplyr:::mutate.data.frame(., new = convert_time(x))
It seems to me, convert_time
shouldn't be trying to parse those two observations at all since they are excluded by the left side of the case_when
block. Similarly, I didn't expect NA
coercion since the left hand side of the case_when
prevents convert_excel_time()
from seeing the hh:mm strings. Many thanks.
Gah! I didn't read all the documentation. The dplyr
reference (https://dplyr.tidyverse.org/reference/case_when.html) clearly says case_when
always solves all the RHS equations, which is why they throw a warning, but only uses the ones that match the LHS conditions.
# `case_when()` evaluates all RHS expressions, and then constructs its
# result by extracting the selected (via the LHS expressions) parts.
# In particular `NaN`s are produced in this case:
y <- seq(-2, 2, by = .5)
case_when(
y >= 0 ~ sqrt(y),
.default = y
)
#> Warning: NaNs produced
#> [1] -2.0000000 -1.5000000 -1.0000000 -0.5000000 0.0000000 0.7071068
#> [7] 1.0000000 1.2247449 1.4142136