Search code examples
rdplyrstringr

Unexpected warnings with case_when and regex conditions suggest too many cases are matching


I have a data set with messy date/time formatting in some hh:mm formats and Excel serial numbers. So I've coerced everything into a string and I'm using stringr and readr within a large case_when block to identify different formats and process them properly. I think I'm misunderstanding either my stringr functions or case_when because I'm getting the output I expect, but it's throwing warnings of parsing failures and NA coercion that aren't in the final product.

Here are some dummy data with an example of each of the formats in my data set:

dummy <- tibble(x = c("13:15:21", "02:03:17+01:00", "12:03", "0.1234"))

I've made a function to identify and parse each of these formats. It calls on another function to change excel serial codes into times. It uses regular expressions which I think are correct, but to summarise:

  • ^0\\. should identify Excel serial numbers by the fact they are all decimals <1
  • \\+ is identifying the times with the DST timezone indicator by searching for the plus sign then
  • .+(?=\\+) is extracting everything before the plus sign to parse
  • : is testing for a colon to make sure the result is some kind of time. This is a broader test so it's coming last after the pluses have already been matched
convert_times <- function(x){
  case_when(str_detect(x, "^0\\.")        ~ convert_excel_time(x), 
            str_detect(x, "\\+")          ~ parse_time(str_extract(x, ".+(?=\\+)")), 
            str_detect(x, ":")            ~ parse_time(x),
            .default = NA)
}

convert_excel_time <- function(x){
  as.numeric(x) * 24 * 60 * 60 %>%
  as_datetime() %>% 
  hms::as_hms()
}

When I run it, I get the expected output, but the warnings that come along with it suggest to me I'm not understanding what's happening under the hood.

> dummy %>% 
+   mutate(new = convert_time(x))
# A tibble: 4 × 2
  x              new        
  <chr>          <time>     
1 13:15:21       13:15:21.00
2 02:03:17+01:00 02:03:17.00
3 12:03          12:03:00.00
4 0.1234         02:57:41.76

These are my errors

[[1]]
<warning/rlang_warning>
Warning in `mutate()`:
ℹ In argument: `new = convert_time(x)`.
Caused by warning in `convert_excel_time()`:
! NAs introduced by coercion
---
Backtrace:
    ▆
 1. ├─dummy %>% mutate(new = convert_time(x))
 2. ├─dplyr::mutate(., new = convert_time(x))
 3. └─dplyr:::mutate.data.frame(., new = convert_time(x))

[[2]]
<warning/rlang_warning>
Warning in `mutate()`:
ℹ In argument: `new = convert_time(x)`.
Caused by warning:
! 2 parsing failures.
row col   expected         actual
  2  -- time like  02:03:17+01:00
  4  -- time like  0.1234        
---
Backtrace:
    ▆
 1. ├─dummy %>% mutate(new = convert_time(x))
 2. ├─dplyr::mutate(., new = convert_time(x))
 3. └─dplyr:::mutate.data.frame(., new = convert_time(x))

It seems to me, convert_time shouldn't be trying to parse those two observations at all since they are excluded by the left side of the case_when block. Similarly, I didn't expect NA coercion since the left hand side of the case_when prevents convert_excel_time() from seeing the hh:mm strings. Many thanks.


Solution

  • Gah! I didn't read all the documentation. The dplyr reference (https://dplyr.tidyverse.org/reference/case_when.html) clearly says case_when always solves all the RHS equations, which is why they throw a warning, but only uses the ones that match the LHS conditions.

    # `case_when()` evaluates all RHS expressions, and then constructs its
    # result by extracting the selected (via the LHS expressions) parts.
    # In particular `NaN`s are produced in this case:
    y <- seq(-2, 2, by = .5)
    case_when(
      y >= 0 ~ sqrt(y),
      .default = y
    )
    #> Warning: NaNs produced
    #> [1] -2.0000000 -1.5000000 -1.0000000 -0.5000000  0.0000000  0.7071068
    #> [7]  1.0000000  1.2247449  1.4142136