Search code examples
rconditional-statementstidyverselubridate

Why are lubridate parsing functions failing when used within a case_when() statement?


I have data with mixed frequencies and the date variable is a character vector. I am trying to use case_when() to parse the dates depending on the assigned frequency. The yq() and ym() functions each work fine if I split the dataframe by frequency and don't use the conditional logic. However, I am getting warnings about parsing failures when I use case_when() on the mixed frequency table. Specifically, the failures are attributed to months 05-12 in monthly data. The parsing functions appear to return the expected result, but I am concerned whether that will always be the case and whether there's a pitfall I'm missing. What is the actual parsing failure and why is it happening?

Reprex:

library(dplyr, quietly = TRUE)
library(lubridate, quietly = TRUE)
library(purrr, quietly = TRUE)
library(stringr, quietly = TRUE)

### Data - construct date columns in the format of my data
qtrs <- paste(":0", 1:4, sep = "")
mnths <- c(paste(":0", 1:9, sep = ""), paste(":", 10:12, sep = ""))

yr_qtrs <- map(1979:1980, ~str_c(.x, qtrs)) %>% list_c()
yr_mnths <- map(1979:1980, ~str_c(.x, mnths)) %>% list_c()

#### Create separate tables and assign frequency variable
qtr_tbl <- tibble(date = yr_qtrs, freq = "quarterly")
mnth_tbl <- tibble(date = yr_mnths, freq = "monthly")

#### Create a mixed frequency tibble
mixed_tbl <- bind_rows(qtr_tbl, mnth_tbl)

### Quarters parse fine with yq()
qtr_tbl %>% 
  mutate(date_new = yq(date))
#> # A tibble: 8 × 3
#>   date    freq      date_new  
#>   <chr>   <chr>     <date>    
#> 1 1979:01 quarterly 1979-01-01
#> 2 1979:02 quarterly 1979-04-01
#> 3 1979:03 quarterly 1979-07-01
#> 4 1979:04 quarterly 1979-10-01
#> 5 1980:01 quarterly 1980-01-01
#> 6 1980:02 quarterly 1980-04-01
#> 7 1980:03 quarterly 1980-07-01
#> 8 1980:04 quarterly 1980-10-01

### Months parse fine with ym()
mnth_tbl %>% 
  mutate(date_new = ym(date))
#> # A tibble: 24 × 3
#>    date    freq    date_new  
#>    <chr>   <chr>   <date>    
#>  1 1979:01 monthly 1979-01-01
#>  2 1979:02 monthly 1979-02-01
#>  3 1979:03 monthly 1979-03-01
#>  4 1979:04 monthly 1979-04-01
#>  5 1979:05 monthly 1979-05-01
#>  6 1979:06 monthly 1979-06-01
#>  7 1979:07 monthly 1979-07-01
#>  8 1979:08 monthly 1979-08-01
#>  9 1979:09 monthly 1979-09-01
#> 10 1979:10 monthly 1979-10-01
#> # ℹ 14 more rows

### Chooses the parsing function based on the freq variable for the mixed freq tbl.
### It gives 16 warnings, which correspond to months 05:12 for each of the two years

mixed_tbl %>% 
  mutate(date_new = case_when(freq == "monthly"   ~ ym(date),
                              freq == "quarterly" ~ yq(date),
                              TRUE                ~ NA_Date_
                              )
         ) %>% print(n = 32)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `date_new = case_when(...)`.
#> Caused by warning:
#> !  16 failed to parse.
#> # A tibble: 32 × 3
#>    date    freq      date_new  
#>    <chr>   <chr>     <date>    
#>  1 1979:01 quarterly 1979-01-01
#>  2 1979:02 quarterly 1979-04-01
#>  3 1979:03 quarterly 1979-07-01
#>  4 1979:04 quarterly 1979-10-01
#>  5 1980:01 quarterly 1980-01-01
#>  6 1980:02 quarterly 1980-04-01
#>  7 1980:03 quarterly 1980-07-01
#>  8 1980:04 quarterly 1980-10-01
#>  9 1979:01 monthly   1979-01-01
#> 10 1979:02 monthly   1979-02-01
#> 11 1979:03 monthly   1979-03-01
#> 12 1979:04 monthly   1979-04-01
#> 13 1979:05 monthly   1979-05-01
#> 14 1979:06 monthly   1979-06-01
#> 15 1979:07 monthly   1979-07-01
#> 16 1979:08 monthly   1979-08-01
#> 17 1979:09 monthly   1979-09-01
#> 18 1979:10 monthly   1979-10-01
#> 19 1979:11 monthly   1979-11-01
#> 20 1979:12 monthly   1979-12-01
#> 21 1980:01 monthly   1980-01-01
#> 22 1980:02 monthly   1980-02-01
#> 23 1980:03 monthly   1980-03-01
#> 24 1980:04 monthly   1980-04-01
#> 25 1980:05 monthly   1980-05-01
#> 26 1980:06 monthly   1980-06-01
#> 27 1980:07 monthly   1980-07-01
#> 28 1980:08 monthly   1980-08-01
#> 29 1980:09 monthly   1980-09-01
#> 30 1980:10 monthly   1980-10-01
#> 31 1980:11 monthly   1980-11-01
#> 32 1980:12 monthly   1980-12-01

Thanks in advance for feedback!


Solution

  • case_when does not control what will be evaluated nor does it mask input vectors by conditions, meaning that yq(date) is exactly what it is -- yq() called with a date vector. Just to illustrate what's happening here, we could make ym() and yq() calls a bit more verbose:

    library(dplyr, quietly = TRUE)
    library(lubridate, quietly = TRUE)
    library(purrr, quietly = TRUE)
    library(stringr, quietly = TRUE)
    
    # override ym & yq to get an idea how those are called
    ym <- \(dates, ...) {message(">> ym() called with: ", paste(dates, collapse = " ")); lubridate::ym(dates,...)}
    yq <- \(dates, ...) {message(">> yq() called with: ", paste(dates, collapse = " ")); lubridate::yq(dates,...)}
    
    # test how it works:
    ym(c("2000:04","2000:05"))
    #> >> ym() called with: 2000:04 2000:05
    #> [1] "2000-04-01" "2000-05-01"
    
    yq(c("2000:04","2000:05"))
    #> >> yq() called with: 2000:04 2000:05
    #> Warning: 1 failed to parse.
    #> [1] "2000-10-01" NA
    
    # now with case_when() to see if input vectors 
    # for ym() and yq() are different:
    mixed_tbl %>% 
      mutate(date_new = case_when(freq == "monthly"   ~ ym(date),
                                  freq == "quarterly" ~ yq(date),
                                  TRUE                ~ NA_Date_
      )
      ) %>% 
      glimpse()
    #> >> ym() called with: 1979:01 1979:02 1979:03 1979:04 1980:01 1980:02 1980:03 1980:04 1979:01 1979:02 1979:03 1979:04 1979:05 1979:06 1979:07 1979:08 1979:09 1979:10 1979:11 1979:12 1980:01 1980:02 1980:03 1980:04 1980:05 1980:06 1980:07 1980:08 1980:09 1980:10 1980:11 1980:12
    #> >> yq() called with: 1979:01 1979:02 1979:03 1979:04 1980:01 1980:02 1980:03 1980:04 1979:01 1979:02 1979:03 1979:04 1979:05 1979:06 1979:07 1979:08 1979:09 1979:10 1979:11 1979:12 1980:01 1980:02 1980:03 1980:04 1980:05 1980:06 1980:07 1980:08 1980:09 1980:10 1980:11 1980:12
    #> Warning: There was 1 warning in `mutate()`.
    #> ℹ In argument: `date_new = case_when(...)`.
    #> Caused by warning:
    #> !  16 failed to parse.
    #> Rows: 32
    #> Columns: 3
    #> $ date     <chr> "1979:01", "1979:02", "1979:03", "1979:04", "1980:01", "1980:…
    #> $ freq     <chr> "quarterly", "quarterly", "quarterly", "quarterly", "quarterl…
    #> $ date_new <date> 1979-01-01, 1979-04-01, 1979-07-01, 1979-10-01, 1980-01-01, …
    

    One possible solution, split by freq, apply apropriate function(s) on each part and bind the resulting list back to a single df:

    mixed_tbl %>% 
      split(~freq) %>% 
      map_at("monthly",   ~ mutate(.x, date_new = ym(date))) %>% 
      map_at("quarterly", ~ mutate(.x, date_new = yq(date))) %>% 
      list_rbind() %>% 
      glimpse()
    #> >> ym() called with: 1979:01 1979:02 1979:03 1979:04 1979:05 1979:06 1979:07 1979:08 1979:09 1979:10 1979:11 1979:12 1980:01 1980:02 1980:03 1980:04 1980:05 1980:06 1980:07 1980:08 1980:09 1980:10 1980:11 1980:12
    #> >> yq() called with: 1979:01 1979:02 1979:03 1979:04 1980:01 1980:02 1980:03 1980:04
    #> Rows: 32
    #> Columns: 3
    #> $ date     <chr> "1979:01", "1979:02", "1979:03", "1979:04", "1979:05", "1979:…
    #> $ freq     <chr> "monthly", "monthly", "monthly", "monthly", "monthly", "month…
    #> $ date_new <date> 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-01, 1979-05-01, …
    

    Input data:

    mixed_tbl <- structure(list(
      date = c("1979:01", "1979:02", "1979:03", "1979:04", 
    "1980:01", "1980:02", "1980:03", "1980:04", "1979:01", "1979:02", 
    "1979:03", "1979:04", "1979:05", "1979:06", "1979:07", "1979:08", 
    "1979:09", "1979:10", "1979:11", "1979:12", "1980:01", "1980:02", 
    "1980:03", "1980:04", "1980:05", "1980:06", "1980:07", "1980:08", 
    "1980:09", "1980:10", "1980:11", "1980:12"), freq = c("quarterly", 
    "quarterly", "quarterly", "quarterly", "quarterly", "quarterly", 
    "quarterly", "quarterly", "monthly", "monthly", "monthly", "monthly", 
    "monthly", "monthly", "monthly", "monthly", "monthly", "monthly", 
    "monthly", "monthly", "monthly", "monthly", "monthly", "monthly", 
    "monthly", "monthly", "monthly", "monthly", "monthly", "monthly", 
    "monthly", "monthly")), class = c("tbl_df", "tbl", "data.frame"
    ), row.names = c(NA, -32L))
    

    Created on 2023-05-19 with reprex v2.0.2