Search code examples
rstringrgrepl

Why does grepl work but not str_detect for mutate depending on row value?


I have been trying to wrap my head around this.

I need to create a corrected column based on detecting a specific comment at another "error" column in my database. I can work around this with grepl, but I am struggling with getting str_detect to work as well (it is usually faster for big datasets).

Here is an example database:

test <- tibble(
  id = seq(1:30),
  date = sample(seq(as.Date('2000/01/01'), as.Date('2018/01/01'), by="day"), 30),
  error = c(rep(NA, 3), "wrong date! Correct date = 01.03.2022",
            rep(NA, 5), "wrong date! Correct date = 01.05.2021",
            rep(NA, 5), "wrong date! Correct date = 01.03.2022",
            rep(NA, 7), "wrong date! Correct date = 01.05.2021",
            rep(NA, 2), "date already corrected on 01.05.2021",
            NA, "date already corrected on 01.03.2022", NA))

I first tried to create a new "date_corr" column with str_detect:

test %>%
  mutate(date_corr=if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
         date_corr=if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))

This yields:

A tibble: 30 × 4
      id date       error                                 date_corr 
   <int> <date>     <chr>                                 <date>    
 1     1 2010-04-28 NA                                    NA        
 2     2 2004-06-30 NA                                    NA        
 3     3 2015-09-25 NA                                    NA        
 4     4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
 5     5 2008-07-16 NA                                    NA        
 6     6 2004-08-02 NA                                    NA        
 7     7 2001-10-15 NA                                    NA        
 8     8 2007-07-21 NA                                    NA        
 9     9 2014-04-19 NA                                    NA        
10    10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows

Adding rowwise is irrelevant:

test %>%
  rowwise() %>%
  mutate(date_corr=if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
         date_corr=if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))

A tibble: 30 × 4
# Rowwise: 
      id date       error                                 date_corr 
   <int> <date>     <chr>                                 <date>    
 1     1 2010-04-28 NA                                    NA        
 2     2 2004-06-30 NA                                    NA        
 3     3 2015-09-25 NA                                    NA        
 4     4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
 5     5 2008-07-16 NA                                    NA        
 6     6 2004-08-02 NA                                    NA        
 7     7 2001-10-15 NA                                    NA        
 8     8 2007-07-21 NA                                    NA        
 9     9 2014-04-19 NA                                    NA        
10    10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows

However, with grepl I get the desired outcome, regardless of rowwise:

test %>%
  mutate(date_corr=if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
         date_corr=if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))

# A tibble: 30 × 4
      id date       error                                 date_corr 
   <int> <date>     <chr>                                 <date>    
 1     1 2010-04-28 NA                                    2010-04-28
 2     2 2004-06-30 NA                                    2004-06-30
 3     3 2015-09-25 NA                                    2015-09-25
 4     4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
 5     5 2008-07-16 NA                                    2008-07-16
 6     6 2004-08-02 NA                                    2004-08-02
 7     7 2001-10-15 NA                                    2001-10-15
 8     8 2007-07-21 NA                                    2007-07-21
 9     9 2014-04-19 NA                                    2014-04-19
10    10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows

test %>%
  rowwise() %>%
  mutate(date_corr=if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
         date_corr=if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))

A tibble: 30 × 4
# Rowwise: 
      id date       error                                 date_corr 
   <int> <date>     <chr>                                 <date>    
 1     1 2010-04-28 NA                                    2010-04-28
 2     2 2004-06-30 NA                                    2004-06-30
 3     3 2015-09-25 NA                                    2015-09-25
 4     4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
 5     5 2008-07-16 NA                                    2008-07-16
 6     6 2004-08-02 NA                                    2004-08-02
 7     7 2001-10-15 NA                                    2001-10-15
 8     8 2007-07-21 NA                                    2007-07-21
 9     9 2014-04-19 NA                                    2014-04-19
10    10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows

What I am missing here?


Solution

  • The difference is how they handle NA values

    str_detect(NA, "missing")
    # [1] NA
    grepl("missing", NA)
    # [1] FALSE
    

    And note that if you have an NA value in the condition for if_else, it will also preserve the NA value

    if_else(NA, 1, 2)
    # [1] NA
    

    The str_detect preserved the NA value. It's not clear what the "right" value should be. But if you want str_detect to have the same values as grepl, you can be explicit about not changing NA values

    test %>%
      mutate(date_corr=if_else(!is.na(error) & str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
             date_corr=if_else(!is.na(error) & str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))