r regex tidyverse stringr

Flawed logic with RegEx and numeric ranges


I'm trying to create a new variable called 'group' in a dataset called 'data'. The variable 'group' should take the value "A" or "B" depending on how another variable in the dataset (of character type) ends. It just so happens that they end in a number from 7 to 24 after an underscore, as follows:

[image: sample values of the character variable]

So, I want the new variable 'group' to be "A" when the ending number is 7 to 15 both inclusive, and "B" when the ending number is 16 to 24, again both inclusive.

I tried this mutate() function using str_detect() to discriminate within the character variable of interest:

data %>%
  mutate(group = case_when(str_detect(string = year, pattern = "[7-9]|1[0-5]$") ~ "A",
                           str_detect(string = year, pattern = "1[6-9]|2[0-4]$") ~ "B"))

However, the resulting output is not quite right, as you can see below.

[image: resulting output]

What is wrong with either the case_when() logic or the RegEx itself, such that the numbers 16 to 19 also get the value "A"?

Thanks in advance!


Solution

  • Here is another option using the strex package:

    library(dplyr)
    library(strex)
    
    data |> 
      mutate(group = case_when(between(str_last_number(year), 7, 15) ~ "A",
                               between(str_last_number(year), 16, 24) ~ "B"))
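
As for why the original patterns misfire, the likely cause is precedence inside the regex: `|` binds more loosely than `$`, so `"[7-9]|1[0-5]$"` reads as "`[7-9]` anywhere in the string, or `1[0-5]` at the end". A 7, 8, or 9 anywhere in `year` (including the last digit of `_17`, `_18`, or `_19`) is then enough for "A", and `case_when()` keeps the first condition that is TRUE. Below is a minimal sketch of a regex-only fix that groups the alternatives in parentheses and ties them to the underscore. The sample values in `data` are hypothetical, since the real values of `year` are not shown in the post.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data; the real values of `year` are not shown in the post.
data <- tibble(year = c("x_7", "x_15", "x_16", "x_19", "x_24"))

# The original pattern matches "x_17" because the unanchored [7-9]
# alternative finds the trailing 7.
original_matches_17 <- str_detect("x_17", "[7-9]|1[0-5]$")

result <- data |>
  mutate(group = case_when(
    # Parentheses make `$` apply to both alternatives; the leading
    # underscore stops "_17" from matching on its last digit alone.
    str_detect(year, "_([7-9]|1[0-5])$") ~ "A",
    str_detect(year, "_(1[6-9]|2[0-4])$") ~ "B"
  ))

result
```

If the strings contain other digits before the underscore, the same grouping still applies, because both alternatives are now anchored between `_` and the end of the string.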