Search code examples
rregextidyrregex-look-ahead

separate_wider_regex with lookahead


I have a dataframe with sporting events (with no assumptions about the number of spaces or words) with an optional year, that can be formatted in a few different ways.

tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221"))

How can I use tidyr::separate_wider_regex to split event_optional_year into two columns event and year? I want event in this case to be stripped of the optional year, and year equal to NA, 12, 2016 and 2020/2021, respectively.

I tried fiddling with positive lookahead in the regex:

tibble::tibble(event_optional_year = c("Olympics", "Olympics 12", "Olympics 2016", "Olympics 2020/221")) |> 
    tidyr::separate_wider_regex(
      "event_optional_year",
      c(
        event = ".*(?=(?:\\d.*\\d$)?)",
        year = "\\d.*\\d$"
      ),
      too_few = "align_start"
    )

but this gives as result:

  event                 year 
  <chr>                 <chr>
1 "World Championships" NA   
2 "Summer Olympics "    12   
3 "Olympics 20"         16   
4 "Olympics 2020/2"     21 

Question: which regex does give me the desired result?


Solution

  • Unnamed patterns in separate_wider_regex() simplify this situation a bit. event = ".*" is greedy and matches everything before "\\s+(?=\\d)" -- any number of whitespace that is followed by a digit (assuming that year-part starts with a digit). This handles spaces in event but assumes there are none in year.

    library(dplyr)
    library(tidyr)
    tibble(event_optional_year = c("World Championships", 
                                   "Summer Olympics 12", 
                                   "Olympics 2016", 
                                   "Olympics 2020/221")) %>% 
      separate_wider_regex(event_optional_year, 
                           c(event = ".*", "\\s+(?=\\d)", year = ".*$") , 
                           too_few = "align_start")
    #> # A tibble: 4 × 2
    #>   event               year    
    #>   <chr>               <chr>   
    #> 1 World Championships <NA>    
    #> 2 Summer Olympics     12      
    #> 3 Olympics            2016    
    #> 4 Olympics            2020/221
    

    Created on 2023-06-25 with reprex v2.0.2