I have a dataframe with sporting events (with no assumptions about the number of spaces or words) with an optional year, that can be formatted in a few different ways.
tibble::tibble(event_optional_year = c("World Championships", "Summer Olympics 12", "Olympics 2016", "Olympics 2020/221"))
How can I use tidyr::separate_wider_regex
to split event_optional_year
into two columns event
and year
? I want event
in this case to be stripped of the optional year, and year
equal to NA
, 12
, 2016
and 2020/2021
, respectively.
I tried fiddling with positive lookahead in the regex:
tibble::tibble(event_optional_year = c("Olympics", "Olympics 12", "Olympics 2016", "Olympics 2020/221")) |>
tidyr::separate_wider_regex(
"event_optional_year",
c(
event = ".*(?=(?:\\d.*\\d$)?)",
year = "\\d.*\\d$"
),
too_few = "align_start"
)
but this gives as result:
event year
<chr> <chr>
1 "World Championships" NA
2 "Summer Olympics " 12
3 "Olympics 20" 16
4 "Olympics 2020/2" 21
Question: which regex does give me the desired result?
Unnamed patterns in separate_wider_regex()
simplify this situation a bit. event = ".*"
is greedy and matches everything before "\\s+(?=\\d)"
-- any number of whitespace that is followed by a digit (assuming that year-part starts with a digit). This handles spaces in event but assumes there are none in year.
library(dplyr)
library(tidyr)
tibble(event_optional_year = c("World Championships",
"Summer Olympics 12",
"Olympics 2016",
"Olympics 2020/221")) %>%
separate_wider_regex(event_optional_year,
c(event = ".*", "\\s+(?=\\d)", year = ".*$") ,
too_few = "align_start")
#> # A tibble: 4 × 2
#> event year
#> <chr> <chr>
#> 1 World Championships <NA>
#> 2 Summer Olympics 12
#> 3 Olympics 2016
#> 4 Olympics 2020/221
Created on 2023-06-25 with reprex v2.0.2