Search code examples
rdataframedplyrdatasetdata-cleaning

Extract number within brackets at the end of a string - (YYYY)


I am trying to create two separate columns in R. The issue I am running into is that the year does not separate from the previous column depending on the type of observation.

Some names in the data frame are just first names and others have first and last. I am trying to have the Name display first and first/last and Year be separate from the current Name column.

Fake data = Employee and the year they started employment

Create data frame

library(tidyverse)

dat <- tibble(Name = c("Percy Vere (2020)", "Ginger Plant (2017)", "Perry (2019)",
                    "Pat Thettick (2020)", "Samuel (2022)", "Fay Daway (2008)",
                    "Greg (2022)", "Simon Sais (2011)"))
# A tibble: 8 x 1
  Name               
  <fct>              
1 Percy Vere (2020)  
2 Ginger Plant (2017)
3 Perry (2019)       
4 Pat Thettick (2020)
5 Samuel (2022)      
6 Fay Daway (2008)   
7 Greg (2022)        
8 Simon Sais (2011) 

Attempt separating the column into two: Name and Year

dat %>% 
  select_all() %>% 
  separate(col = Name, into = c('Name', 'Year')) %>%    # sep = ',' and ';' does not create a fix 
  tibble()

# A tibble: 8 x 2
  Name   Year    
  <chr>  <chr>   
1 Percy  Vere    
2 Ginger Plant   
3 Perry  2019    
4 Pat    Thettick
5 Samuel 2022    
6 Fay    Daway   
7 Greg   2022    
8 Simon  Sais    
Warning message:
Expected 2 pieces. Additional pieces discarded in 8 rows [1, 2, 3, 4, 5, 6, 7, 8]. 


Solution

  • Use extract:

    dat %>% 
       extract(Name, c("Name", "Year"), "(.*) .*?(\\d+)")
    # A tibble: 8 × 2
      Name         Year 
      <chr>        <chr>
    1 Percy Vere   2020 
    2 Ginger Plant 2017 
    3 Perry        2019 
    4 Pat Thettick 2020 
    5 Samuel       2022 
    6 Fay Daway    2008 
    7 Greg         2022 
    8 Simon Sais   2011
    

    Useseparate:

    dat %>%
       separate(Name, c("Name", "Year"), " \\(|\\)", extra = 'drop')
    
    # A tibble: 8 × 2
      Name         Year 
      <chr>        <chr>
    1 Percy Vere   2020 
    2 Ginger Plant 2017 
    3 Perry        2019 
    4 Pat Thettick 2020 
    5 Samuel       2022 
    6 Fay Daway    2008 
    7 Greg         2022 
    8 Simon Sais   2011 
    

    use str_extract:

    dat %>% 
       mutate(Year = str_extract(Name, '\\d+'), Name = str_extract(Name, "\\D+(?= )"))
    # A tibble: 8 × 2
      Name         Year 
      <chr>        <chr>
    1 Percy Vere   2020 
    2 Ginger Plant 2017 
    3 Perry        2019 
    4 Pat Thettick 2020 
    5 Samuel       2022 
    6 Fay Daway    2008 
    7 Greg         2022 
    8 Simon Sais   2011 
    

    Using str_remove

    dat %>% 
        mutate(Year = str_remove_all(Name, '\\D+'), Name = str_remove(Name, " [(]\\d+[)]"))
    # A tibble: 8 × 2
      Name         Year 
      <chr>        <chr>
    1 Percy Vere   2020 
    2 Ginger Plant 2017 
    3 Perry        2019 
    4 Pat Thettick 2020 
    5 Samuel       2022 
    6 Fay Daway    2008 
    7 Greg         2022 
    8 Simon Sais   2011