
Extracting mixed date from string in R


I have a character vector that looks like the table below. I would like to extract the date from each element and convert it with as.Date. For example, row one would be 09-11-2021 (November 9, 2021). The last number in each string is the number of comments and is not part of the date.

   <chr>                                                                       
 1 By Leigh-Ann Butler, Shannon Cobb, Michael R. DonaldsonNov 9, 20213 Comments
 2 By Leigh-Ann Butler, Shannon Cobb, Michael R. DonaldsonNov 8, 20212 Comments
 3 By Rick AndersonNov 4, 202114 Comments                                      
 4 By Victoria Ficarra, Rob JohnsonNov 3, 20215 Comments                       
 5 By Roger C. SchonfeldNov 1, 202123 Comments                                 
 6 By Joseph EspositoOct 29, 20211 Comment                                     
 7 By Brigitte ShullOct 20, 20216 Comments

example.data <- c("By Leigh-Ann Butler, Shannon Cobb, Michael R. DonaldsonNov 9, 20213 Comments",
                  "By Leigh-Ann Butler, Shannon Cobb, Michael R. DonaldsonNov 8, 20212 Comments",
                  "By Rick AndersonNov 4, 202114 Comments",
                  "By Victoria Ficarra, Rob JohnsonNov 3, 20215 Comments")


Solution

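# Capture the month abbreviation, the day, and the four-digit year.
# \\d{4} stops after the year, so the trailing comment count is ignored.
# The native pipe |> requires R >= 4.1, and %b is matched against the
# month abbreviations of the current locale.
strcapture(".*(\\D{3})\\s+(\\d{1,2}),\\s+(\\d{4}).*",
           example.data, proto = list(mon="", day=0L, year=0L)) |>
  transform(date = as.Date(paste(mon, day, year), format = "%b %d %Y"))
#   mon day year       date
# 1 Nov   9 2021 2021-11-09
# 2 Nov   8 2021 2021-11-08
# 3 Nov   4 2021 2021-11-04
# 4 Nov   3 2021 2021-11-03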