Search code examples
rreplacedata-cleaningstringr

Removing dates ( in any format) form a text column


Hope everyone is well. In my dataset there is column including free texts. My goal is to remove all dates in any format form the text. this is a snapshot of the data

df <- data.frame(
  text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
         'why not going there on 2023/01/14','2023 08 01 will be great'))
df %>% select(text)

                               text
1            tommorow is 2022 11 03
2           I married on 2020-01-01
3 why not going there on 2023/01/14
4          2023 08 01 will be great

The outcome should look like

               text
1            tommorow is 
2            I married on 
3            why not going there on 
4            will be great

Thank you!


Solution

  • Best approach would perhaps be to have a sensitive regex pattern:

    df <- data.frame(
      text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
             'why not going there on 2023/01/14','2023 08 01 will be great'))
    
    library(tidyverse)
    
    df |>
      mutate(left_text = str_trim(str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}")))
    
    #>                                text              left_text
    #> 1            tommorow is 2022 11 03            tommorow is
    #> 2           I married on 2020-01-01           I married on
    #> 3 why not going there on 2023/01/14 why not going there on
    #> 4          2023 08 01 will be great          will be great
    

    This will match dates by:

    • \\d{1,4} = starting with either month (1-2 numeric characters), day (1-2 characters) or year (2-4 characters); followed by
    • \\D = anything that's not a number, i.e. the separator; followed by
    • \\d{1,2} = day or month (1-2 chars); followed by
    • \\D again; ending with
    • \\d{1,4} = day or year (1-2 or 2-4 chars)

    The challenge is balancing sensitivity with specificity. This should not take out numbers which are clearly not dates, but might miss out:

    • dates with no year
    • dates with no separators
    • dates with double spaces between parts

    But hopefully should catch every sensible date in your text column!

    Further date detection examples:

    library(tidyverse)
    
    df <- data.frame(
      text = c(
        'tommorow is 2022 11 03',
        "I married on 2020-01-01",
        'why not going there on 2023/01/14',
        '2023 08 01 will be great',
        'A trickier example: January 05,2020',
        'or try Oct 2010',
        'dec 21/22 is another date'
      )
    )
    
    
    df |>
      mutate(left_text = str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}") |> 
               str_remove(regex(paste0("(", paste(month.name, collapse = "|"),
                                       ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                                ignore_case = TRUE)) |> 
               str_remove(regex(paste0("(", paste(month.abb, collapse = "|"),
                                       ")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
                                ignore_case = TRUE)) |> 
               str_trim())
    
    #>                                  text              left_text
    #> 1              tommorow is 2022 11 03            tommorow is
    #> 2             I married on 2020-01-01           I married on
    #> 3   why not going there on 2023/01/14 why not going there on
    #> 4            2023 08 01 will be great          will be great
    #> 5 A trickier example: January 05,2020    A trickier example:
    #> 6                     or try Oct 2010                 or try
    #> 7           dec 21/22 is another date        is another date
    

    Final Edit - doing replace with temporary placeholders

    The following code should work on a wide range of date formats. It works by replacing in a specific order so as not to accidentally chop out bits of some dates. Gluing together pre-made regex patterns to hopefully give a clearer idea as to what each bit is doing:

    library(tidyverse)
    
    df <- data.frame(
      text = c(
        'tommorow is 2022 11 03',
        "I married on 2020-01-01",
        'why not going there on 2023/01/14',
        '2023 08 01 will be great',
        'A trickier example: January 05,2020',
        'or try Oct 26th 2010',
        'dec 21/22 is another date',
        'today is 2023-01-29 & tomorrow is 2022 11 03 & 2022-12-01',
        'A trickier example: January 05,2020',
        '2020-01-01 I married on 2020-12-01',
        'Adding in 1st December 2018',
        'And perhaps Jul 4th 2023'
      )
    )
    
    
    
    r_year <- "\\d{2,4}"
    r_day <- "\\d{1,2}(\\w{1,2})?"  # With or without "st" etc.
    r_month_num <- "\\d{1,2}"
    r_month_ab <- paste0("(", paste(month.abb, collapse = "|"), ")") 
    r_month_full <- paste0("(", paste(month.name, collapse = "|"), ")") 
    r_sep <- "[^\\w]+"  # The separators can be anything but letters
    
    library(glue)
    
    df |>
      mutate(
        text = 
          # Any numeric day/month/year
          str_replace_all(text,
                          glue("{r_day}{r_sep}{r_month_num}{r_sep}{r_year}"),
                          "REP_DATE") |> 
          # Any numeric month/day/year
          str_replace_all(glue("{r_month_num}{r_sep}{r_day}{r_sep}{r_year}"),
                          "REP_DATE") |> 
          # Any numeric year/month/day
          str_replace_all(glue("{r_year}{r_sep}{r_month_num}{r_sep}{r_day}"),
                          "REP_DATE") |> 
          # Any day[th]/monthname/year or monthname/day[th]/year
          str_replace_all(regex(paste0(
            glue("({r_day}{r_sep})?({r_month_full}|{r_month_ab})", 
                 "{r_sep}({r_day}{r_sep})?{r_year}")
            ), ignore_case = TRUE),
                               "REP_DATE") |> 
          # And transform all placeholders to required date
          str_replace_all("REP_DATE", "25th October 2022")
        )
    #>                                                                              text
    #> 1                                                   tommorow is 25th October 2022
    #> 2                                                  I married on 25th October 2022
    #> 3                                        why not going there on 25th October 2022
    #> 4                                                 25th October 2022 will be great
    #> 5                                           A trickier example: 25th October 2022
    #> 6                                                        or try 25th October 2022
    #> 7                                               25th October 2022 is another date
    #> 8  today is 25th October 2022 & tomorrow is 25th October 2022 & 25th October 2022
    #> 9                                           A trickier example: 25th October 2022
    #> 10                               25th October 2022 I married on 25th October 2022
    #> 11                                                    Adding in 25th October 2022
    #> 12                                                  And perhaps 25th October 2022
    

    This should catch all the most common ways of writing dates, even with added "st"s "nd"s and "th"s after day number and irrespective of ordering of parts (apart from any format which puts "year" in the middle between "day" and "month", but that seems unlikely).