Search code examples
pythonrdate

Parse Messy Dates from an Object Column and Replace with Formatted Dates


I'm currently new to R and I'm trying to make a script for cleaning messy dates. Basically, I have a linelist of deaths and it has a column called "Date of Death". This column contains dates that are not uniformly formatted so I'd like to parse and format the dates into a uniform date format however, I'm getting the following error (cleanlist object not found) however, when I manually try to run the cleanlist code, it runs without any issues. I just get the issue below if I try to run the whole script. Also, I want to know if how would I properly use the mutate function to update the date_of_death column in the clean_list table/object.

# Loading packages --------------------------------------------------------

# Checks if package is installed, installs if necessary, and loads package for current session

pacman::p_load(
  lubridate,  # general package for handling and converting dates  
  parsedate,  # has function to "guess" messy dates
  here,       # file management
  rio,        # data import/export
  janitor,    # data cleaning
  epikit,     # for creating age categories
  tidyverse,  # data management and visualization 
  magrittr,
  dplyr,
  reprex,     # minimal repr example
  datapasta   # sample data
)

datecolumn <- "date_of_death"
  
# Data Import -------------------------------------------------------------

  linelist <- data.frame(
    stringsAsFactors = FALSE,
    check.names = FALSE,
    `Date of Death` = c("45236","45212","45152",
                        "JANUARY 19, 2023","June 25, 2023","45200","45164",
                        "5/16/2023","45277","44930"))
  
# Data Cleaning -----------------------------------------------------------

cleanlist <- linelist %>%      # the raw dataset
  clean_names() %>%            # automatically clean column names
  
# Format Dates ------------------------------------------------------------
  
  # parse the date values
  mutate(parsedate::parse_date(cleanlist[[datecolumn]]))
#> Error in `mutate()`:
#> ℹ In argument: `parsedate::parse_date(cleanlist[[datecolumn]])`.
#> Caused by error:
#> ! object 'cleanlist' not found

Created on 2024-08-12 with reprex v2.1.1


Solution

  • A base R approach. grepl identifies non-numeric dates by looking for , or /.

    data.frame(df,
               cleaned_date_of_death = as.Date(sapply(df$'Date of Death', \(x) 
      ifelse(grepl(",|/", x), as.Date(x, tryFormats = c("%m/%d/%Y", "%B %d, %Y")),
                              as.Date(as.numeric(x), origin = "1899-12-30")))),
      row.names = NULL)
          Date.of.Death cleaned_date_of_death
    1             45236            2023-11-06
    2             45212            2023-10-13
    3             45152            2023-08-14
    4  JANUARY 19, 2023            2023-01-19
    5     June 25, 2023            2023-06-25
    6             45200            2023-10-01
    7             45164            2023-08-26
    8         5/16/2023            2023-05-16
    9             45277            2023-12-17
    10            44930            2023-01-04