Search code examples
rdatetimeunify

How to unify Date & Time format when there inconsistency exists across a single column


I am working on a data frame that has a column that uses the date and time format as "mm/dd/yy HH:MM" for some observations and "yyyy/mm/dd HH:MM:SS" format for other observations, of course, this inconsistency results in errors or NA returns in my code, how can I unify the whole column so my calculations are not interrupted by this inconsistency?

enter image description here


Solution

  • Update step by step:

    1. We have a data frame with for columns each of them have character type columns (you can check this with str(df)

    2. In order to change the format from character to datetime in all four columns we use mutate(across(1:4, ...

    3. What we want is that in each column 1:4 the character type is changed to datetime

    4. this can be done with the function parse_date_time from lubridate package

    5. Here we use ~ to indicate an anonymous function

    6. the . indicates column 1-4.

    7. and most important the argument c("ymd_HMS", "mdy_HM") which gives the order of the different formats of the date columns!

    We could use parse_date_time() from lubridate package. Important is the argument c("ymd_HMS", "mdy_HM"). Here you define the occurence of the different formats:

    and note to use HMS , because:

    hms, hm and ms usage is defunct, please use HMS, HM or MS instead. Deprecated in version '1.5.6'.
    
    library(dplyr)
    library(lubridate)
    
    df %>% 
      mutate(across(1:4, ~parse_date_time(., c("ymd_HMS", "mdy_HM"))))
    
      started_at          ended_at            started_at_1        ended_at_1         
      <dttm>              <dttm>              <dttm>              <dttm>             
    1 2021-10-29 17:42:36 2021-10-29 18:00:23 2021-06-13 11:40:00 2021-06-13 12:02:00
    2 2021-10-01 15:06:10 2021-10-01 15:09:23 2021-06-27 16:26:00 2021-06-27 16:39:00
    3 2021-10-28 23:02:53 2021-10-28 23:07:11 2021-06-10 20:06:00 2021-06-10 20:28:00
    4 2021-10-17 00:58:17 2021-10-17 01:02:08 2021-06-11 15:54:00 2021-06-11 16:11:00
    5 2021-10-27 18:29:34 2021-10-27 18:34:48 2021-06-05 14:09:00 2021-06-05 14:42:00
    6 2021-10-17 13:30:21 2021-10-17 13:35:26 2021-06-05 14:14:00 2021-06-05 14:37:00
    7 2021-10-04 19:59:28 2021-10-04 21:06:24 2021-06-16 19:05:00 2021-06-16 19:16:00
    8 2021-10-10 00:27:09 2021-10-10 00:39:58 2021-06-23 20:29:00 2021-06-23 20:43:00
    

    data:

    structure(list(started_at = c("2021-10-29 17:42:36", "2021-10-01 15:06:10", 
    "2021-10-28 23:02:53", "2021-10-17 00:58:17", "2021-10-27 18:29:34", 
    "2021-10-17 13:30:21", "2021-10-04 19:59:28", "2021-10-10 00:27:09"
    ), ended_at = c("2021-10-29 18:00:23", "2021-10-01 15:09:23", 
    "2021-10-28 23:07:11", "2021-10-17 01:02:08", "2021-10-27 18:34:48", 
    "2021-10-17 13:35:26", "2021-10-04 21:06:24", "2021-10-10 00:39:58"
    ), started_at_1 = c("6/13/21 11:40", "6/27/21 16:26", "6/10/21 20:06", 
    "6/11/21 15:54", "6/5/21  14:09", "6/5/21  14:14", "6/16/21 19:05", 
    "6/23/21 20:29"), ended_at_1 = c("6/13/21 12:02", "6/27/21 16:39", 
    "6/10/21 20:28", "6/11/21 16:11", "6/5/21  14:42", "6/5/21  14:37", 
    "6/16/21 19:16", "6/23/21 20:43")), class = "data.frame", row.names = c(NA, 
    -8L))