Search code examples
rstringtimezonelubridate

Transform string in date through Lubridate with variation in month, day, year hour min am/pm and time zone


I need some help with a lubridate function over different time zones. I have two vectors of the kind:

date1 = c("February 11th 2017, 6:05am PST", "April 24th 2018, 4:09pm PDT") 
date2 = c("2013-12-14 00:58:00 CET", "2013-06-19 18:00:00 CEST")

I would like to use lubridate functions (I tried mdy_hm) to transform these strings into date format, and then take the difference (in days) across the two strings while taking into account the difference in time zone, where D in PDT stands for Day Light and S in PST stands for Standard time zone for Pacific time (https://www.timeanddate.com/time/zones/pdt and https://www.timeanddate.com/time/zones/pst) and similarly for CET (https://time.is/CET) and CEST (https://time.is/CEST). Could you please help me?


Solution

  • First thing I did was to setup a tibble with your 2 date vectors

    tibble(
      date1 = c("February 11th 2017, 6:05am PST", "April 24th 2018, 4:09pm PDT"),
      date2 = c("2013-12-14 00:58:00 CET", "2013-06-19 18:00:00 CEST"),
    ) %>%
      {. ->> my_dates}
    
    my_dates
    
    # # A tibble: 2 x 2
    # date1                          date2                   
    # <chr>                          <chr>                   
    # February 11th 2017, 6:05am PST 2013-12-14 00:58:00 CET 
    # April 24th 2018, 4:09pm PDT    2013-06-19 18:00:00 CEST
    

    Then, make a tibble of the timezone abbreviations and their offset from UTC

    # setup timezones and UTC offsets
    tribble(
      ~tz, ~offset,
      'PST', -8,
      'PDT', -7,
      'CET', +1,
      'CEST', +2
    ) %>%
      {. ->> my_tz}
    
    my_tz
    
    # # A tibble: 4 x 2
    # tz    offset
    # <chr>  <dbl>
    # PST       -8
    # PDT       -7
    # CET        1
    # CEST       2
    

    Then, we tidy the datetimes up by removing the character suffix after the day number in date1 (the 'th' bit after '11th'). We also pull out the timezone code and put that in a separate column; the timezone column allows us to left_join() my_tz in, giving us the UTC offset.

    We use string-handling functions from the stringr package, and regex expressions to find, extract and replace the components. A very handy tool for testing regex patterns can be found here https://regex101.com/r/5pr3LL/1/

    my_dates %>%
      mutate(
        # remove the character suffix after the day number (eg 11th)
        day_suffix = str_extract(date1, '[0-9]+[a-z]+') %>% str_extract('[a-z]+'),
        date1 = str_replace(date1, day_suffix, ''),
        day_suffix = NULL,
    
        # extract timezone info
        date1_tz = str_extract(date1, '[a-zA-Z]+$'),
        date2_tz = str_extract(date2, '[a-zA-Z]+$'),
      ) %>%
    
      # join in timezones for date1
      left_join(my_tz, by = c('date1_tz' = 'tz')) %>%
      rename(
        offset_date1 = offset
      ) %>%
    
      # join in timezones for date2
      left_join(my_tz, by = c('date2_tz' = 'tz')) %>%
      rename(
        offset_date2 = offset
      ) %>% 
      
      {. ->> my_dates_info}
    
    my_dates_info
    
    # # A tibble: 2 x 6
    # date1                        date2                    date1_tz date2_tz offset_date1 offset_date2
    # <chr>                        <chr>                    <chr>    <chr>           <dbl>        <dbl>
    # February 11 2017, 6:05am PST 2013-12-14 00:58:00 CET  PST      CET                -8            1
    # April 24 2018, 4:09pm PDT    2013-06-19 18:00:00 CEST PDT      CEST               -7            2
    

    So now, we can use lubridate::as_datetime() to convert date1 and date2 to dttm (datetime) format. as_datetime() takes a character-format datetime and converts it to datetime format. You must specify the format of the character string using symbols and abbreviations explained here. For example, here we use %B to refer to the full name of the month, %d is the day number and %Y is the (4-digit) year number etc.

    Note: because we don't specify the timezone inside as_datetime(), the underlying timezone stored with these datetimes defaults to UTC (as seen by using tz()). This is why we call these columns date*_orig, to remind us the timezone is the original datetime's timezone. Then we add the offset to the datetime object, so we now have these times in UTC (and the underlying timezone signature of these values is UTC, so that's ideal).

    # now define datetimes in local and UTC timezones (note: technically the tz is UTC for both)
    my_dates_info %>% 
      mutate(
        date1_orig = as_datetime(date1, format = '%B %d %Y, %I:%M%p '),
        date1_utc = date1_orig + hours(offset_date1),
        date2_orig = as_datetime(date2, format = '%Y-%m-%d %H:%M:%S'),
        date2_utc = date2_orig + hours(offset_date2),
      ) %>% 
      {. ->> my_dates_utc}
    
    my_dates_utc
    
    # # A tibble: 2 x 10
    # date1                        date2                    date1_tz date2_tz offset_date1 offset_date2 date1_orig          date1_utc           date2_orig          date2_utc          
    # <chr>                        <chr>                    <chr>    <chr>           <dbl>        <dbl> <dttm>              <dttm>              <dttm>              <dttm>             
    # February 11 2017, 6:05am PST 2013-12-14 00:58:00 CET  PST      CET                -8            1 2017-02-11 06:05:00 2017-02-10 22:05:00 2013-12-14 00:58:00 2013-12-14 01:58:00
    # April 24 2018, 4:09pm PDT    2013-06-19 18:00:00 CEST PDT      CEST               -7            2 2018-04-24 16:09:00 2018-04-24 09:09:00 2013-06-19 18:00:00 2013-06-19 20:00:00
    

    Now that we have both sets of dates in datetime format, and in the same timezone, we can calculate time differences between them.

    # now calculate difference between them
    my_dates_utc %>% 
      select(date1_utc, date2_utc) %>% 
      mutate(
        difference_days = interval(start = date1_utc, end = date2_utc) %>% time_length(unit = 'days')
      )
    
    # # A tibble: 2 x 3
    # date1_utc           date2_utc           difference_days
    # <dttm>              <dttm>                        <dbl>
    # 2017-02-10 22:05:00 2013-12-14 01:58:00          -1155.
    # 2018-04-24 09:09:00 2013-06-19 20:00:00          -1770.
    

    This should be fine for small-scale operations. If you had more than 2 different datetime format vectors, it would be worth considering a more complex operation where you transform the data from wide to long format. This would save repeating the same/similar code for each column, like we have done for date1 and date2 in this example.