Search code examples
pythonpandasdataframedatetimedst

Pandas DataFrame convert data from BST/Clock to GMT/UTC and vice versa


I’m hoping although this seems complex that someone has solved something similar as it is essentially handling daylight savings time.

I have some equipment that records temperature. Some probes record time in GMT/UTC and others record the time in BST/Clock.

The probes record temperature data every hour so for probes that record data GMT/UTC, the data has 24 columns with column 0 corresponding to 00:00, column 1 to 01:00 etc every day of the year.

In the UK the clocks go forward 1 hour at 1am on the last Sunday in March, and back 1 hour at 2am on the last Sunday in October.

For the probes that record data in BST/Clock time, when the clocks go forward, there are only 23 hourly periods of data instead of the normal 24. When the clocks go back there are 25 hourly Periods instead 24.

There are 3 parts to the question. The first is I want to standardise the data in one dataframe in GMT/UTC format so on the day when the clocks go forward and I have 23 readings, I need to move the reading in the first column in the next day/row to column 24 of the previous day. I then need to keep moving all column data, one place to the left. I need to repeat this process of moving the first column of a row to the last column of the previous row and then moving all other columns in the row across 1 place up to and including until the last Sunday in October where there are 25 readings. Here’s an example of what I’m trying to achieve:

# BST/Clock Data Format 
bst = [{0:9, 1:6, 2:7, 3:4, 22:2, 23:1},
        {0:10, 1:12, 2:9, 3:8, 22:16},
        {0:11, 1:9, 2:8, 3:12, 22:15, 23:16}, 
        {0:1, 1:6, 2:5, 3:7, 22:6, 23:8, 24:9},
        {0:2, 1:2, 2:4, 3:4, 22:3, 23:2}] 


# Create BST/Clock df
df_bst = pd.DataFrame(bst, index=['24/03/2018', '25/03/2018', '26/03/2018', '28/10/2018', '29/10/2018'])
df_bst.index = pd.to_datetime(df_bst.index, dayfirst=True)


# Format of what GMT/UTC data should look like
gmt = [{0:9, 1:6, 2:7, 3:4, 22:2, 23:1},
        {0:10, 1:12, 2:9, 3:8, 22:16, 23:11},
        {0:9, 1:8, 2:12, 3:15, 22:16, 23:5}, 
        {0:6, 1:5, 2:7, 3:6, 22:8, 23:9},
        {0:2, 1:2, 2:4, 3:4, 22:3, 23:2}] 

df_gmt = pd.DataFrame(gmt, index=['24/03/2018', '25/03/2018', '26/03/2018', '28/10/2018', '29/10/2018'])
df_gmt.index = pd.to_datetime(df_gmt.index, dayfirst=True)

print('Initial format')
print(df_bst)
print()
print('What data should look like after translation with the last Sunday in Mar 2018 and Oct 2018 being 25/03/2018 and 28/10/2018 respectively')
print(df_gmt)

The second part is I would like to calculate the last Sunday in March and October for the data in the dataframe. I was working on something like:

for month in (3, 10):
    last_sunday = max([week[-1] for week in calendar.monthcalendar(2018, month)])
    x = datetime.datetime(2018, month, last_sunday)
    print(x.strftime('%d/%m/%Y'))

And extracting the year using the DatetimeIndex.year but there might be times when they dataset spans more than a year so I would need to apply the transformation between 31/03/2018 - 28/10/2018 and then again between 31/03/2019 - today.

The last part of the question is occasionally I would like to report on average temperature between 14:00 BST/Clock - 18:00 BST/Clock. What would be the best way to do this given my data is stored in GMT/UTC?

I’ve tried different options including localize but to no avail.


Solution

  • Here goes my attempt to answer all three questions. Perhaps all you need is some logic and use a timezone package, such as ptyz, which helps you understand if the timezone is in place or not, as opposed to try to calculate it yourself. Specifically there are two timezones that might be of interest for you:

    import pytz
    
    utc = pytz.utc #utc all year round
    btc = pytz.timezone('Europe/London') #utc with the time difference already there.
    

    Depending on the time of the year (btc or not), the dates are identical or not:

    dt1 = datetime.datetime.strptime('2018-03-24','%Y-%m-%d')
    dt2 = datetime.datetime.strptime('2018-03-26','%Y-%m-%d')
    
    print ('Different:', utc.localize(dt1), btc.localize(dt1))
    print ('Identical:', utc.localize(dt2), btc.localize(dt2))
    
    [out]:
    Different: 2018-03-24 00:00:00+00:00 2018-03-24 00:00:00+00:00
    Identical: 2018-03-26 00:00:00+00:00 2018-03-26 00:00:00+01:00
    
    

    Using those two timezones, you can then compare if the date is the same under both timezones to identify the start and period of BTC. For instance:

    import pytz
    import pandas as pd
    import datetime
    
    utc = pytz.utc
    btc = pytz.timezone('Europe/London')
    
    bst = [{0:9, 1:6, 2:7, 3:4, 22:2, 23:1},
            {0:10, 1:12, 2:9, 3:8, 22:16},
            {0:11, 1:9, 2:8, 3:12, 22:15, 23:16}, 
            {0:1, 1:6, 2:5, 3:7, 22:6, 23:8, 24:9},
            {0:2, 1:2, 2:4, 3:4, 22:3, 23:2}]
    
    df = pd.DataFrame(bst)
    
    df['dates'] =  ['24/03/2018', '25/03/2018', '26/03/2018', '28/10/2018', '29/10/2018']
    
    # date on utc
    df['dates_utc'] = df['dates'].apply(lambda x: utc.localize(datetime.datetime.strptime(x,'%d/%m/%Y')))
    
    # date on Europe/London
    df['dates_wdtz'] = df['dates'].apply(lambda x: btc.localize(datetime.datetime.strptime(x,'%d/%m/%Y')))
    
    # check if is a btc day
    df['is_btc'] = df['dates_utc'] > df['dates_wdtz']
    

    Then we can create a temporary flag for the first btc day, as it is the only day where only the last hour is modified:

    df['btc_starts'] = df['is_btc'].shift(-1)
    df['btc_first_day'] = (df['is_btc']==False) & (df['btc_starts'] == True)
    

    And modify that specific day:

    ix_first_day = df[df['btc_first_day']==True].index
    df.loc[ix_first_day, 23] = int(df.loc[ix_first_day+1, 0])
    

    For all other days we can then simply apply the -1 logic to all hour columns:

    btc_days = df[df['is_btc']==True].index
    
    for hour in range(0,25,1):
        if hour == 24:
            df.loc[btc_days, hour] = df.loc[btc_days + 1, hour]
        else:
            df.loc[btc_days, hour] = df.loc[btc_days, hour+1]
    
    ## drop temporary columns
    df.drop(['dates_utc','dates_wdtz','is_btc','btc_starts','btc_first_day'], axis=1, inplace=True)
    

    This will provide us with the following result:

    Out[15]: 
        0   1   2   3    22    23   24       dates
    0   9   6   7   4   2.0   1.0  NaN  24/03/2018
    1  10  12   9   8  16.0  11.0  NaN  25/03/2018
    2  12  12  12  12  16.0   NaN  NaN  26/03/2018
    3   7   7   7   7   8.0   9.0  9.0  28/10/2018
    4   2   2   4   4   3.0   2.0  NaN  29/10/2018
    

    Note that I did not use the dates as the index, and I am assuming you have information for all dates, and therefore index + 1 is always the next date. If that is not the case then you loop though the index using dates, and instead of add 1, add 1 day to datetime.

    With the above: - You can automatically understand if a date is on btc or not - You can reformat the date if you take the dates and change the datetime, such as mydate.astimezone(btc). See more at the pytz docs - You do not need to calculate when the timezone will occur.