Tags: python, datetime, pandas-groupby, dst

Group by hour whilst taking account of daylight savings


I am looking at shift data for a factory that runs 24 hours a day. I want to group the data at each shift change, which happens at 6:00 and 18:00. So far I have been trying to do it with:

Data_Frame.groupby([pd.Grouper(freq='12H')]).count() 

However, I have realised that since freq is set to 12H, it always takes fixed 12-hour periods, even across the daylight-saving changes.

Unfortunately, the shift change is always at 6:00 and 18:00 local time, even when the clocks change. That means that in reality one shift in the year is 11 hours long and another is 13 hours long, so during the middle part of the year the grouping is off by one hour.

Daylight saving feels like such a fundamental thing that there should be some way of telling pandas to take it into account.

I have tried converting the timestamps from UTC to Europe/London, but it still takes fixed 12-hour periods.
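
For reference, here is a minimal sketch of what I mean (made-up hourly data in Europe/London around the March 2018 clock change; the real data is different, and origin='start' needs pandas 1.1 or later and is only there to anchor the first bin at 06:00):

    import pandas as pd

    # made-up hourly readings spanning the spring clock change (25 March 2018)
    idx = pd.date_range('2018-03-24 06:00', '2018-03-27 06:00',
                        freq='H', tz='Europe/London')
    Data_Frame = pd.DataFrame({'value': range(len(idx))}, index=idx)

    # fixed 12-hour bins anchored on the first timestamp (06:00)
    counts = Data_Frame.groupby(pd.Grouper(freq='12H', origin='start')).count()

    # after the clocks go forward, the bin edges no longer fall on
    # 06:00/18:00 local time, which is exactly the problem
    print(counts)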

Many Thanks

edit:

The only way I have found to do this is to split my data into three parts before using groupby (before the first clock change, between the clock changes, and after the second clock change), run groupby on each part individually, and then put them back together. This is irritating and tedious, so anything better than this would be hugely appreciated.


Solution

  • Hourly and 10-minute time-zone-aware time series spanning the spring DST change:

    import numpy as np
    import pandas as pd

    ts_hrly = pd.date_range('03-10-2018', '3-13-2018', freq='H', tz='US/Eastern')
    ts_10m = pd.date_range('03-10-2018', '3-13-2018', freq='10T', tz='US/Eastern')
    

    Use the hourly data

    ts = ts_hrly
    df = pd.DataFrame({'tstamp':ts,'period':range(len(ts))})
    

    The dst transition looks like this:

    >>> df[18:23]
        period                    tstamp
    18      18 2018-03-11 00:00:00-05:00
    19      19 2018-03-11 01:00:00-05:00
    20      20 2018-03-11 03:00:00-04:00
    21      21 2018-03-11 04:00:00-04:00
    22      22 2018-03-11 05:00:00-04:00
    >>>
    

    To group into twelve-hour increments on the 06:00 and 18:00 boundaries, I assigned each observation a shift number and then grouped by that shift number.

    My data conveniently starts at a shift change, so calculate the elapsed time since that first shift change:

    # elapsed time since the first timestamp (a timedelta64[ns] array)
    nanosec = (df['tstamp'] - df['tstamp'].iloc[0]).values
    

    Find the shift changes and use np.cumsum() to assign shift numbers

    # a shift change falls wherever the elapsed time is an exact multiple of 12 hours
    shift_change = nanosec.astype(np.int64) % (12 * 3600 * 10**9) == 0
    df['shift_nbr'] = shift_change.cumsum()
    gb = df.groupby(df['shift_nbr'])
    for k,g in gb:
        print(f'{k} has {len(g)} items')
    
    >>>
    1 has 12 items
    2 has 12 items
    3 has 12 items
    4 has 12 items
    5 has 12 items
    6 has 12 items
    

    I haven't found a way to compensate for data starting in the middle of a shift.

    If you want the groups for shifts affected by DST changes to have 11 or 13 items, change the timezone-aware series to a timezone-naive series:

    df2 = pd.DataFrame({'tstamp':pd.to_datetime(ts.strftime('%m-%d-%y %H:%M')),'period':range(len(ts))})
    

    Use the same process to assign and group by shift numbers

    nanosec = (df2['tstamp'] - df2['tstamp'].iloc[0]).values
    shift_change = nanosec.astype(np.int64) % (12 * 3600 * 10**9) == 0
    df2['shift_nbr'] = shift_change.cumsum()
    gb2 = df2.groupby(df2['shift_nbr'])
    
    for k,g in gb2:
        print(f'{k} has {len(g)} items')
    
    >>>
    1 has 12 items
    2 has 11 items
    3 has 12 items
    4 has 12 items
    5 has 12 items
    6 has 12 items
    7 has 1 items
    

    Unfortunately, pd.to_datetime(ts.strftime('%m-%d-%y %H:%M')) takes some time. Here is a faster/better way to do it, using the hour attribute of the timestamps to find the shift boundaries directly: there is no need to create a separate timezone-naive series, because the hour attribute reflects local wall-clock time. It also works for data starting in the middle of a shift.

    ts = pd.date_range('01-01-2018 03:00', '01-01-2019 06:00', freq='H', tz='US/Eastern')
    df3 = pd.DataFrame({'tstamp':ts,'period':range(len(ts))})
    
    shift_change = ((df3['tstamp'].dt.hour - 6) % 12) == 0
    shift_nbr = shift_change.cumsum()
    
    gb3 = df3.groupby(shift_nbr)
    
    for k,g in gb3:
        if len(g) != 12:
            # show only the shifts that are not exactly twelve rows long
            print(f"shift starting {g['tstamp'].iloc[0]} has {len(g)} items")
    
    >>>
    shift starting 2018-01-01 03:00:00-05:00 has 3 items
    shift starting 2018-03-10 18:00:00-05:00 has 11 items
    shift starting 2018-11-03 18:00:00-04:00 has 13 items
    shift starting 2019-01-01 06:00:00-05:00 has 1 items
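
    Pulling the last approach together, here is a small helper sketch (the name group_by_shift and the hard-coded 06:00/18:00 boundaries are only illustrative). It reproduces the gb3 grouping above and assumes the data is hourly; for finer-grained data you would also need to check the minute.

    import pandas as pd

    def group_by_shift(df, tstamp_col='tstamp'):
        # mark every row whose local wall-clock hour is 06 or 18
        shift_change = ((df[tstamp_col].dt.hour - 6) % 12) == 0
        # running count of boundaries seen so far = shift number
        shift_nbr = shift_change.cumsum()
        return df.groupby(shift_nbr)

    # usage with the same hourly series as above
    ts = pd.date_range('01-01-2018 03:00', '01-01-2019 06:00', freq='H', tz='US/Eastern')
    df3 = pd.DataFrame({'tstamp': ts, 'period': range(len(ts))})
    for k, g in group_by_shift(df3):
        if len(g) != 12:
            print(f"shift starting {g['tstamp'].iloc[0]} has {len(g)} items")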