Search code examples
pythonpandasdateforecasting

How to deal with change from summer to winter time when using resampling in pandas


I have a dataframe that I would like to use for a load forecast. The data is recordes every 15 minutes. I would first like to resample the data by calculating the mean for every hour. For this purpose I use the resampling function from pandas new_df = df['Load'].resample('1H').mean()

Bascially this is not a problem as long as the time data is consistant regarding winter and summer time. However, when having a switch this get problematic (and I get an Type Error). So here you see an extract of the data that shows the change from summer to winter time enter image description here

and here you see the data when changing from winter time to summer time:

enter image description here

In the first case some hour values (02:00, 02:15, 02:30, 02:45) exist 2 time this is why I get an error when trying to resampe it. In the latter case the values for these specific hours are missing. Now my question is how to deal with that problem? Is there a way to tell pandas that the one is summer time and the other one is winter time? In the data itself this is indicated with the addition "+02:00" or "+01:00" but this is still a problem for pandas. Or can you think about any other solution? One naive approach would be just to delete the 4 data records when changing from summer to winter time (first case) and to copy the previous 4 data record when changing from winter to summer time.

Reminder: Does nobody have an idea or a suggestion? This problem should occur also for others.


Solution

  • It looks like you have an Index of timestamps (or maybe strings), but not a DatetimeIndex, because a DatetimeIndex cannot hold timestamps with heterogenous timezone information.

    This is unwieldy, so my suggestion is to create a proper DatetimeIndex first, and then do whatever resampling you need to do afterwards.

    Here's a small DataFrame with mixed timezone timestamps:

    df = pd.DataFrame(
        [1, 2],
        index=pd.Index([pd.Timestamp('12:00:00+01:00'), pd.Timestamp('12:00:00+00:00')])
    )
    

    It does not have a DatetimeIndex:

    >>> df
                               0
    2021-05-31 12:00:00+01:00  1
    2021-05-31 12:00:00+00:00  2
    >>> type(df.index)
    <class 'pandas.core.indexes.base.Index'>
    

    In order to construct a DatetimeIndex for df, we need to decide on a common timezone, so let's use UTC+0.

    df.index = pd.to_datetime(df.index, utc=True)
    

    The DataFrame now looks like this and has a DatetimeIndex:

    >>> df
                               0
    2021-05-31 11:00:00+00:00  1
    2021-05-31 12:00:00+00:00  2
    >>> type(df.index)
    <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
    

    Notice how 2021-05-31 12:00:00+01:00 became 2021-05-31 11:00:00+00:00.