Search code examples
python-3.xpandasdatetimetimezonetimestamp-with-timezone

`combine_first` at DST changing day fails


I am trying to use combine_first to join two pandas series, so that one has precedence over the other. But it fails on the DST change day. I put together this demonstration:

import pandas as pd
import numpy as np

fr1 = pd.date_range(pd.to_datetime('2020-10-25').tz_localize('Europe/Berlin'), pd.to_datetime('2020-10-26').tz_localize('Europe/Berlin'), freq='H')
fr2 = fr1 + pd.DateOffset(hours=12)
d1 = pd.Series(data=np.random.randint(0,10, len(fr1)), index = fr1)
d2 = pd.Series(data=np.random.randint(0,10, len(fr2)), index = fr2)
d2.combine_first(d1)

output:

ValueError: cannot reindex from a duplicate axis

Why is this? How can this be fixed? I can of course manually slice the series and concatenate.


Solution

  • As you can see looking at the first items of your DateTimeIndex, the timezone as +xxxx changes at the moment of DST change:

    >>> date_range = pd.date_range(pd.to_datetime('2020-10-25').tz_localize('Europe/Berlin'), pd.to_datetime('2020-10-26').tz_localize('Europe/Berlin'), freq='H')
    >>> date_range[:4]
    DatetimeIndex(['2020-10-25 00:00:00+02:00', '2020-10-25 01:00:00+02:00',
                   '2020-10-25 02:00:00+02:00', '2020-10-25 02:00:00+01:00'],
    

    When you add the 12 hours, the dates change their timezones too:

    DatetimeIndex(['2020-10-25 12:00:00+01:00', '2020-10-25 13:00:00+01:00',
                   '2020-10-25 14:00:00+01:00', '2020-10-25 14:00:00+01:00'],
                  dtype='datetime64[ns, Europe/Berlin]', freq=None)
    

    However the difference in time is not 12 hours anymore, it is 11 due to the timezone change. That is because when adding DateOffset to a DatetimeIndex, you ask for a « apparent » change of 12 hours, as in « the 2 different clock readings are 12 hours apart ». This means that then 2 timestamps are mapped to the same timestamp that appears 12 hours later.

    It is a specificity of DateOffset to allow to express non-constant deltas, i.e. if you add « 1 month », you would similarly not add the same number of seconds to a date depending which month you’re in.

    If you want a shift in 12 actual hours, use Timedelta:

    >>> (date_range[:4] + pd.Timedelta(hours=12))
    DatetimeIndex(['2020-10-25 11:00:00+01:00', '2020-10-25 12:00:00+01:00',
                   '2020-10-25 13:00:00+01:00', '2020-10-25 14:00:00+01:00'],
                  dtype='datetime64[ns, Europe/Berlin]', freq='H')
    >>> (date_range + pd.Timedelta(hours=12)).is_unique
    True
    

    Now that the index is unique, combine_first will also work:

    >>> d1 = pd.Series(data=np.random.randint(0, 10, 26), index=date_range)
    >>> d2 = pd.Series(data=np.random.randint(0, 10, 26), index=date_range + pd.Timedelta(hours=12))
    >>> d2.combine_first(d1)
    2020-10-25 00:00:00+02:00    0.0
    2020-10-25 01:00:00+02:00    1.0
    2020-10-25 02:00:00+02:00    2.0
    2020-10-25 02:00:00+01:00    1.0
    2020-10-25 03:00:00+01:00    7.0
    2020-10-25 04:00:00+01:00    4.0
    2020-10-25 05:00:00+01:00    0.0
    2020-10-25 06:00:00+01:00    1.0
    2020-10-25 07:00:00+01:00    1.0
    2020-10-25 08:00:00+01:00    1.0
    2020-10-25 09:00:00+01:00    6.0
    2020-10-25 10:00:00+01:00    7.0
    2020-10-25 11:00:00+01:00    4.0
    2020-10-25 12:00:00+01:00    8.0
    2020-10-25 13:00:00+01:00    6.0
    2020-10-25 14:00:00+01:00    0.0
    2020-10-25 15:00:00+01:00    6.0
    2020-10-25 16:00:00+01:00    0.0
    2020-10-25 17:00:00+01:00    0.0
    2020-10-25 18:00:00+01:00    9.0
    2020-10-25 19:00:00+01:00    7.0
    2020-10-25 20:00:00+01:00    9.0
    2020-10-25 21:00:00+01:00    3.0
    2020-10-25 22:00:00+01:00    4.0
    2020-10-25 23:00:00+01:00    0.0
    2020-10-26 00:00:00+01:00    6.0
    2020-10-26 01:00:00+01:00    5.0
    2020-10-26 02:00:00+01:00    9.0
    2020-10-26 03:00:00+01:00    1.0
    2020-10-26 04:00:00+01:00    4.0
    2020-10-26 05:00:00+01:00    4.0
    2020-10-26 06:00:00+01:00    3.0
    2020-10-26 07:00:00+01:00    1.0
    2020-10-26 08:00:00+01:00    8.0
    2020-10-26 09:00:00+01:00    1.0
    2020-10-26 10:00:00+01:00    6.0
    2020-10-26 11:00:00+01:00    5.0
    2020-10-26 12:00:00+01:00    6.0
    Freq: H, dtype: float64