Search code examples
pythonpandasindexingmulti-index

Shift DateTime index within a Pandas MultiIndex


I have a csv file that looks like this when I load it:

# generate example data
users = ['A', 'B', 'C', 'D']
#dates = pd.date_range("2020-02-01 00:00:00", "2020-04-04 20:00:00", freq="H")
dates = pd.date_range("2020-02-01 00:00:00", "2020-02-04 20:00:00", freq="H")
idx = pd.MultiIndex.from_product([users, dates])
idx.names = ["user", "datehour"]
y = pd.Series(np.random.choice(a=[0, 1], size=len(idx)), index=idx).rename('y')

# write to csv and reload (turns out this matters)
y.to_csv('reprod_example.csv')
y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.set_index(['user', 'datehour']).y

>>> y.head()
user  datehour           
A     2020-02-01 00:00:00    0
      2020-02-01 01:00:00    0
      2020-02-01 02:00:00    1
      2020-02-01 03:00:00    0
      2020-02-01 04:00:00    0
Name: y, dtype: int64

I have the following function to create a lagged feature of an index level:

def shift_index(a, dt_idx_name, lag_freq, lag):

    # get datetime index of relevant level
    ac = a.copy()
    dti = ac.index.get_level_values(dt_idx_name)

    # shift it
    dti_shifted = dti.shift(lag, freq=lag_freq)

    # put it back where you found it
    ac.index.set_levels(dti_shifted, level=dt_idx_name, inplace=True)

    return ac

But when I run: y_lag = shift_index(y, 'datehour', 'H', 1), I get the following error:

ValueError: Level values must be unique...

(I can actually suppress this error by adding verify_integrity=False in .index.set_levels... in the function, but that (predictably) causes problems down the line)

Here's the weird part. If you run the example above but without saving/reloading from csv, it works. The reason seems to be, I think, that y.index.get_level_value('datehour') shows a freq='H' attribute right after it's created, but freq=None once its reloaded from csv.

That makes sense, csv obviously doesn't save that metadata. But I've found it surprisingly difficult to set the freq attribute for a MultiIndexed series. For example this did nothing. df.index.freq = pd.tseries.frequencies.to_offset("H"). And this answer also didn't work for my MultiIndex.

So I think I could solve this if I were able to set the freq attribute of the DateTime component of my MultiIndex. But my ultimate goal is to be create a version of my y data with shifted DateTime MultiIndex component, such as with my shift_index function above. Since I receive my data via csv, "just don't save to csv and reload" is not an option.


Solution

  • After much fidgeting, I was able to set an hourly frequency using asfreq('H') on grouped data, such that each group has unique values for the datehour index.

    y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
    y = y.groupby('user').apply(lambda df: df.set_index('datehour').asfreq('H')).y
    

    Peeking at an index value shows the correct frequency.

    y.index[0]                                                                                                                                                                                                                          
    # ('A', Timestamp('2020-02-01 00:00:00', freq='H'))
    

    All this is doing is setting the index in two parts. The user goes first so that the nested datehour index can be unique within it. Once the datehour index is unique, then asfreq can be used without difficulty.

    If you try asfreq on a non-unique index, it will not work.

    y_load.set_index('datehour').asfreq('H')
    # ---------------------------------------------------------------------------
    # ValueError                                Traceback (most recent call last)
    # <ipython-input-433-3ba51b619417> in <module>
    # ----> 1 y_load.set_index('datehour').asfreq('H')
    # ...
    # ValueError: cannot reindex from a duplicate axis