Search code examples
pythonpandastime-seriesmissing-datamulti-index

Shifting column in multiindex dataframe with missting dates


I'd like to shift a column in a multiindex dataframe in order to calculate a regression model with a lagged independent variable. As my time-series has missing values I only want to have the values shifted for known previous days. The df looks like that:

                cost
ID  day
1   31.01.2020  0
1   03.02.2020  0
1   04.02.2020  0.12
1   05.02.2020  0
1   06.02.2020  0
1   07.02.2020  0.08
1   10.02.2020  0
1   11.02.2020  0
1   12.02.2020  0.03
1   13.02.2020  0.1
1   14.02.2020  0

The desired output would like that:

                cost   cost_lag
ID  day
1   31.01.2020  0      NaN
1   03.02.2020  0      NaN
1   04.02.2020  0.12   0
1   05.02.2020  0      0.12
1   06.02.2020  0      0
1   07.02.2020  0.08   0
1   10.02.2020  0      NaN
1   11.02.2020  0      0
1   12.02.2020  0.03   0
1   13.02.2020  0.1    0.03
1   14.02.2020  0      0.1 

Based on this answer to a similar question I've tried the following:

df['cost_lag'] = df.groupby(['id'])['cost'].shift(1)[df.reset_index().day == df.reset_index().day.shift(1) + datetime.timedelta(days=1)]

But that results in an error message I don't understand:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match

I've also tried to fill the missing dates following an approach suggested here:

ams_spend_ranking_df = ams_spend_ranking_df.index.get_level_values(1).apply(lambda x: datetime.datetime(x, 1, 1))

again resulting in an error message which does not enlighten me:

AttributeError: 'DatetimeIndex' object has no attribute 'apply'

Long story short: how can I shift the cost column by 1 day and add NaNs if I don't have data on the previous day?


Solution

  • You can add all missing datetimes by DataFrameGroupBy.resample with Resampler.asfreq:

    df1 = df.reset_index(level=0).groupby(['ID'])['cost'].resample('d').asfreq()
    print (df1)
    ID  day       
    1   2020-01-31    0.00
        2020-02-01     NaN
        2020-02-02     NaN
        2020-02-03    0.00
        2020-02-04    0.12
        2020-02-05    0.00
        2020-02-06    0.00
        2020-02-07    0.08
        2020-02-08     NaN
        2020-02-09     NaN
        2020-02-10    0.00
        2020-02-11    0.00
        2020-02-12    0.03
        2020-02-13    0.10
        2020-02-14    0.00
    Name: cost, dtype: float64
    

    So then if use your solution with DataFrameGroupBy.shift it working like need:

    df['cost_lag'] = df1.groupby('ID').shift(1)
    print (df)
                   cost  cost_lag
    ID day                       
    1  2020-01-31  0.00       NaN
       2020-02-03  0.00       NaN
       2020-02-04  0.12      0.00
       2020-02-05  0.00      0.12
       2020-02-06  0.00      0.00
       2020-02-07  0.08      0.00
       2020-02-10  0.00       NaN
       2020-02-11  0.00      0.00
       2020-02-12  0.03      0.00
       2020-02-13  0.10      0.03
       2020-02-14  0.00      0.10