Search code examples
pythonpandasmulti-index

Pct Change with Time Element


I'm currently working with a dataframe which is routinely grouped into a MultiIndex of three or more levels, with fiscal Quarter always at the top level. Necessarily, a few calculated fields are added to the frame as year/year percent change within each unique index, easily obtained with a groupby up to but not including Quarter, and pd.pct_change().

Unfortunately, this only returns accurate values if a value exists for each possible Quarter. If I have a point for 2021Q1 and my next is 2021Q4, I need to pad in a row with zeroes for 2021Q2 and 2021Q3 in order for the year/year at 2021Q4 to not return 2021Q4/2021Q1. My problem is that I often have at least six and up to fifty unique index values at each level of the MultiIndex, and to pad it correctly I need as many rows as the are unique combinations, which quickly becomes an exponential disaster which makes the code unusable.

My question: Is it possible to take a Quarter/Quarter value respecting the MultiIndex without padding out every missing quarter on the index?

Reproducible example:

idx=pd.MultiIndex.from_product([['Canine', 'Feline'],
                                ['Chihuahua','Samoyed','Shorthair'],
                                  [dt.datetime(2021,4,1),dt.datetime(2021,7,1),dt.datetime(2021,10,1),dt.datetime(2022,1,1)]],
                                 names=['species','breed','cyq'])

data=pd.DataFrame(index=idx)
data.loc[:,'paid']=np.random.randint(100,200,24)
correct_ex=data.drop([('Canine','Shorthair'),('Feline','Chihuahua'),('Feline','Samoyed')])
correct_ex.loc[('Canine','Samoyed',dt.datetime(2021,7,1)),'paid']=0
incorrect_ex=correct_ex.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])

correct_ex.loc[:,'paid_change']=correct_ex.groupby(['species','breed'])['paid'].pct_change()
correct_ex=correct_ex.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])
incorrect_ex.loc[:,'paid_change']=correct_ex.groupby(['species','breed'])['paid'].pct_change()

Correct Results: Correct Results

Incorrect Results: Incorrect Results

The correct_ex frame above contains the values that I would want to see if Samoyeds had no data for 7/1/2021, but the only way to get it is to keep a row with paid value 0 for that date. The incorrect_ex frame above is what I get if I attempt pct_change without the added row.

Thanks for the help!


Solution

  • You can calculate the percentage changes in a groupby apply and mask rows where the difference between the cyq dates (= last level of the index) is more than 93 days with np.inf:

    import numpy as np
    import pandas as pd
    import datetime as dt
    
    idx = pd.MultiIndex.from_product([['Canine', 'Feline'],
                                      ['Chihuahua','Samoyed','Shorthair'],
                                        [dt.datetime(2021,4,1),dt.datetime(2021,7,1),dt.datetime(2021,10,1),dt.datetime(2022,1,1)]],
                                       names=['species','breed','cyq'])
    np.random.seed(0)
    data = pd.DataFrame({'paid': np.random.randint(100,200,24)}, index=idx)
    data = data.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])
    data = data.drop([('Canine','Shorthair'),('Feline','Chihuahua'),('Feline','Samoyed')])
    
    data['paid_change'] = data.groupby(['species','breed']).paid.apply(
                              lambda x: x.pct_change().mask(
                                  x.index.get_level_values(-1).to_series().diff().gt(pd.Timedelta(days=93)),
                                  np.inf
                              )
                          )
    

    Result:

                                  paid  paid_change
    species breed     cyq                          
    Canine  Chihuahua 2021-04-01   144          NaN
                      2021-07-01   147     0.020833
                      2021-10-01   164     0.115646
                      2022-01-01   167     0.018293
            Samoyed   2021-04-01   167          NaN
                      2021-10-01   183          inf
                      2022-01-01   121    -0.338798
    Feline  Shorthair 2021-04-01   181          NaN
                      2021-07-01   137    -0.243094
                      2021-10-01   125    -0.087591
                      2022-01-01   177     0.416000