I'm currently working with a dataframe which is routinely grouped into a MultiIndex of three or more levels, with fiscal Quarter always at the top level. Necessarily, a few calculated fields are added to the frame as year/year percent change within each unique index, easily obtained with a groupby up to but not including Quarter, and pd.pct_change().
Unfortunately, this only returns accurate values if a value exists for each possible Quarter. If I have a point for 2021Q1 and my next is 2021Q4, I need to pad in a row with zeroes for 2021Q2 and 2021Q3 in order for the year/year at 2021Q4 to not return 2021Q4/2021Q1. My problem is that I often have at least six and up to fifty unique index values at each level of the MultiIndex, and to pad it correctly I need as many rows as the are unique combinations, which quickly becomes an exponential disaster which makes the code unusable.
My question: Is it possible to take a Quarter/Quarter value respecting the MultiIndex without padding out every missing quarter on the index?
Reproducible example:
idx=pd.MultiIndex.from_product([['Canine', 'Feline'],
['Chihuahua','Samoyed','Shorthair'],
[dt.datetime(2021,4,1),dt.datetime(2021,7,1),dt.datetime(2021,10,1),dt.datetime(2022,1,1)]],
names=['species','breed','cyq'])
data=pd.DataFrame(index=idx)
data.loc[:,'paid']=np.random.randint(100,200,24)
correct_ex=data.drop([('Canine','Shorthair'),('Feline','Chihuahua'),('Feline','Samoyed')])
correct_ex.loc[('Canine','Samoyed',dt.datetime(2021,7,1)),'paid']=0
incorrect_ex=correct_ex.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])
correct_ex.loc[:,'paid_change']=correct_ex.groupby(['species','breed'])['paid'].pct_change()
correct_ex=correct_ex.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])
incorrect_ex.loc[:,'paid_change']=correct_ex.groupby(['species','breed'])['paid'].pct_change()
The correct_ex frame above contains the values that I would want to see if Samoyeds had no data for 7/1/2021, but the only way to get it is to keep a row with paid value 0 for that date. The incorrect_ex frame above is what I get if I attempt pct_change without the added row.
Thanks for the help!
You can calculate the percentage changes in a groupby
apply
and mask
rows where the difference between the cyq
dates (= last level of the index) is more than 93 days with np.inf
:
import numpy as np
import pandas as pd
import datetime as dt
idx = pd.MultiIndex.from_product([['Canine', 'Feline'],
['Chihuahua','Samoyed','Shorthair'],
[dt.datetime(2021,4,1),dt.datetime(2021,7,1),dt.datetime(2021,10,1),dt.datetime(2022,1,1)]],
names=['species','breed','cyq'])
np.random.seed(0)
data = pd.DataFrame({'paid': np.random.randint(100,200,24)}, index=idx)
data = data.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])
data = data.drop([('Canine','Shorthair'),('Feline','Chihuahua'),('Feline','Samoyed')])
data['paid_change'] = data.groupby(['species','breed']).paid.apply(
lambda x: x.pct_change().mask(
x.index.get_level_values(-1).to_series().diff().gt(pd.Timedelta(days=93)),
np.inf
)
)
Result:
paid paid_change
species breed cyq
Canine Chihuahua 2021-04-01 144 NaN
2021-07-01 147 0.020833
2021-10-01 164 0.115646
2022-01-01 167 0.018293
Samoyed 2021-04-01 167 NaN
2021-10-01 183 inf
2022-01-01 121 -0.338798
Feline Shorthair 2021-04-01 181 NaN
2021-07-01 137 -0.243094
2021-10-01 125 -0.087591
2022-01-01 177 0.416000