I have a dataframe like as below
df = pd.DataFrame({'subject_id':[1,1,1,1,1,2,2,2,2,2],
'qtr_info' :['2017Q1','2017Q3','2017Q4','2018Q1','2018Q4','2017Q1','2017Q4','2018Q2','2018Q4','2019Q1'],
'val' :[5,5,5,5,1,6,5,5,8,3],
'Prod_id':['A','B','C','A','E','Q','G','F','G','H']})
I would like to do the below
a) Fill all the missing quarters of a subject
b) fillna for other columns using the mean value
for respective columns (for the same subject). Don't refer other subject ids for computing mean value
So, I tried the below
df_time.resample('Q').mean() #didn't work
df_time.groupby('subject_id').resample('Q).mean() #didn't work
But I got the below error
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
I expect my output to be like as below for subject_id = 1. Here yellow color shows newly inserted rows (to fill the missing gaps)
Create PeriodIndex
first and for resample use Resampler.agg
with dictionary, for replace missing values by means use Series.fillna
with GroupBy.transform
:
df_time['qtr_info'] = pd.PeriodIndex(df_time['qtr_info'], freq='Q')
df = (df_time.set_index('qtr_info')
.groupby('subject_id')
.resample('Q')
.agg({'val':'mean', 'Prod_id':'ffill'}))
df['val'] = df['val'].fillna(df.groupby(level=0)['val'].transform('mean'))
print (df)
val Prod_id
subject_id qtr_info
1 2017Q1 5.0 A
2017Q2 4.2 A
2017Q3 5.0 B
2017Q4 5.0 C
2018Q1 5.0 A
2018Q2 4.2 A
2018Q3 4.2 A
2018Q4 1.0 E
2 2017Q1 6.0 Q
2017Q2 5.4 Q
2017Q3 5.4 Q
2017Q4 5.0 G
2018Q1 5.4 G
2018Q2 5.0 F
2018Q3 5.4 F
2018Q4 8.0 G
2019Q1 3.0 H