Problem:
I have a dataframe that contains entries with 5 year time intervals. I need to group entries by 'id' columns and interpolate values between the first and last item in the group. I understand that it has to be some combination of groupby(), set_index() and interpolate() but I am unable to make it work for the whole input dataframe.
Sample df:
import pandas as pd
data = {
'id': ['a', 'b', 'a', 'b'],
'year': [2005, 2005, 2010, 2010],
'val': [0, 0, 100, 100],
}
df = pd.DataFrame.from_dict(data)
example input df:
_ id year val
0 a 2005 0
1 a 2010 100
2 b 2005 0
3 b 2010 100
expected output df:
_ id year val type
0 a 2005 0 original
1 a 2006 20 interpolated
2 a 2007 40 interpolated
3 a 2008 60 interpolated
4 a 2009 80 interpolated
5 a 2010 100 original
6 b 2005 0 original
7 b 2006 20 interpolated
8 b 2007 40 interpolated
9 b 2008 60 interpolated
10 b 2009 80 interpolated
11 b 2010 100 original
'type' is not necessary its just for illustration purposes.
Question:
How can I add missing years to the groupby() view and interpolate() their corresponding values?
Thank you!
Using a temporary reshaping with pivot
and unstack
and reindex
+interpolate
to add the missing years:
out = (df
.pivot(index='year', columns='id', values='val')
.reindex(range(df['year'].min(), df['year'].max()+1))
.interpolate('index')
.unstack(-1).reset_index(name='val')
)
Output:
id year val
0 a 2005 0.0
1 a 2006 20.0
2 a 2007 40.0
3 a 2008 60.0
4 a 2009 80.0
5 a 2010 100.0
6 b 2005 0.0
7 b 2006 20.0
8 b 2007 40.0
9 b 2008 60.0
10 b 2009 80.0
11 b 2010 100.0