Tags: python, python-3.x, pandas, group-by, interpolation

How to interpolate missing years within a pandas groupby()


Problem:

I have a dataframe whose entries are spaced at 5-year intervals. I need to group the entries by the 'id' column and interpolate values for the years between the first and last item in each group. I understand that it has to be some combination of groupby(), set_index() and interpolate(), but I am unable to make it work for the whole input dataframe.

Sample df:

import pandas as pd

data = {
    'id': ['a', 'b', 'a', 'b'],
    'year': [2005, 2005, 2010, 2010],
    'val': [0, 0, 100, 100],
}

df = pd.DataFrame.from_dict(data)

Example input df:

  id  year  val
0  a  2005    0
1  b  2005    0
2  a  2010  100
3  b  2010  100

Expected output df:

      id  year  val type
0      a  2005    0 original
1      a  2006   20 interpolated
2      a  2007   40 interpolated
3      a  2008   60 interpolated
4      a  2009   80 interpolated
5      a  2010  100 original
6      b  2005    0 original
7      b  2006   20 interpolated
8      b  2007   40 interpolated
9      b  2008   60 interpolated
10     b  2009   80 interpolated
11     b  2010  100 original

The 'type' column is not necessary; it's just for illustration purposes.

Question:

How can I add the missing years to each group and interpolate() their corresponding values?

Thank you!


Solution

  • Temporarily reshape to wide format with pivot, use reindex + interpolate to add the missing years, then unstack back to long format:

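    out = (df
       .pivot(index='year', columns='id', values='val')  # wide format: one column per id
       .reindex(range(df['year'].min(), df['year'].max()+1))  # add rows for the missing years
       .interpolate('index')  # linear interpolation using the year values
       .unstack(-1).reset_index(name='val')  # back to long format
    )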

    Output:

       id  year    val
    0   a  2005    0.0
    1   a  2006   20.0
    2   a  2007   40.0
    3   a  2008   60.0
    4   a  2009   80.0
    5   a  2010  100.0
    6   b  2005    0.0
    7   b  2006   20.0
    8   b  2007   40.0
    9   b  2008   60.0
    10  b  2009   80.0
    11  b  2010  100.0
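
  • The pivot approach reindexes every id over the global min/max year, so ids with a shorter span would also receive rows for years outside their own range. If that matters, a per-group variant is possible. The following is only a sketch under that assumption: the helper name fill_years is hypothetical, and the 'type' column is rebuilt purely to mirror the illustration in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'a', 'b'],
    'year': [2005, 2005, 2010, 2010],
    'val': [0, 0, 100, 100],
})


def fill_years(group):
    """Hypothetical helper: expand one id's rows to every year in its own range."""
    # Index the values by year so that reindex can add the missing years.
    s = group.set_index('year')['val'].sort_index()
    full_years = range(s.index.min(), s.index.max() + 1)
    # Missing years become NaN, then are filled linearly based on the year values.
    filled = s.reindex(full_years).interpolate('index')
    out = filled.rename_axis('year').reset_index()
    # Mark which rows were present in the input and which were filled in.
    out['type'] = out['year'].isin(s.index).map(
        {True: 'original', False: 'interpolated'}
    )
    return out


# Process each id separately, then stitch the pieces back together.
parts = [fill_years(g).assign(id=key) for key, g in df.groupby('id')]
out = pd.concat(parts, ignore_index=True)[['id', 'year', 'val', 'type']]
print(out)
```

    Iterating over the groups explicitly (rather than using groupby().apply()) keeps the 'id' handling visible and lets each id use its own first and last year as the interpolation bounds.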