Search code examples
pythonpandasgroup-byinterpolation

Python Pandas: Groupby multiple columns and linearly interpolate values of column Y based on another X column


Consider the following pandas dataframe

    reference   sicovam     label       id      date        TTM price
0   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  18  52.69
1   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  30  NaN
2   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  49  53.11
3   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  60  NaN
4   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  77  53.69
5   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  90  NaN
6   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  109 54.42
7   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  137 55.15
8   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  171 55.80
9   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  180 NaN
10  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  15  50.04
11  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  30  NaN
12  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  46  50.52
13  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  60  NaN
14  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  74  51.17
15  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  90  NaN
16  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  106 51.95
17  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  134 52.73
18  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  168 53.46
19  SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-05  180 NaN

After grouping by the reference, sicovam, label, id and date columns, I would like to fill the NaN values of the price column via linear interpolation over the TTM value i.e., in the context of the linear interpolation formula, price is the y and TTM is the x variable.

So far, I built the following lines.

def intepolate_group(group):
    group["price"] = group["price"].interpolate(method='linear', limit_direction='both', axis=0)
    return group

new_df = df.groupby(["reference","sicovam","label","id","date"])[["TTM","price"]].apply(intepolate_group)

Nevertheless, the result that I get is the linear interpolation over the index numbers per group. For example for the following part of the dataset, I get 54.06 instead of 53.99. What do I still need in order to interpolate over the TTM variable?

PS: I want to avoid masking via loop (instead of grouping) and setting the TTM as the index, because the dataframe is quite big and such a scenario takes considerable amount of time.

4   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  77  53.69
5   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  90  NaN
6   SCOM_WTI    68801903    WTI Nymex   BBG:CL  2015-01-02  109 54.42

Solution

  • Here's one approach:

    cols = ['reference', 'sicovam', 'label', 'id', 'date']
    
    df['price'] = (
        df.set_index('TTM')
        .groupby(cols)['price']
        .transform(lambda x: x.interpolate(method='index'))
        .values
        )
    

    Output:

    df.iloc[4:7]
    
                reference sicovam  label      id        date  TTM      price
    4 SCOM_WTI   68801903     WTI  Nymex  BBG:CL  2015-01-02   77  53.690000
    5 SCOM_WTI   68801903     WTI  Nymex  BBG:CL  2015-01-02   90  53.986562
    6 SCOM_WTI   68801903     WTI  Nymex  BBG:CL  2015-01-02  109  54.420000
    

    Explanation