Search code examples
python-3.xpandasgroup-byinterpolation

Pandas interpolate within groupby on certain column for a specific input value


This is very similar to these two topics: Follow-up question Original question

I would need to interpolate not on all columns nor on a single column but on a set of columns (for instance two). Furthermore I would need to interpolate on a value which is most of the time missing from the reference column used for the interpolation. A small example will clarify:

Original dataframe:

    input_date    pivot_column    val1    val2
t
1   2023/11/02    16              3300    37
2   2023/11/02    14              3100    33
3   2023/11/02    10              3000    32.7
4   2023/11/03    17.5            3350    36.11
5   2023/11/03    15.75           3200    33.3
6   2023/11/03    14.25           3000    31.3

For each unique value in the column input_date, I would like to compute the interpolated values for columns val1 and val2 based on an hypothetical value of 15 from column pivot_column so something like

     input_date    pivot_column    val1    val2
t
1    2023/11/02    15              3200    35
2    2023/11/03    15              3100    32.3

I can use the solution from the follow-up question thread, ie

df["val2_interpolated"] = df[["filename","val2"]].groupby('filename')
.apply(lambda x:x) # WTF
.interpolate(method='linear')["val2"]

but how can I force the interpolation to be done around a pivot/input value of 15 for instance? do I need to apply my own interpolation function on a dataframe where I add a new row with 15 under the column pivot_column? or after grouping by the dataframe, apply np.interp?


Solution

  • I would concat the new rows with the value of 15 in pivot_column and then interpolate:

    other = pd.DataFrame({"input_date": df["input_date"].unique(), "pivot_column": 15})
    df2 = pd.concat([df,other])
    df3 = df2.groupby("input_date").apply(lambda g: g.set_index("pivot_column").interpolate("index")).drop(columns="input_date").reset_index()
    
    #keep only required rows
    output = other.merge(df3)
    
    >>> output
       input_date  pivot_column    val1  val2
    0  2023/11/02            15  3200.0  35.0
    1  2023/11/03            15  3100.0  32.3