Consider the following pandas dataframe
reference sicovam label id date TTM price
0 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 18 52.69
1 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 30 NaN
2 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 49 53.11
3 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 60 NaN
4 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 77 53.69
5 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 90 NaN
6 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 109 54.42
7 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 137 55.15
8 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 171 55.80
9 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 180 NaN
10 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-05 15 50.04
11 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-05 30 NaN
12 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-05 46 50.52
13 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-05 60 NaN
14 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-05 74 51.17
15 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-05 90 NaN
16 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-05 106 51.95
17 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-05 134 52.73
18 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-05 168 53.46
19 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-05 180 NaN
After grouping by the reference
, sicovam
, label
, id
and date
columns, I would like to fill the NaN
values of the price
column via linear interpolation over the TTM
value i.e., in the context of the linear interpolation formula, price
is the y
and TTM
is the x
variable.
So far, I built the following lines.
def intepolate_group(group):
group["price"] = group["price"].interpolate(method='linear', limit_direction='both', axis=0)
return group
new_df = df.groupby(["reference","sicovam","label","id","date"])[["TTM","price"]].apply(intepolate_group)
Nevertheless, the result that I get is the linear interpolation over the index numbers per group. For example for the following part of the dataset, I get 54.06
instead of 53.99
. What do I still need in order to interpolate over the TTM variable?
PS: I want to avoid masking via loop (instead of grouping) and setting the TTM
as the index, because the dataframe is quite big and such a scenario takes considerable amount of time.
4 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 77 53.69
5 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 90 NaN
6 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 109 54.42
Here's one approach:
cols = ['reference', 'sicovam', 'label', 'id', 'date']
df['price'] = (
df.set_index('TTM')
.groupby(cols)['price']
.transform(lambda x: x.interpolate(method='index'))
.values
)
Output:
df.iloc[4:7]
reference sicovam label id date TTM price
4 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 77 53.690000
5 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 90 53.986562
6 SCOM_WTI 68801903 WTI Nymex BBG:CL 2015-01-02 109 54.420000
Explanation
df.set_index
and apply df.groupby
.groupby.transform
and use pd.Series.interpolate
with method='index'
.Series.values
to assign the result back to df['price']
.