I have multi-index data and need to fill NaNs in between the first and last available value per ID

I have a sample dataset like below. Please note that ['ID', 'Date'] constitute the index.

ID	Date	ValueA	ValueB	ValueC
A	01-01-2020	NaN	NaN	NaN
A	01-02-2020	1	5	7
A	01-03-2020	NaN	6	8
A	01-04-2020	2	NaN	NaN
A	01-05-2020	2	NaN	7
B	01-01-2020	2	NaN	6
B	01-02-2020	3	5	7
B	01-03-2020	NaN	NaN	NaN
B	01-04-2020	5	NaN	10
B	01-05-2020	5	8	NaN

Below is the desired output

ID	Date	ValueA	ValueB	ValueC
A	01-01-2020	NaN	NaN	NaN
A	01-02-2020	1	5	7
A	01-03-2020	1	6	8
A	01-04-2020	2	NaN	8
A	01-05-2020	2	NaN	7
B	01-01-2020	2	NaN	6
B	01-02-2020	3	5	7
B	01-03-2020	3	5	7
B	01-04-2020	5	5	10
B	01-05-2020	5	8	NaN

For each ID, I need to fill NaN values in ValueA/ValueB/ValueC but the area is limited to inside. I've tried:

df_padded = df.interpolate(limit_area='inside')
df_fill = df.fillna(method="ffill")
mask = df_padded.isna()
df_fill[mask] = np.nan

Above code works when dealing with a single ID. But since there are multiple IDs present in the dataset the output becomes:

ID	Date	ValueA	ValueB	ValueC
A	01-01-2020	NaN	NaN	NaN
A	01-02-2020	1	5	7
A	01-03-2020	1	6	8
A	01-04-2020	2	6	8
A	01-05-2020	2	6	7
B	01-01-2020	2	6	6
B	01-02-2020	3	5	7
B	01-03-2020	3	5	7
B	01-04-2020	5	5	10
B	01-05-2020	5	8	NaN

I've also tried applying below function, but could not set it to multi-index to see if I can then apply the mask.

def fwdfill(df_monthlies):
    ID_list = df_monthlies['ID'].unique().tolist()
    for ID in ID_LIST:
        ID_data = df_monthlies[df_monthlies['ID'] == ID]
        ID_data = ID_data.sort_values(by=['Date'])
        ID_data = ID_data.reset_index(drop=True)

Solution

Use df.reset_index + df.groupby, and then:

Option 1

groupby.apply + df.interpolate with "pad" on axis=1 ("pad" is not possible with a MultiIndex, hence the need to reset the index first).

out = (
    df.reset_index()
      .groupby('ID', group_keys=False)
      .apply(lambda x: x.interpolate('pad', limit_area='inside', axis=1))
      .set_index(['ID', 'Date'])
)

Option 2

transform

The basic operation being:

cols = ['ValueA', 'ValueB', 'ValueC']

(
    df.reset_index()
      .groupby('ID')[cols]
      .transform(lambda x: x.interpolate('pad', limit_area='inside'))
)

You can assign that back in a one-liner:

out2 = (
    df.reset_index()
      .pipe(
          lambda x: x.assign(
              **x.groupby('ID')[cols].transform(
                  lambda x: x.interpolate('pad', limit_area='inside')
                  )
              )
          )
      .set_index(['ID', 'Date'])
)

out2.equals(out)
# True

Output:

               ValueA  ValueB  ValueC
ID Date                              
A  01-01-2020     NaN     NaN     NaN
   01-02-2020     1.0     5.0     7.0
   01-03-2020     1.0     6.0     8.0
   01-04-2020     2.0     NaN     8.0
   01-05-2020     2.0     NaN     7.0
B  01-01-2020     2.0     NaN     6.0
   01-02-2020     3.0     5.0     7.0
   01-03-2020     3.0     5.0     7.0
   01-04-2020     5.0     5.0    10.0
   01-05-2020     5.0     8.0     NaN

Data used

import pandas as pd
import numpy as np

data = {'ID': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B', 
               7: 'B', 8: 'B', 9: 'B'}, 
        'Date': {0: '01-01-2020', 1: '01-02-2020', 2: '01-03-2020', 3: '01-04-2020', 
                 4: '01-05-2020', 5: '01-01-2020', 6: '01-02-2020', 7: '01-03-2020', 
                 8: '01-04-2020', 9: '01-05-2020'}, 
        'ValueA': {0: np.nan, 1: 1.0, 2: np.nan, 3: 2.0, 4: 2.0, 5: 2.0, 6: 3.0, 
                   7: np.nan, 8: 5.0, 9: 5.0}, 
        'ValueB': {0: np.nan, 1: 5.0, 2: 6.0, 3: np.nan, 4: np.nan, 5: np.nan, 
                   6: 5.0, 7: np.nan, 8: np.nan, 9: 8.0}, 
        'ValueC': {0: np.nan, 1: 7.0, 2: 8.0, 3: np.nan, 4: 7.0, 5: 6.0, 6: 7.0, 
                   7: np.nan, 8: 10.0, 9: np.nan}}

df = pd.DataFrame(data).set_index(['ID', 'Date'])