Search code examples
pythonpandaspivotinterpolation

I have multi-index data and need to fill NaNs in between the first and last available value per ID


I have a sample dataset like below. Please note that ['ID', 'Date'] constitute the index.

ID Date ValueA ValueB ValueC
A 01-01-2020 NaN NaN NaN
A 01-02-2020 1 5 7
A 01-03-2020 NaN 6 8
A 01-04-2020 2 NaN NaN
A 01-05-2020 2 NaN 7
B 01-01-2020 2 NaN 6
B 01-02-2020 3 5 7
B 01-03-2020 NaN NaN NaN
B 01-04-2020 5 NaN 10
B 01-05-2020 5 8 NaN

Below is the desired output

ID Date ValueA ValueB ValueC
A 01-01-2020 NaN NaN NaN
A 01-02-2020 1 5 7
A 01-03-2020 1 6 8
A 01-04-2020 2 NaN 8
A 01-05-2020 2 NaN 7
B 01-01-2020 2 NaN 6
B 01-02-2020 3 5 7
B 01-03-2020 3 5 7
B 01-04-2020 5 5 10
B 01-05-2020 5 8 NaN

For each ID, I need to fill NaN values in ValueA/ValueB/ValueC but the area is limited to inside. I've tried:

df_padded = df.interpolate(limit_area='inside')
df_fill = df.fillna(method="ffill")
mask = df_padded.isna()
df_fill[mask] = np.nan

Above code works when dealing with a single ID. But since there are multiple IDs present in the dataset the output becomes:

ID Date ValueA ValueB ValueC
A 01-01-2020 NaN NaN NaN
A 01-02-2020 1 5 7
A 01-03-2020 1 6 8
A 01-04-2020 2 6 8
A 01-05-2020 2 6 7
B 01-01-2020 2 6 6
B 01-02-2020 3 5 7
B 01-03-2020 3 5 7
B 01-04-2020 5 5 10
B 01-05-2020 5 8 NaN

I've also tried applying below function, but could not set it to multi-index to see if I can then apply the mask.

def fwdfill(df_monthlies):
    ID_list = df_monthlies['ID'].unique().tolist()
    for ID in ID_LIST:
        ID_data = df_monthlies[df_monthlies['ID'] == ID]
        ID_data = ID_data.sort_values(by=['Date'])
        ID_data = ID_data.reset_index(drop=True)

Solution

  • Use df.reset_index + df.groupby, and then:

    Option 1

    groupby.apply + df.interpolate with "pad" on axis=1 ("pad" is not possible with a MultiIndex, hence the need to reset the index first).

    out = (
        df.reset_index()
          .groupby('ID', group_keys=False)
          .apply(lambda x: x.interpolate('pad', limit_area='inside', axis=1))
          .set_index(['ID', 'Date'])
    )
    

    Option 2

    transform

    The basic operation being:

    cols = ['ValueA', 'ValueB', 'ValueC']
    
    (
        df.reset_index()
          .groupby('ID')[cols]
          .transform(lambda x: x.interpolate('pad', limit_area='inside'))
    )
    

    You can assign that back in a one-liner:

    out2 = (
        df.reset_index()
          .pipe(
              lambda x: x.assign(
                  **x.groupby('ID')[cols].transform(
                      lambda x: x.interpolate('pad', limit_area='inside')
                      )
                  )
              )
          .set_index(['ID', 'Date'])
    )
    
    out2.equals(out)
    # True
    

    Output:

                   ValueA  ValueB  ValueC
    ID Date                              
    A  01-01-2020     NaN     NaN     NaN
       01-02-2020     1.0     5.0     7.0
       01-03-2020     1.0     6.0     8.0
       01-04-2020     2.0     NaN     8.0
       01-05-2020     2.0     NaN     7.0
    B  01-01-2020     2.0     NaN     6.0
       01-02-2020     3.0     5.0     7.0
       01-03-2020     3.0     5.0     7.0
       01-04-2020     5.0     5.0    10.0
       01-05-2020     5.0     8.0     NaN
    

    Data used

    import pandas as pd
    import numpy as np
    
    data = {'ID': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B', 
                   7: 'B', 8: 'B', 9: 'B'}, 
            'Date': {0: '01-01-2020', 1: '01-02-2020', 2: '01-03-2020', 3: '01-04-2020', 
                     4: '01-05-2020', 5: '01-01-2020', 6: '01-02-2020', 7: '01-03-2020', 
                     8: '01-04-2020', 9: '01-05-2020'}, 
            'ValueA': {0: np.nan, 1: 1.0, 2: np.nan, 3: 2.0, 4: 2.0, 5: 2.0, 6: 3.0, 
                       7: np.nan, 8: 5.0, 9: 5.0}, 
            'ValueB': {0: np.nan, 1: 5.0, 2: 6.0, 3: np.nan, 4: np.nan, 5: np.nan, 
                       6: 5.0, 7: np.nan, 8: np.nan, 9: 8.0}, 
            'ValueC': {0: np.nan, 1: 7.0, 2: 8.0, 3: np.nan, 4: 7.0, 5: 6.0, 6: 7.0, 
                       7: np.nan, 8: 10.0, 9: np.nan}}
    
    df = pd.DataFrame(data).set_index(['ID', 'Date'])