Search code examples
pythonpandaspandas-groupby

Pandas: Fill NaNs with next non-NaN / # consecutive NaNs


I'm looking to take a pandas series and fill NaN with the average of the next numerical value where: average = next numerical value / (# consecutive NaNs + 1)

Here's my code so far, I just can't figure out how to divide the filler column among the NaNs (and the next numerical value as well) in num:

import pandas as pd

dates = pd.date_range(start = '1/1/2016',end = '1/12/2016', freq = 'D')
nums = [10, 12, None, None, 39, 10, 11, None, None, None, None, 60]

df = pd.DataFrame({
        'date':dates, 
        'num':nums
        })

df['filler'] = df['num'].fillna(method = 'bfill')

Current Output:

         date   num  filler
0  2016-01-01  10.0    10.0
1  2016-01-02  12.0    12.0
2  2016-01-03   NaN    39.0
3  2016-01-04   NaN    39.0
4  2016-01-05  39.0    39.0
5  2016-01-06  10.0    10.0
6  2016-01-07  11.0    11.0
7  2016-01-08   NaN    60.0
8  2016-01-09   NaN    60.0
9  2016-01-10   NaN    60.0
10 2016-01-11   NaN    60.0
11 2016-01-12  60.0    60.0

Desired Output:

         date   num
0  2016-01-01  10.0
1  2016-01-02  12.0
2  2016-01-03  13.0
3  2016-01-04  13.0
4  2016-01-05  13.0
5  2016-01-06  10.0
6  2016-01-07  11.0
7  2016-01-08  12.0
8  2016-01-09  12.0
9  2016-01-10  12.0
10 2016-01-11  12.0
11 2016-01-12  12.0

Solution

    • Take a reverse cumsum of notnull
    • Use that to groupby and transform with mean

    csum = df.num.notnull()[::-1].cumsum()
    filler = df.num.fillna(0).groupby(csum).transform('mean')
    df.assign(filler=filler)
    
             date   num  filler
    0  2016-01-01  10.0    10.0
    1  2016-01-02  12.0    12.0
    2  2016-01-03   NaN    13.0
    3  2016-01-04   NaN    13.0
    4  2016-01-05  39.0    13.0
    5  2016-01-06  10.0    10.0
    6  2016-01-07  11.0    11.0
    7  2016-01-08   NaN    12.0
    8  2016-01-09   NaN    12.0
    9  2016-01-10   NaN    12.0
    10 2016-01-11   NaN    12.0
    11 2016-01-12  60.0    12.0
    

    how it works

    • df.num.notnull().cumsum() is a standard technique to find groups of contiguous nulls. However, I wanted my groups to end with the next numeric value. So I reversed the series and then cumsum'd.
    • I want my average to include the number of nulls. Easiest way to do that is to fill with zero and take a normal mean over the groups I I just made.
    • transform to broadcast across the existing index
    • assign new column. Despite having reversed the series, the index will realign like magic. Could have used loc but that overwrites the existing df. I'll let OP decide to overwrite if they want to.