
pandas groupby ffill bfill needs intermediate groupby?


I'm trying to paper over missing data in a dataframe by grouping on one column and then flood-filling (ffill().bfill()) subsets of columns inside the groups.

I was previously using

def ffbf(x):
   return x.ffill().bfill()

df[some_cols] = df.groupby(group_key)[some_cols].transform(ffbf)

but transform becomes unbelievably slow even on relatively small dataframes (already several seconds for only 3000x20), so I wanted to see if I could apply ffill and bfill directly to the groups since they're supposed to be cythonized now.

Am I correct in thinking that I need to invoke groupby again in between ffill and bfill because neither method preserves the groupings?

Right now I have

df[some_cols] = df[some_cols].groupby(group_key).ffill().groupby(group_key).bfill()

and I think that it's doing what I want, and it's waaaaaaayyy faster than using transform, but I'm not experienced enough with pandas to be certain, so I figured I'd ask.

[edit] It looks like this change is jumbling my data. Why?


Solution

  • In my opinion, a second groupby before bfill is necessary; without it, a group that contains only NaNs would be filled with values from a neighboring group.

    Timings for both approaches (setup below):

    In [205]: %timeit df1[some_cols] = df1.groupby(group_key)[some_cols].transform(ffbf)
    443 ms ± 7.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [206]: %timeit df[[group_key] + some_cols] = df[[group_key] + some_cols].groupby(group_key).ffill().groupby(group_key).bfill()
    5.69 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    import numpy as np
    import pandas as pd
    
    np.random.seed(785)
    
    N = 10000
    df = pd.DataFrame({'key':np.random.randint(1000, size=N),
                       'A':np.random.choice([1,2,np.nan], size=N),
                       'B':np.random.choice([1,4,np.nan], size=N),
                       'C':np.random.choice([7,0,np.nan], size=N),
                       'D':np.random.choice([7,0,8], size=N)})
    
    df = df.sort_values('key')
    print (df)
    
    def ffbf(x):
       return x.ffill().bfill()
    
    group_key = 'key'
    some_cols = ['A','B','C']
    df1 = df.copy()
    df1[some_cols] = df1.groupby(group_key)[some_cols].transform(ffbf)
    
    # slightly changed solution that works in pandas 0.23.1
    df[[group_key] + some_cols] = df[[group_key] + some_cols].groupby(group_key).ffill().groupby(group_key).bfill()
    
    print (df.equals(df1))
    True
    

    EDIT: In later pandas versions (tested with pandas 1.1.1), it is possible to use:

    df[[group_key] + some_cols] = df[[group_key] + some_cols].groupby(df[group_key]).ffill().groupby(df[group_key]).bfill()
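
To illustrate why the second groupby matters, here is a minimal sketch on made-up toy data. The frame and the names `leaky`, `safe` and `reference` are assumptions for illustration only, not taken from the question. Group "b" has no valid value at all, so it should stay NaN; a plain bfill after the grouped ffill would instead pull in a value from group "c".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "b" has no valid value in column "A".
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "c", "c"],
    "A":   [1.0, np.nan, np.nan, np.nan, np.nan, 3.0],
})

# Grouped ffill followed by an ungrouped bfill: values leak across groups.
leaky = df.groupby("key")["A"].ffill().bfill()

# Grouped ffill followed by a grouped bfill: each group is filled on its own.
safe = df.groupby("key")["A"].ffill().groupby(df["key"]).bfill()

# Reference result from the slower transform-based approach.
reference = df.groupby("key")["A"].transform(lambda s: s.ffill().bfill())

print(leaky.tolist())          # [1.0, 1.0, 3.0, 3.0, 3.0, 3.0]
print(safe.tolist())           # [1.0, 1.0, nan, nan, 3.0, 3.0]
print(safe.equals(reference))  # True
```

Grouping the intermediate result by `df["key"]` (the Series rather than the column name) relies on index alignment, so it should also work when the key column is not part of the frame being filled.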