Search code examples
pythonpandaspandas-groupbynanfillna

How to fill NAs with median of means of 2-column groupby in pandas?


Working with pandas, I have a dataframe with two hierarchies A and B, where B can be NaN, and I want to fill some NaNs in D in a particular way:

In the example below, A has "B-subgroups" where there are no values at all for D (e.g. (1, 1)), while A also has values for D in other subgroups (e.g. (1, 3)).

Now I want to get the mean of each subgroup (120, 90 and 75 for A==1), find the median of these means (90 for A==1) and use this median to fill NaNs in the other subgroups of A==1.

Groups like A==2, where there are only NaNs for D, should not be filled.

Groups like A==3, where there are some values for D but only rows with B being NaN have NaN in D, should not be filled if possible (I intend to fill these later with the mean of all values of D of their whole A groups).

Example df:

d = {'A': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3], 
    'B': [1, 2, 3, 3, 4, 5, 6, 1, 1, np.NaN, np.NaN], 
    'D': [np.NaN, np.NaN, 120, 120, 90, 75, np.NaN, np.NaN, 60, 50, np.NaN]}
df = pd.DataFrame(data=d)
A       B       D
1       1       NaN
1       2       NaN
1       3       120
1       3       120
1       4       90
1       5       75
1       6       NaN
2       1       NaN
3       1       60
3       NaN     50
3       NaN     NaN

Expected result:

A       B       D
1       1       90
1       2       90
1       3       120
1       3       120
1       4       90
1       5       75
1       6       90
2       1       NaN
3       1       60
3       NaN     50
3       NaN     NaN

With df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median') or .median() I seem to get the right values, but using

df['D'] = df['D'].fillna(
    df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median')
)

does not seem to change any values in D.

Any help is greatly appreciated, I've been stuck on this for a while and cannot find any solution anywhere.


Solution

  • Your first step is correct. After that we use Series.map to map the correct medians to each group in column A.

    Finally we use np.where to conditionally fill in column D if B is not NaN:

    medians = df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median')
    df['D'] = np.where(df['B'].notna(),                        # if B is not NaN
                       df['D'].fillna(df['A'].map(medians)),   # fill in the median
                       df['D'])                                # else keep the value of column D
    
        A    B      D
    0   1 1.00  90.00
    1   1 2.00  90.00
    2   1 3.00 120.00
    3   1 3.00 120.00
    4   1 4.00  90.00
    5   1 5.00  75.00
    6   1 6.00  90.00
    7   2 1.00    nan
    8   3 1.00  60.00
    9   3  nan  50.00
    10  3  nan    nan