Search code examples
python-3.xpandasdataframegroup-bywhere-clause

pandas fill 0s with mean based on rows that match a condition in another column


I have a dataframe like below in which I need to replace the 0s with the mean of the rows where the parent_key matches the self_key.

Input DataFrame: df= pd.DataFrame ({'self_key':['a','b','c','d','e','e','e','f','f','f'],'parent_key':[np.nan,'a','b','b','c','c','c','d','d','d'], 'value':[0,0,0,0,4,6,14,12,8,22],'level':[1,2,3,3,4,4,4,4,4,4]})

The row 3 has self_key of 'd' so I would need to replace its 0 value in column 'value' with the mean of rows 7,8,9 to fill with the correct value of 14. Since the lower levels feed into the higher levels I would need to do it from lowest level to highest to fill out the dataframe as well but when I do the below code it doesn't work and I get the error "ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional". How can I fill in the 0s with the means from lowest level to highest?

df['value']=np.where((df['value']==0) & (df['level']==3), df['value'].groupby(df.where(df['parent_key']==df['self_key'])).transform('mean'), df['value'])

Input

  self_key parent_key  value  level
0        a        NaN      0      1
1        b          a      0      2
2        c          b      0      3
3        d          b      0      3
4        e          c      4      4
5        e          c      6      4
6        e          c     14      4
7        f          d     12      4
8        f          d      8      4
9        f          d     22      4

My approach is to repeat the above code 3 times and change the level from 3 to 2 to 1, but its not working for even level 3.

Expected Ouput:

  self_key parent_key  value  level
0        a        NaN     11      1
1        b          a     11      2
2        c          b      8      3
3        d          b     14      3
4        e          c      4      4
5        e          c      6      4
6        e          c     14      4
7        f          d     12      4
8        f          d      8      4
9        f          d     22      4

Solution

  • If I understand your problem correctly, you are trying to compute mean in a bottom-up fashion by filtering dataframe on certain keys. If so, then following should solve it:

    for l in range(df["level"].max()-1, 0, -1):
      df_sub = df[(df["level"] == l) & (df["value"] == 0)]
      self_keys = df_sub["self_key"].tolist()
      for k in self_keys:
        df.loc[df_sub[df_sub["self_key"] == k].index, "value"] = df[df["parent_key"] == k]["value"].mean()
    
    [Out]:
      self_key parent_key  value  level
    0        a                11      1
    1        b          a     11      2
    2        c          b      8      3
    3        d          b     14      3
    4        e          c      4      4
    5        e          c      6      4
    6        e          c     14      4
    7        f          d     12      4
    8        f          d      8      4
    9        f          d     22      4