Search code examples
pythonpython-3.xpandaspandas-groupbynan

Pandas Top N% of values within each group while ignoring missing values


I have a df and I want, for each line (which correspond to a month), to have a list of 50% of the highest values of "B" in that line.

month A B
1994-07 A 50
1994-07 B 60
1994-07 C 70
1994-07 D 80
1994-07 E NAN
1994-07 F NAN
1994-08 A 90
1994-08 B 60
1994-08 C 70
1994-08 D 95
1994-08 E 100
1994-08 F 110
1994-08 G NAN

For July/1994 I only have 4 columns "B" filled with values, so 50% will be the 2 highest MV. For the month after, I have 6 stocks, which gives me 3 highest values:

month A B
1994-07 C 70
1994-07 D 80
1994-08 D 95
1994-08 E 100
1994-08 F 110

I have tried:

df = df.groupby(pd.Grouper(freq="M")).apply(lambda g: g.nsmallest(len(g)//2, 'B'))

However, it does not ignore "NAN" and count it as a number. For example, for July/1994 it counts 6 values, and so it returns me the 3 (50% of 6) the highest values from the month. Instead, it should count that there are 4 values and return me the 2 highest.


Solution

  • IIUC, you want to use g['B'].count() as count ignores NaNs:

    (df
     .groupby('month')
     .apply(lambda g: g.nlargest(g['B'].count()//2, columns='B'))
     .droplevel(0).sort_index()
    )
    

    Alternatively, dropna first:

    (df
     .dropna(subset='B')
     .groupby('month')
     .apply(lambda g: g.nlargest(len(g)//2, 'B'))
     .droplevel(0).sort_index()
    )
    

    output:

          month  A      B
    2   1994-07  C   70.0
    3   1994-07  D   80.0
    9   1994-08  D   95.0
    10  1994-08  E  100.0
    11  1994-08  F  110.0