Search code examples
pandasdataframemaxrowwise

Pandas dataframe finding largest N elements of each row with row-specific N


I have a DataFrame:

>>> df = pd.DataFrame({'row1' : [1,2,np.nan,4,5], 'row2' : [11,12,13,14,np.nan], 'row3':[22,22,23,24,25]}, index = 'a b c d e'.split()).T
>>> df
         a     b     c     d     e
row1   1.0   2.0   NaN   4.0   5.0
row2  11.0  12.0  13.0  14.0   NaN
row3  22.0  22.0  23.0  24.0  25.0

and a Series that specifies the number of top N values I want from each row

>>> n_max = pd.Series([2,3,4])

What is Panda's way of using df and n_max to find the largest N elements of each (breaking ties with a random pick, just as .nlargest() would do)?

The desired output is

         a     b     c     d     e
row1   NaN   NaN   NaN   4.0   5.0
row2   NaN  12.0  13.0  14.0   NaN
row3  22.0   NaN  23.0  24.0  25.0

I know how to do this with a uniform/fixed N across all rows (say, N=4). Note the tie-breaking in row3:

>>> df.stack().groupby(level=0).nlargest(4).unstack().reset_index(level=1, drop=True).reindex(columns=df.columns)
         a     b     c     d     e
row1   1.0   2.0   NaN   4.0   5.0
row2  11.0  12.0  13.0  14.0   NaN
row3  22.0   NaN  23.0  24.0  25.0

But the goal, again, is to have row-specific N. Looping through each row obviously doesn't count (for performance reasons). And I've tried using .rank() with a mask but tie breaking doesn't work there...


Solution

  • Based on @ScottBoston's comment on the OP, it is possible to use the following mask based on rank to solve this problem:

    >>> n_max.index = df.index
    >>> df_rank = df.stack(dropna=False).groupby(level=0).rank(ascending=False, method='first').unstack()
    >>> selected = df_rank.le(n_max, axis=0)
    >>> df[selected]
             a     b     c     d     e
    row1   NaN   NaN   NaN   4.0   5.0
    row2   NaN  12.0  13.0  14.0   NaN
    row3  22.0   NaN  23.0  24.0  25.0