I have a DataFrame:
>>> df = pd.DataFrame({'row1' : [1,2,np.nan,4,5], 'row2' : [11,12,13,14,np.nan], 'row3':[22,22,23,24,25]}, index = 'a b c d e'.split()).T
>>> df
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 22.0 23.0 24.0 25.0
and a Series that specifies the number of top N values I want from each row
>>> n_max = pd.Series([2,3,4])
What is Panda's way of using df
and n_max
to find the largest N elements of each (breaking ties with a random pick, just as .nlargest()
would do)?
The desired output is
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
I know how to do this with a uniform/fixed N across all rows (say, N=4). Note the tie-breaking in row3:
>>> df.stack().groupby(level=0).nlargest(4).unstack().reset_index(level=1, drop=True).reindex(columns=df.columns)
a b c d e
row1 1.0 2.0 NaN 4.0 5.0
row2 11.0 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0
But the goal, again, is to have row-specific N
. Looping through each row obviously doesn't count (for performance reasons). And I've tried using .rank()
with a mask but tie breaking doesn't work there...
Based on @ScottBoston's comment on the OP, it is possible to use the following mask based on rank to solve this problem:
>>> n_max.index = df.index
>>> df_rank = df.stack(dropna=False).groupby(level=0).rank(ascending=False, method='first').unstack()
>>> selected = df_rank.le(n_max, axis=0)
>>> df[selected]
a b c d e
row1 NaN NaN NaN 4.0 5.0
row2 NaN 12.0 13.0 14.0 NaN
row3 22.0 NaN 23.0 24.0 25.0