Search code examples
pythonpandasdataframeperformanceprocessing-efficiency

Most efficient way for finding 3 rows with maximum value in column?


Lets us say there is a dataframe df

Name  Balance
A     1000
B     5000
C     3000
D     6000
E     2000
F     5000

I am looking for an approach through which I can get three rows with highest balances among all.

df['balance'].get_indices_max(n=3) # where is no. of results required

Output when these indices will be used to get rows:

D 6000
F 5000
B 5000

UPDATE : EXTRA NOTES REGARDING THE ACCEPTED ANSWER

Possible "keep" values -

first : prioritize the first occurrence(s)

last : prioritize the last occurrence(s)

all : do not drop any duplicates, even it means selecting more than n items.

Solution

  • Answer

    df = Df({"Name":list("ABCDEF"), "Balance":[1000,5000,3000,6000,2000,5000]})
    index = df["Balance"].nlargest(3).index
    df.loc[index]
    

    Output

      Name  Balance
    3    D     6000
    1    B     5000
    5    F     5000
    

    Attantion

    1. Performant

    The columns that are not specified are returned as well, but not used for ordering. This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant.

    1. nlargest(3, keep='all')

    keep{‘first’, ‘last’, ‘all’}, default ‘first’

    When using keep='all', all duplicate items are maintained

    Example

    df = Df({"Name":list("ABCDEFX"), "Balance":[1000,5000,3000,6000,2000,5000,5000]})
    index = df["Balance"].nlargest(3, keep='all').index
    df.loc[index]
    
      Name  Balance
    3    D     6000
    1    B     5000
    5    F     5000
    6    X     5000
    

    Reference