Search code examples
pythongroup-byindexingpandasargmax

How to get rows in pandas data frame, with maximal values in a column and keep the original index?


I have a pandas data frame. In the first column it can have the same value several times (in other words, the values in the first column are not unique).

Whenever I have several rows that contain the same value in the first column, I would like to leave only those that have maximal value in the third column. I almost found a solution:

import pandas

ls = []
ls.append({'c1':'a', 'c2':'a', 'c3':1})
ls.append({'c1':'a', 'c2':'c', 'c3':3})
ls.append({'c1':'a', 'c2':'b', 'c3':2})
ls.append({'c1':'b', 'c2':'b', 'c3':10})
ls.append({'c1':'b', 'c2':'c', 'c3':12})
ls.append({'c1':'b', 'c2':'a', 'c3':7})

df = pandas.DataFrame(ls, columns=['c1','c2','c3'])
print df
print '--------------------'
print df.groupby('c1').apply(lambda df:df.irow(df['c3'].argmax()))

As a result I get:

  c1 c2  c3
0  a  a   1
1  a  c   3
2  a  b   2
3  b  b  10
4  b  c  12
5  b  a   7
--------------------
   c1 c2  c3
c1          
a   a  c   3
b   b  c  12

My problem is that, I do not want to have c1 as index. What I want to have is following:

  c1 c2  c3
1  a  c   3
4  b  c  12

Solution

  • When calling df.groupby(...).apply(foo), the type of object returned by foo affects the way the results are melded together.

    If you return a Series, the index of the Series become columns of the final result, and the groupby key becomes the index (a bit of a mind-twister).

    If instead you return a DataFrame, the final result uses the index of the DataFrame as index values, and the columns of the DataFrame as columns (very sensible).

    So, you can arrange for the type of output you desire by converting your Series into a DataFrame.

    With Pandas 0.13 you can use the to_frame().T method:

    def maxrow(x, col):
        return x.loc[x[col].argmax()].to_frame().T
    
    result = df.groupby('c1').apply(maxrow, 'c3')
    result = result.reset_index(level=0, drop=True)
    print(result)
    

    yields

      c1 c2  c3
    1  a  c   3
    4  b  c  12
    

    In Pandas 0.12 or older, the equivalent would be:

    def maxrow(x, col):
        ser = x.loc[x[col].idxmax()]
        df = pd.DataFrame({ser.name: ser}).T
        return df
    

    By the way, behzad.nouri's clever and elegant solution is quicker than mine for small DataFrames. The sort lifts the time complexity from O(n) to O(n log n) however, so it becomes slower than the to_frame solution shown above when applied to larger DataFrames.

    Here is how I benchmarked it:

    import pandas as pd
    import numpy as np
    import timeit
    
    
    def reset_df_first(df):
        df2 = df.reset_index()
        result = df2.groupby('c1').apply(lambda x: x.loc[x['c3'].idxmax()])
        result.set_index(['index'], inplace=True)
        return result
    
    def maxrow(x, col):
        result = x.loc[x[col].argmax()].to_frame().T
        return result
    
    def using_to_frame(df):
        result = df.groupby('c1').apply(maxrow, 'c3')
        result.reset_index(level=0, drop=True, inplace=True)
        return result
    
    def using_sort(df):
        return df.sort('c3').groupby('c1', as_index=False).tail(1)
    
    
    for N in (100, 1000, 2000):
        df = pd.DataFrame({'c1': {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'},
                           'c2': {0: 'a', 1: 'c', 2: 'b', 3: 'b', 4: 'c', 5: 'a'},
                           'c3': {0: 1, 1: 3, 2: 2, 3: 10, 4: 12, 5: 7}})
    
        df = pd.concat([df]*N)
        df.reset_index(inplace=True, drop=True)
    
        timing = dict()
        for func in (reset_df_first, using_to_frame, using_sort):
            timing[func] = timeit.timeit('m.{}(m.df)'.format(func.__name__),
                                  'import __main__ as m ',
                                  number=10)
    
        print('For N = {}'.format(N))
        for func in sorted(timing, key=timing.get):
            print('{:<20}: {:<0.3g}'.format(func.__name__, timing[func]))
        print
    

    yields

    For N = 100
    using_sort          : 0.018
    using_to_frame      : 0.0265
    reset_df_first      : 0.0303
    
    For N = 1000
    using_to_frame      : 0.0358    \
    using_sort          : 0.036     / this is roughly where the two methods cross over in terms of performance
    reset_df_first      : 0.0432
    
    For N = 2000
    using_to_frame      : 0.0457
    reset_df_first      : 0.0523
    using_sort          : 0.0569
    

    (reset_df_first was another possibility I tried.)