Tags: python, pandas, apply, pandasql

Pandas speedup apply on max()


I'd like to know how I may speed up the following function, e.g. with Cython?

def groupby_maxtarget(df, group, target):
    # For each group, keep the rows where `target` equals that group's max
    df_grouped = df.groupby([group]).apply(lambda g: g[g[target] == g[target].max()])
    return df_grouped

This function groups by a single column and returns, for each group, all rows where the target column reaches that group's maximum value.

If df has, say, 500K rows, the above function takes about 5 minutes to complete on my computer. This performance is OK, but I have data with over 10 million rows... Of course, one solution would be to execute the above as a query within a SQL server and have Python retrieve the result, but I was hoping for an SQL-free, Pythonic solution.
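For reference, here is a minimal, self-contained run of the function above on a toy frame (column names and values chosen arbitrarily for illustration):

```python
import pandas as pd

def groupby_maxtarget(df, group, target):
    # For each group, keep the rows where `target` equals that group's max
    return df.groupby([group]).apply(lambda g: g[g[target] == g[target].max()])

df = pd.DataFrame({'A': ['x', 'x', 'y', 'y', 'y'],
                   'B': [1, 3, 2, 5, 5]})

# Group 'x' keeps its single max row (B=3); group 'y' keeps both
# rows that tie for the max (B=5).
print(groupby_maxtarget(df, 'A', 'B'))
```

Note that ties within a group are all kept, and `apply` adds the group key as an outer index level.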


Solution

  • In [22]: pd.set_option('max_rows',20)
    
    In [33]: N = 10000000
    
    In [34]: df = DataFrame({'A' : np.random.randint(0,100,size=N), 'B' : np.random.randint(0,100,size=N)})
    
    In [35]: df[df.groupby('A')['B'].transform('max') == df['B']]
    Out[35]: 
              A   B
    161      30  99
    178      53  99
    264      58  99
    337      96  99
    411      44  99
    428      85  99
    500      84  99
    598      98  99
    602      24  99
    684      31  99
    ...      ..  ..
    9999412  25  99
    9999482  35  99
    9999502   6  99
    9999537  24  99
    9999579  65  99
    9999680  32  99
    9999713  74  99
    9999886  90  99
    9999887  57  99
    9999991  45  99
    
    [100039 rows x 2 columns]
    
    In [36]: %timeit df[df.groupby('A')['B'].transform('max') == df['B']]
    1 loops, best of 3: 1.85 s per loop
    

    Note that the runtime is proportional to the number of groups, but the coefficient is pretty small: e.g., with 100x more groups it is only about 2x slower. transform is quite efficient because it broadcasts each group's max back to every row.

    In [8]: G = 100
    
    In [9]: df = DataFrame({'A' : np.random.randint(0,G,size=N), 'B' : np.random.randint(0,G,size=N)})
    
    In [10]: %timeit df[df.groupby('A')['B'].transform('max') == df['B']]
    1 loops, best of 3: 1.86 s per loop
    
    In [11]: G = 10000
    
    In [12]: df = DataFrame({'A' : np.random.randint(0,G,size=N), 'B' : np.random.randint(0,G,size=N)})
    
    In [13]: %timeit df[df.groupby('A')['B'].transform('max') == df['B']]
    1 loops, best of 3: 3.95 s per loop
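Outside an IPython session, the same idea as a self-contained sketch (function name is mine), including a check that it returns the same rows as the original groupby/apply version:

```python
import numpy as np
import pandas as pd

def groupby_maxtarget_fast(df, group, target):
    # transform('max') broadcasts each group's max back onto the original
    # index, so a single vectorized comparison filters the frame -- no
    # per-group Python lambda is ever called.
    return df[df.groupby(group)[target].transform('max') == df[target]]

rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(0, 100, size=100_000),
                   'B': rng.integers(0, 100, size=100_000)})

fast = groupby_maxtarget_fast(df, 'A', 'B')
slow = df.groupby(['A']).apply(lambda g: g[g['B'] == g['B'].max()])

# Same rows either way; apply merely adds an outer group-key index level.
assert len(fast) == len(slow)
```

The `transform`-based version also preserves the original flat index and row order, which is often what you want anyway.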