I'd like to know how I might speed up the following function, e.g. with Cython:
def groupby_maxtarget(df, group, target):
    # within each group, keep the rows where `target` equals that group's max
    df_grouped = df.groupby([group]).apply(lambda g: g[g[target] == g[target].max()])
    return df_grouped
This function groups by a single column and, for each group, returns all rows where target attains that group's maximum value.
If df has, say, 500K rows, the above function takes about 5 minutes to complete on my computer. This performance is OK, but I have data with over 10 million rows... Of course, one solution would be to execute the above as a query within a SQL server and have Python retrieve the result, but I was hoping for an SQL-free, Pythonic solution.
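For reference, here is a minimal illustration of the intended behavior on a toy frame (the column names 'A' and 'B' and the values are just placeholders for demonstration):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2],
                   'B': [3, 7, 5, 5, 2]})

# for group A=1 the max of B is 7 (row 1); for A=2 it is 5 (rows 2 and 3),
# so those three rows are returned
groupby_maxtarget(df, group='A', target='B')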
In [22]: pd.set_option('max_rows',20)
In [33]: N = 10000000
In [34]: df = DataFrame({'A' : np.random.randint(0,100,size=N), 'B' : np.random.randint(0,100,size=N)})
In [35]: df[df.groupby('A')['B'].transform('max') == df['B']]
Out[35]:
A B
161 30 99
178 53 99
264 58 99
337 96 99
411 44 99
428 85 99
500 84 99
598 98 99
602 24 99
684 31 99
... .. ..
9999412 25 99
9999482 35 99
9999502 6 99
9999537 24 99
9999579 65 99
9999680 32 99
9999713 74 99
9999886 90 99
9999887 57 99
9999991 45 99
[100039 rows x 2 columns]
In [36]: %timeit df[df.groupby('A')['B'].transform('max') == df['B']]
1 loops, best of 3: 1.85 s per loop
Note that the runtime is proportional to the number of groups, but the coefficient is pretty small: e.g. 100x as many groups is only about 2x slower. transform is quite efficient because it broadcasts each group's max back to the shape of the original frame.
In [8]: G = 100
In [9]: df = DataFrame({'A' : np.random.randint(0,G,size=N), 'B' : np.random.randint(0,G,size=N)})
In [10]: %timeit df[df.groupby('A')['B'].transform('max') == df['B']]
1 loops, best of 3: 1.86 s per loop
In [11]: G = 10000
In [12]: df = DataFrame({'A' : np.random.randint(0,G,size=N), 'B' : np.random.randint(0,G,size=N)})
In [13]: %timeit df[df.groupby('A')['B'].transform('max') == df['B']]
1 loops, best of 3: 3.95 s per loop
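Putting this together, a drop-in replacement for the original groupby_maxtarget (a sketch along the lines above; it assumes group is a single column label, as in the question) would look like:

def groupby_maxtarget(df, group, target):
    # broadcast each group's max of `target` back onto every row, then keep
    # only the rows that attain that max (ties within a group are all kept)
    return df[df.groupby(group)[target].transform('max') == df[target]]

Unlike the apply-based version, the boolean-mask result keeps the original flat index rather than prepending the group key as an extra index level.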