I have a DataFrame grouped by issue_id to which I want to apply a custom function. The data looks as follows:
import pandas as pd
import numpy as np
sub_test = pd.DataFrame(
    columns=['issue_id', 'step', 'conversion_rate'],
    data=[['01-abc-234', 0, 0.45], ['01-abc-234', 1, 0.35],
          ['01-abc-234', 2, 0.15], ['01-abc-234', 3, 1],
          ['02-abc-234', 0, 0.05], ['02-abc-234', 1, 0.15],
          ['02-abc-234', 2, 0.65], ['02-abc-234', 3, 1]])
sub_test.info()
I want to group by issue_id and apply the following function to each group:
def calculate_conversion_step(row, df):
    # row is the 'step' value of the current row
    if row == 0:
        return np.prod(df.loc[df['step'].isin([1, 2]), 'conversion_rate'])
    elif row == 1:
        return np.prod(df.loc[df['step'] == 2, 'conversion_rate'])
    else:
        return 1
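For concreteness, here is a minimal, self-contained sketch of what the function returns for one issue's rows (using the first issue from the sample data above): step 0 gets the product of the rates at steps 1 and 2, step 1 gets the rate at step 2, and everything else gets 1.

```python
import numpy as np
import pandas as pd

def calculate_conversion_step(row, df):
    if row == 0:
        return np.prod(df.loc[df['step'].isin([1, 2]), 'conversion_rate'])
    elif row == 1:
        return np.prod(df.loc[df['step'] == 2, 'conversion_rate'])
    else:
        return 1

# rows of a single issue, taken from the sample data
one_issue = pd.DataFrame({'issue_id': ['01-abc-234'] * 4,
                          'step': [0, 1, 2, 3],
                          'conversion_rate': [0.45, 0.35, 0.15, 1.0]})
vals = [calculate_conversion_step(s, one_issue) for s in one_issue['step']]
# step 0 -> 0.35 * 0.15 ≈ 0.0525, step 1 -> 0.15, steps 2 and 3 -> 1
```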
Basically, I am iterating over the sub-DataFrame for each individual issue_id and applying the function above to each of its rows. I used .apply(), but my DataFrame is too large for this to perform well.
final = pd.DataFrame()
for issue_id in sub_test['issue_id'].unique():
    # .copy() avoids SettingWithCopyWarning when assigning the new column below
    int_df = sub_test[sub_test['issue_id'] == issue_id].copy()
    # Apply calculate_conversion_step to every row of this issue's sub-DataFrame
    int_df['conversion_step'] = int_df['step'].apply(lambda x: calculate_conversion_step(x, int_df))
    # Concatenate the results for each issue
    final = pd.concat([final, int_df])
Is there any way I can make this faster?
import numpy as np

# boolean masks for rows at steps 0, 1 and 2
cond0, cond1, cond2 = sub_test['step'].eq(0), sub_test['step'].eq(1), sub_test['step'].eq(2)
g = sub_test.groupby('issue_id')['conversion_rate']
# per-group product of the rates at steps 1 and 2 (the value for step-0 rows)
s1 = g.transform(lambda x: x.where(cond1 | cond2).prod())
# per-group rate at step 2 (the value for step-1 rows); sum works because step 2 occurs once per group
s2 = g.transform(lambda x: x.where(cond2).sum())
# step 0 -> s1, step 1 -> s2, everything else -> 1
sub_test['conversion_step'] = np.select([cond0, cond1], [s1, s2], 1)
Output:
issue_id step conversion_rate conversion_step
0 01-abc-234 0 0.45 0.0525
1 01-abc-234 1 0.35 0.1500
2 01-abc-234 2 0.15 1.0000
3 01-abc-234 3 1.00 1.0000
4 02-abc-234 0 0.05 0.0975
5 02-abc-234 1 0.15 0.6500
6 02-abc-234 2 0.65 1.0000
7 02-abc-234 3 1.00 1.0000
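If the transform(lambda ...) calls are still slow on very large frames (each one runs Python code per group), a further option is to precompute one value per group and map it back onto the rows, so no per-group Python function runs at all. This is a sketch under the same assumptions as above (step 2 occurs exactly once per issue); the names prod12 and r2 are illustrative, not from the original post.

```python
import numpy as np
import pandas as pd

sub_test = pd.DataFrame(
    columns=['issue_id', 'step', 'conversion_rate'],
    data=[['01-abc-234', 0, 0.45], ['01-abc-234', 1, 0.35],
          ['01-abc-234', 2, 0.15], ['01-abc-234', 3, 1],
          ['02-abc-234', 0, 0.05], ['02-abc-234', 1, 0.15],
          ['02-abc-234', 2, 0.65], ['02-abc-234', 3, 1]])

# product of conversion_rate over steps 1 and 2, computed once per issue_id
prod12 = sub_test[sub_test['step'].isin([1, 2])].groupby('issue_id')['conversion_rate'].prod()
# conversion_rate at step 2, one value per issue_id
r2 = sub_test[sub_test['step'].eq(2)].set_index('issue_id')['conversion_rate']

cond0, cond1 = sub_test['step'].eq(0), sub_test['step'].eq(1)
sub_test['conversion_step'] = np.select(
    [cond0, cond1],
    [sub_test['issue_id'].map(prod12), sub_test['issue_id'].map(r2)],
    1)
```

This produces the same conversion_step column as the transform-based version, but the heavy lifting (the groupby product and the step-2 lookup) happens once per group in vectorized pandas code.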