I have a date-level promotion data frame that looks something like this:
ID | Date | Promotions | Converted to customer |
---|---|---|---|
1 | 2-Jan | 2 | 0 |
1 | 10-Jan | 3 | 1 |
1 | 14-Jan | 3 | 0 |
2 | 10-Jan | 19 | 1 |
2 | 10-Jan | 8 | 0 |
2 | 10-Jan | 12 | 0 |
Now I want to find the number of promotions it took to convert someone into a customer. For example, it took (2 + 3) promotions to convert ID 1 into a customer and 19 to convert ID 2.
E.g.
ID | Promotions |
---|---|
1 | 5 |
2 | 19 |
I am unable to think of a way to solve this. Can you please help me?
@Corralien and mozway have helped with a solution in Python (pandas), but I am unable to implement it in PySpark, which I need because of the huge dataframe size (>1 TB).
Use one groupby.apply to generate a boolean mask that hides every row after each ID's first conversion, then one groupby.sum for the total:
# Per ID: True for every row strictly after the first conversion
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
       )
# Drop those rows, then sum promotions up to and including the conversion
out = df[~mask].groupby('ID')['Promotions'].sum()
Output:
ID
1 5
2 19
Name: Promotions, dtype: int64
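For reference, here is a minimal, self-contained way to rebuild the sample frame and inspect the intermediate mask (the DataFrame constructor is an assumption reconstructed from the question's table):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'Date': ['2-Jan', '10-Jan', '14-Jan', '10-Jan', '10-Jan', '10-Jan'],
    'Promotions': [2, 3, 3, 19, 8, 12],
    'Converted to customer': [0, 1, 0, 1, 0, 0],
})

mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax()))

# True marks rows strictly after each ID's first conversion
print(mask.tolist())   # [False, False, True, False, True, True]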
Alternative output:
df[~mask].groupby('ID', as_index=False).agg(**{'Number': ('Promotions', 'sum')})
Output:
ID Number
0 1 5
1 2 19
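Side note: the **{...} unpacking in agg is only needed when the output column name is not a valid Python identifier; since Number is a valid name, the plain keyword form works identically:
# Equivalent named aggregation without dict unpacking
df[~mask].groupby('ID', as_index=False).agg(Number=('Promotions', 'sum'))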
If you potentially have groups without conversion to customer, you might want to also aggregate the "Converted to customer" column as an indicator:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
       )
out = (df[~mask]
       .groupby('ID', as_index=False)
       .agg(**{'Number': ('Promotions', 'sum'),
               'Converted': ('Converted to customer', 'max'),  # 1 if the ID converted, else 0
              })
)
Output:
ID Number Converted
0 1 5 1
1 2 19 1
2 3 39 0
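If you then want to isolate the IDs that never converted, a simple filter on the indicator column of the out frame above does it:
# Groups that never converted (Converted == 0)
never_converted = out[out['Converted'].eq(0)]
print(never_converted)
#    ID  Number  Converted
# 2   3      39          0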
Alternative input:
ID Date Promotions Converted to customer
0 1 2-Jan 2 0
1 1 10-Jan 3 1
2 1 14-Jan 3 0
3 2 10-Jan 19 1
4 2 10-Jan 8 0
5 2 10-Jan 12 0
6 3 10-Jan 19 0 # this group has
7 3 10-Jan 8 0 # no conversion
8 3 10-Jan 12 0 # to customer
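Regarding the PySpark part of the question: the same "rows up to the first conversion" logic translates naturally to a window function, which avoids any group-wise apply and stays fully distributed. This is only a sketch under assumptions: the frame is loaded as sdf, and there is a column giving a reliable row order within each ID (the string Date from the sample would need to be parsed to a real date first):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count conversions in all strictly earlier rows of the same ID; 0 (null on
# the first row) means we are at or before the first conversion
w = (Window.partitionBy('ID')
     .orderBy('Date')  # assumption: 'Date' is already a sortable type
     .rowsBetween(Window.unboundedPreceding, -1))

out = (sdf
       .withColumn('prior_conv',
                   F.coalesce(F.sum('Converted to customer').over(w), F.lit(0)))
       .where(F.col('prior_conv') == 0)
       .groupBy('ID')
       .agg(F.sum('Promotions').alias('Number'),
            F.max('Converted to customer').alias('Converted')))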