I have a date-level promotion data frame that looks something like this:
ID | Date | Promotions | Converted to customer |
---|---|---|---|
1 | 2-Jan | 2 | 0 |
1 | 10-Jan | 3 | 1 |
1 | 14-Jan | 3 | 0 |
2 | 10-Jan | 19 | 1 |
2 | 10-Jan | 8 | 0 |
2 | 10-Jan | 12 | 0 |
Now I want to find the number of promotions it took to convert someone into a customer. For example, it took (2 + 3) promotions to convert ID 1 into a customer and 19 to convert ID 2.
E.g.
ID | Promotions |
---|---|
1 | 5 |
2 | 19 |
I am unable to think of a way to solve this. Can you please help me?
@Corralien and mozway have helped with a solution in Python (pandas), but I am unable to implement it in PySpark, which I need because of the huge dataframe size (>1 TB).
Use one groupby.apply to generate a boolean mask that hides every row after each ID's first conversion, then one groupby.sum for the total:
# Per ID: True for every row strictly after the first conversion
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
       )
# Drop those rows, then sum promotions up to and including the conversion
out = df[~mask].groupby('ID')['Promotions'].sum()
Output:
ID
1 5
2 19
Name: Promotions, dtype: int64
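For reference, here is a minimal, self-contained way to rebuild the sample frame and inspect the intermediate mask (the DataFrame constructor is an assumption reconstructed from the question's table):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'Date': ['2-Jan', '10-Jan', '14-Jan', '10-Jan', '10-Jan', '10-Jan'],
    'Promotions': [2, 3, 3, 19, 8, 12],
    'Converted to customer': [0, 1, 0, 1, 0, 0],
})

mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax()))

# True marks rows strictly after each ID's first conversion
print(mask.tolist())   # [False, False, True, False, True, True]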
Alternative output:
df[~mask].groupby('ID', as_index=False).agg(**{'Number': ('Promotions', 'sum')})
Output:
ID Number
0 1 5
1 2 19
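Side note: the **{...} unpacking in agg is only needed when the output column name is not a valid Python identifier; since Number is a valid name, the plain keyword form works identically:
# Equivalent named aggregation without dict unpacking
df[~mask].groupby('ID', as_index=False).agg(Number=('Promotions', 'sum'))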
If you potentially have groups without conversion to customer, you might want to also aggregate the "Converted to customer" column as an indicator:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
       )
out = (df[~mask]
       .groupby('ID', as_index=False)
       .agg(**{'Number': ('Promotions', 'sum'),
               'Converted': ('Converted to customer', 'max'),  # 1 if the ID converted, else 0
              })
)
Output:
ID Number Converted
0 1 5 1
1 2 19 1
2 3 39 0
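If you then want to isolate the IDs that never converted, a simple filter on the indicator column of the out frame above does it:
# Groups that never converted (Converted == 0)
never_converted = out[out['Converted'].eq(0)]
print(never_converted)
#    ID  Number  Converted
# 2   3      39          0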
Alternative input:
ID Date Promotions Converted to customer
0 1 2-Jan 2 0
1 1 10-Jan 3 1
2 1 14-Jan 3 0
3 2 10-Jan 19 1
4 2 10-Jan 8 0
5 2 10-Jan 12 0
6 3 10-Jan 19 0 # this group has
7 3 10-Jan 8 0 # no conversion
8 3 10-Jan 12 0 # to customer
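Regarding the PySpark part of the question: the same "rows up to the first conversion" logic translates naturally to a window function, which avoids any group-wise apply and stays fully distributed. This is only a sketch under assumptions: the frame is loaded as sdf, and there is a column giving a reliable row order within each ID (the string Date from the sample would need to be parsed to a real date first):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count conversions in all strictly earlier rows of the same ID; 0 (null on
# the first row) means we are at or before the first conversion
w = (Window.partitionBy('ID')
     .orderBy('Date')  # assumption: 'Date' is already a sortable type
     .rowsBetween(Window.unboundedPreceding, -1))

out = (sdf
       .withColumn('prior_conv',
                   F.coalesce(F.sum('Converted to customer').over(w), F.lit(0)))
       .where(F.col('prior_conv') == 0)
       .groupBy('ID')
       .agg(F.sum('Promotions').alias('Number'),
            F.max('Converted to customer').alias('Converted')))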