Tags: pyspark, group-by, pivot-table

Is there a simple way in PySpark to find out the number of promotions it took to convert someone into a customer?


I have a date-level promotion data frame that looks something like this:

ID    Date    Promotions    Converted to customer
 1    2-Jan            2                        0
 1   10-Jan            3                        1
 1   14-Jan            3                        0
 2   10-Jan           19                        1
 2   10-Jan            8                        0
 2   10-Jan           12                        0

Now I want to see how many promotions it took to convert someone into a customer. For example, it took (2 + 3) promotions to convert ID 1 into a customer and (19) to convert ID 2.

E.g.:

ID    Promotions
 1             5
 2            19

I am unable to think of an idea to solve it. Can you please help me?

@Corralien and mozway have helped with a solution in Python (pandas), but I am unable to use it because of the huge dataframe size (>1 TB), so I need to implement it in PySpark.


Solution

  • Use one groupby to generate a mask that hides the rows after each ID's first conversion, then one groupby.sum for the sum (a PySpark sketch of the same idea follows the output below):

    # per ID: flag the conversion row, shift so that row itself is kept,
    # then cummax marks every row after the first conversion
    mask = (df.groupby('ID', group_keys=False)['Converted to customer']
              .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
           )
    
    # drop the masked rows and sum the promotions per ID
    out = df[~mask].groupby('ID')['Promotions'].sum()
    

    Output:

    ID
    1     5
    2    19
    Name: Promotions, dtype: int64
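
    Since the question asks for PySpark on a >1 TB frame, here is a minimal PySpark sketch of the same mask-then-sum idea. It assumes the data sits in a Spark DataFrame named sdf with the columns shown above, and that Date (or some replacement column) gives a deterministic order within each ID; the names w, cum, prev_conv and after_conv are made up for the example:

    from pyspark.sql import Window
    import pyspark.sql.functions as F

    # order rows within each ID; swap 'Date' for a properly sortable column if needed
    w = Window.partitionBy('ID').orderBy('Date')

    # previous row's conversion flag (0 for the first row of each ID)
    sdf = sdf.withColumn('prev_conv',
                         F.lag('Converted to customer', 1, 0).over(w))

    # running max of that flag: 1 on every row strictly after the first conversion
    cum = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)
    sdf = sdf.withColumn('after_conv', F.max('prev_conv').over(cum))

    # keep rows up to and including the conversion, then sum promotions per ID
    out = (sdf.filter(F.col('after_conv') == 0)
              .groupBy('ID')
              .agg(F.sum('Promotions').alias('Number')))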
    

    Alternative output:

    df[~mask].groupby('ID', as_index=False).agg(**{'Number': ('Promotions', 'sum')})
    

    Output:

       ID  Number
    0   1       5
    1   2      19
    

    If you potentially have groups without any conversion to customer, you might also want to aggregate the "Converted to customer" column as an indicator (a PySpark version of this tweak is sketched after the alternative input below):

    mask = (df.groupby('ID', group_keys=False)['Converted to customer']
              .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
           )
    
    out = (df[~mask]
           .groupby('ID', as_index=False)
           .agg(**{'Number': ('Promotions', 'sum'),
                   'Converted': ('Converted to customer', 'max')
                  })
          )
    

    Output:

       ID  Number  Converted
    0   1       5          1
    1   2      19          1
    2   3      39          0
    

    Alternative input:

       ID    Date  Promotions  Converted to customer
    0   1   2-Jan           2                      0
    1   1  10-Jan           3                      1
    2   1  14-Jan           3                      0
    3   2  10-Jan          19                      1
    4   2  10-Jan           8                      0
    5   2  10-Jan          12                      0
    6   3  10-Jan          19                      0 # this group has
    7   3  10-Jan           8                      0 # no conversion
    8   3  10-Jan          12                      0 # to customer
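
    The same tweak carries over to the PySpark sketch shown earlier (reusing the assumed sdf with its after_conv column): aggregate the conversion flag alongside the promotion sum so that IDs which never convert still appear, marked with Converted = 0.

    out = (sdf.filter(F.col('after_conv') == 0)
              .groupBy('ID')
              .agg(F.sum('Promotions').alias('Number'),
                   F.max('Converted to customer').alias('Converted')))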