Tags: python, pandas, time-series, series

Calculating revenue churn using pandas


Say I have a pandas dataframe, df:

cust | year | revenue
1 | 2013 | 100
1 | 2013 | 50
2 | 2013 | 70
2 | 2015 | 10
3 | 2016 | 10
3 | 2019 | 65
... 

I want to calculate the revenue lost when a customer stops doing business. For example, since customer #1 stopped doing business after 2013, when they generated $150, we would say there is $150 of churn in 2014.

I want to get the sum of all the churned revenue (lost revenue) by year. The output would be something like:

YEAR
2013 0
2014 150
2015 0
2016 10
2017 0
2018 0
2019 0
2020 65

My current logic is as follows: get revenue from the max year/latest transaction for each individual customer, and sum all those values, grouped by year.
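
A minimal sketch of that logic (assuming the example df above; variable names are illustrative):

    import pandas as pd
    
    df = pd.DataFrame({'cust':    [1, 1, 2, 2, 3, 3],
                       'year':    [2013, 2013, 2013, 2015, 2016, 2019],
                       'revenue': [100, 50, 70, 10, 10, 65]})
    
    # Revenue per customer per year
    per_year = df.groupby(['cust', 'year'], as_index=False)['revenue'].sum()
    
    # Keep each customer's latest year; its revenue churns in the following year
    last = per_year.loc[per_year.groupby('cust')['year'].idxmax()].copy()
    last['churn_year'] = last['year'] + 1
    
    # Sum churned revenue by year over the full range of years
    years = range(df['year'].min(), last['churn_year'].max() + 1)
    churned = (last.groupby('churn_year')['revenue'].sum()
                   .reindex(years, fill_value=0))

This yields 150 in 2014, 10 in 2016, 65 in 2020 and 0 elsewhere. Note it treats every customer's latest year as a departure, including customers whose data simply ends at the edge of the observation window.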


Solution

  • The problem is framed in a way that invites treating "churn" as a binary condition: either a customer continues to do business with your company or they don't. However, you also want to know how much money you lose when a customer leaves, which turns the output from a binary into a numeric variable.

    I have calculated the churn per customer, year on year, and then aggregated all changes in revenue per year for the whole company, which I think solves your issue.

    I've used this synthetic data:

    import numpy as np
    import pandas as pd
    
    # Sets random seed
    np.random.seed(42)
    
    # Sample size
    size = 10**3
    
    # Creates customers series
    unique_customers = np.arange(1, 51)
    
    customers = [unique_customers[i] for i in
                 np.random.randint(low=0, high=len(unique_customers), size=size)]
    
    # Creates date series
    unique_dates = pd.date_range(start="2013-01-01", end="2017-12-31", freq="D")
    
    unique_years = unique_dates.year.unique()
    
    dates = [unique_dates[i] for i in
             np.random.randint(low=0, high=len(unique_dates), size=size)]
    
    # Creates revenues series
    unique_revenues = [100, 50, 70, 10, 65]
    
    revenue = [unique_revenues[i] for i in
               np.random.randint(low=0, high=len(unique_revenues), size=size)]
    
    # Creates Pandas DataFrame
    data = (pd.DataFrame({'Customer': customers,
                         'Date': dates,
                         'Revenue': revenue})
            .set_index('Date')
            .sort_index())
    
    # Randomly picks some customers and sets their revenues to NaN
    # from a random year onwards, to simulate churned customers
    rem_customers = set([unique_customers[i] for i in
                         np.random.randint(low=0, high=len(unique_customers), size=10)])
    
    for cust in rem_customers:
        rem_year = unique_years[np.random.randint(low=2, high=len(unique_years), size=1)].values[0]
        data.loc[(data.index.year >= rem_year) & (data['Customer'] == cust), "Revenue"] = np.nan
    

    This is what the data set looks like:

                Customer  Revenue
    Date                         
    2013-01-01        36    100.0
    2013-01-03         1     10.0
    2013-01-03        47     50.0
    2013-01-04        28     10.0
    2013-01-04        25     65.0
    

    I have deliberately set some customers' revenues to NaN after a given date, to illustrate the problem; in the yearly sums below, those missing values show up as zero revenue.

    You can use groupby() to define the levels of aggregation, diff() to calculate the change in revenue at the given level of aggregation, and resample() to change the time series frequency from daily to yearly. (The 'AS' alias stands for year-start frequency; newer pandas versions spell it 'YS'.)

    # Calculates yearly revenue by customer ('AS' groups by year start)
    revenue_by_customer = data.groupby([pd.Grouper(freq='AS'), 'Customer']).sum()
    
    # Calculates the change in revenue by customer year on year
    diff_revenue_by_customer = (
        revenue_by_customer.groupby(['Customer'])
        .diff(1)
        .rename(columns={'Revenue': 'Revenue_change'})
    )
    
    # Calculates total change in revenue year on year
    diff_revenue_per_year = diff_revenue_by_customer.droplevel(1).resample('AS').sum()
    

    Customer #40 stopped doing business with the company after 2014, and here is how their records look:

    revenue_by_customer.xs(40, level=1, drop_level=False)
    
    >>                      Revenue
    >> Date       Customer         
    >> 2013-01-01 40          285.0
    >> 2014-01-01 40          195.0
    >> 2015-01-01 40            0.0
    >> 2016-01-01 40            0.0
    >> 2017-01-01 40            0.0
    
    diff_revenue_by_customer.xs(40, level=1, drop_level=False)
    
    >>                      Revenue_change
    >> Date       Customer                
    >> 2013-01-01 40                   NaN
    >> 2014-01-01 40                 -90.0
    >> 2015-01-01 40                -195.0
    >> 2016-01-01 40                   0.0
    >> 2017-01-01 40                   0.0
    

    When we add up the changes in revenue across all customers by year, this is the resulting table (note these are net changes, so a year where revenue gains outweigh losses comes out positive, as 2015 does):

    diff_revenue_per_year.head()
    
    >>             Revenue_change
    >> Date               
    >> 2013-01-01      0.0
    >> 2014-01-01  -2490.0
    >> 2015-01-01   2255.0
    >> 2016-01-01  -1545.0
    >> 2017-01-01  -1305.0
    

    You can also add up only the lost revenue year on year, with the following code:

    lost_revenue_per_year = (
        diff_revenue_by_customer
        .loc[diff_revenue_by_customer['Revenue_change'] < 0]
        .droplevel(1)
        .resample('AS')
        .sum()
        .rename(columns={'Revenue_change': 'Lost_revenue'})
    )
    
    lost_revenue_per_year.head()
    
    >>             Lost_revenue
    >> Date                    
    >> 2014-01-01       -4935.0
    >> 2015-01-01       -2725.0
    >> 2016-01-01       -4290.0
    >> 2017-01-01       -3995.0
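
    If you prefer the result as positive churn figures indexed by calendar year, like the desired output in the question, a small illustrative tweak on the lost_revenue_per_year frame above does it:

    # Flips the sign and indexes by calendar year
    churn_by_year = -lost_revenue_per_year['Lost_revenue']
    churn_by_year.index = churn_by_year.index.year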