Tags: python, pandas, time-series, series

Calculating revenue churn using pandas


Say I have a pandas dataframe, df:

cust | year | revenue
1 | 2013 | 100
1 | 2013 | 50
2 | 2013 | 70
2 | 2015 | 10
3 | 2016 | 10
3 | 2019 | 65
... 

I want to calculate the revenue lost when a customer stops doing business. For example, since customer #1 stopped doing business after 2013, when they generated $150, we would say there is $150 of churn in 2014.

I want to get the sum of all the churned revenue (lost revenue) by year. The output would be something like:

YEAR
2013 0
2014 150
2015 0
2016 10
2017 0
2018 0
2019 0
2020 65

My current logic is as follows: get revenue from the max year/latest transaction for each individual customer, and sum all those values, grouped by year.
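
A minimal sketch of that logic (assuming the example df above; variable names are illustrative):

    import pandas as pd
    
    df = pd.DataFrame({'cust':    [1, 1, 2, 2, 3, 3],
                       'year':    [2013, 2013, 2013, 2015, 2016, 2019],
                       'revenue': [100, 50, 70, 10, 10, 65]})
    
    # Revenue per customer per year
    per_year = df.groupby(['cust', 'year'], as_index=False)['revenue'].sum()
    
    # Keep each customer's latest year; its revenue churns in the following year
    last = per_year.loc[per_year.groupby('cust')['year'].idxmax()].copy()
    last['churn_year'] = last['year'] + 1
    
    # Sum churned revenue by year over the full range of years
    years = range(df['year'].min(), last['churn_year'].max() + 1)
    churned = (last.groupby('churn_year')['revenue'].sum()
                   .reindex(years, fill_value=0))

This yields 150 in 2014, 10 in 2016, 65 in 2020 and 0 elsewhere. Note it treats every customer's latest year as a departure, including customers whose data simply ends at the edge of the observation window.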


Solution

  • The problem is framed in a way that invites treating "churn" as a binary condition: either a customer continues to do business with your company or they don't. However, you also want to know how much money you lose when a customer leaves, which turns the output from a binary into a numeric variable.

    I have calculated the churn per customer, year on year, and then aggregated all changes in revenue per year for the whole company, which I think solves your issue.

    I've used this synthetic data:

    import numpy as np
    import pandas as pd
    
    # Sets random seed
    np.random.seed(42)
    
    # Sample size
    size = 10**3
    
    # Creates customers series
    unique_customers = np.arange(1, 51)
    
    customers = [unique_customers[i] for i in
                 np.random.randint(low=0, high=len(unique_customers), size=size)]
    
    # Creates date series
    unique_dates = pd.date_range(start="2013-01-01", end="2017-12-31", freq="D")
    
    unique_years = unique_dates.year.unique()
    
    dates = [unique_dates[i] for i in
             np.random.randint(low=0, high=len(unique_dates), size=size)]
    
    # Creates revenues series
    unique_revenues = [100, 50, 70, 10, 65]
    
    revenue = [unique_revenues[i] for i in
               np.random.randint(low=0, high=len(unique_revenues), size=size)]
    
    # Creates Pandas DataFrame
    data = (pd.DataFrame({'Customer': customers,
                         'Date': dates,
                         'Revenue': revenue})
            .set_index('Date')
            .sort_index())
    
    # Randomly picks some customers and sets their revenues to NaN
    # from a random year onwards, to simulate churned customers
    rem_customers = set([unique_customers[i] for i in
                         np.random.randint(low=0, high=len(unique_customers), size=10)])
    
    for cust in rem_customers:
        rem_year = unique_years[np.random.randint(low=2, high=len(unique_years), size=1)].values[0]
        data.loc[(data.index.year >= rem_year) & (data['Customer'] == cust), "Revenue"] = np.nan
    

    This is what the data set looks like:

                Customer  Revenue
    Date                         
    2013-01-01        36    100.0
    2013-01-03         1     10.0
    2013-01-03        47     50.0
    2013-01-04        28     10.0
    2013-01-04        25     65.0
    

    I have deliberately set some customers' revenues to NaN after a given date, to illustrate the problem; in the yearly sums below, those missing values show up as zero revenue.

    You can use groupby() to define the levels of aggregation, diff() to calculate the change in revenue at the given level of aggregation, and resample() to change the time series frequency from daily to yearly. (The 'AS' alias stands for year-start frequency; newer pandas versions spell it 'YS'.)

    # Calculates yearly revenue by customer ('AS' groups by year start)
    revenue_by_customer = data.groupby([pd.Grouper(freq='AS'), 'Customer']).sum()
    
    # Calculates the change in revenue by customer year on year
    diff_revenue_by_customer = (
        revenue_by_customer.groupby(['Customer'])
        .diff(1)
        .rename(columns={'Revenue': 'Revenue_change'})
    )
    
    # Calculates total change in revenue year on year
    diff_revenue_per_year = diff_revenue_by_customer.droplevel(1).resample('AS').sum()
    

    Customer #40 stopped doing business with the company after 2014, and here is how their records look:

    revenue_by_customer.xs(40, level=1, drop_level=False)
    
    >>                      Revenue
    >> Date       Customer         
    >> 2013-01-01 40          285.0
    >> 2014-01-01 40          195.0
    >> 2015-01-01 40            0.0
    >> 2016-01-01 40            0.0
    >> 2017-01-01 40            0.0
    
    diff_revenue_by_customer.xs(40, level=1, drop_level=False)
    
    >>                      Revenue_change
    >> Date       Customer                
    >> 2013-01-01 40                   NaN
    >> 2014-01-01 40                 -90.0
    >> 2015-01-01 40                -195.0
    >> 2016-01-01 40                   0.0
    >> 2017-01-01 40                   0.0
    

    When we add up the changes in revenue across all customers by year, this is the resulting table (note these are net changes, so a year where revenue gains outweigh losses comes out positive, as 2015 does):

    diff_revenue_per_year.head()
    
    >>             Revenue_change
    >> Date               
    >> 2013-01-01      0.0
    >> 2014-01-01  -2490.0
    >> 2015-01-01   2255.0
    >> 2016-01-01  -1545.0
    >> 2017-01-01  -1305.0
    

    You can also add up only the lost revenue year on year, with the following code:

    lost_revenue_per_year = (
        diff_revenue_by_customer
        .loc[diff_revenue_by_customer['Revenue_change'] < 0]
        .droplevel(1)
        .resample('AS')
        .sum()
        .rename(columns={'Revenue_change': 'Lost_revenue'})
    )
    
    lost_revenue_per_year.head()
    
    >>             Lost_revenue
    >> Date                    
    >> 2014-01-01       -4935.0
    >> 2015-01-01       -2725.0
    >> 2016-01-01       -4290.0
    >> 2017-01-01       -3995.0
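
    If you prefer the result as positive churn figures indexed by calendar year, like the desired output in the question, a small illustrative tweak on the lost_revenue_per_year frame above does it:

    # Flips the sign and indexes by calendar year
    churn_by_year = -lost_revenue_per_year['Lost_revenue']
    churn_by_year.index = churn_by_year.index.year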