machine-learning statistics data-science analytics random-forest

How to choose the best duration to churn? E.g., this customer will churn within a month

The result would be something similar to:

Customer xxx will churn within n months/weeks.

My data:

Business data about each client [company type, country, contract type, etc.]
Activity data [when they did some activities related to the service...].
A column that shows if they are churned or if they are still active.

My questions are:

a) How can I calculate a good duration for churn? E.g., is a week before they churn good, or should it be longer? Is there a formula to calculate this?

b) How do I prepare my data for the chosen duration? E.g., if we want a one-month churn window [the probability that customer xxx churns within the next month], should I use the activity data only up until last month? In other words, should I exclude last month's activity data from my model?

I'm going to use a random forest model for this task.

Solution

That is a very common business question, and there are a few approaches to it.

The first and most obvious way is to ask the stakeholders, meaning the people who consume the data at a high level. They will likely want to take part in defining what churn is, since that definition effectively determines the number of active users, which they then use as leverage in ads and investor pitches.

The second way is to let the data guide you. Conduct a user-base analysis that measures the existing gaps in user activity. You want those gaps ordered by size and counted. With that information at hand, you will be able to estimate how many users would be counted twice (as churned and then as new) for any churn time you choose.
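As a rough illustration of that gap analysis, here is a minimal sketch in plain Python. The function names (`activity_gaps`, `returns_by_threshold`), the `(user_id, date)` event layout and the sample data are all assumptions made up for this example, not something from the question. It collects the gaps between consecutive activity days per user, counts them by size, and then shows how many gaps would exceed each candidate churn threshold, i.e. how many times a user would be flagged as churned and later come back.

```python
from collections import Counter
from datetime import date


def activity_gaps(events):
    """Return every gap, in days, between consecutive activity dates of each user.

    `events` is an iterable of (user_id, activity_date) pairs (assumed layout).
    """
    by_user = {}
    for user_id, day in events:
        by_user.setdefault(user_id, set()).add(day)

    gaps = []
    for days in by_user.values():
        ordered = sorted(days)
        gaps.extend((later - earlier).days
                    for earlier, later in zip(ordered, ordered[1:]))
    return gaps


def returns_by_threshold(gaps, thresholds):
    """For each candidate churn threshold (in days), count the gaps longer than it.

    Each such gap is a user who would have been marked as churned and then
    came back, so they would be counted twice.
    """
    return {t: sum(1 for g in gaps if g > t) for t in thresholds}


# Made-up sample data, for illustration only.
events = [
    ("a", date(2023, 1, 1)), ("a", date(2023, 1, 8)), ("a", date(2023, 3, 1)),
    ("b", date(2023, 1, 5)), ("b", date(2023, 1, 20)),
    ("c", date(2023, 2, 1)), ("c", date(2023, 2, 2)), ("c", date(2023, 4, 15)),
]

gaps = activity_gaps(events)
gap_histogram = Counter(gaps)  # gaps counted by size
double_counted = returns_by_threshold(gaps, [7, 14, 30, 60])

print(sorted(gap_histogram.items()))
print(double_counted)
```

Reading the result from the longest threshold down shows the trade-off: a longer churn time produces fewer double-counted users, but it also means waiting longer before anyone is declared churned.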

You can obviously mix the two: conduct your analysis of existing activity gaps, bring that data to the executives, and decide together what would be an acceptable compromise. Getting your executives a bit closer to the data is very useful for future cases when your reported data seemingly contradicts some reliable source. Having them closer to the basic analytics definitions will make it easier for them to understand the nature of such conflicts, or even to realize that there is no conflict at all.

Finally, when you calculate churn, the data you use depends on the question you want to answer, but always compare oranges to oranges. You may be tempted to compute a single global churn figure from the beginning of time until today. It may seem useful, but it's not. A better analysis is to compare churn year over year (YoY). Better still is a rolling month-over-month (MoM) churn report, for which you take the users who registered in month N-1 and follow their activity through month N. That way you get a living graph of churn on which you can plot marketing events and new feature releases.

You don't want to exclude the last month's activity from your report. You do want to exclude the last month's registrations. And again, "month" is only a placeholder here: read every "month" as "period". A period of two months or a quarter may suit you better; it depends on your service.
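The rolling report described above could be sketched as follows. This is one possible reading, under stated assumptions: a user who registered in period N-1 counts as churned if they have no activity at all in period N, periods are calendar months, and the helper names (`month_index`, `rolling_cohort_churn`) and the sample data are invented for illustration. Note that all activity is used, including the latest period, while the newest registration cohort is left out because it has no follow-up period yet.

```python
from datetime import date


def month_index(day):
    """Map a date to a running month number so consecutive months differ by 1."""
    return day.year * 12 + (day.month - 1)


def rolling_cohort_churn(registrations, activity):
    """Share of users registered in period N-1 who show no activity in period N.

    `registrations` maps user_id -> registration date (assumed layout).
    `activity` is an iterable of (user_id, activity_date) pairs (assumed layout).
    Returns {(year, month) of period N: churn rate}.
    """
    active = {(user_id, month_index(day)) for user_id, day in activity}
    if not active:
        return {}
    last_period = max(period for _, period in active)

    # Group users into cohorts by registration period.
    cohorts = {}
    for user_id, day in registrations.items():
        cohorts.setdefault(month_index(day), []).append(user_id)

    report = {}
    for cohort_period, users in sorted(cohorts.items()):
        target = cohort_period + 1
        if target > last_period:
            continue  # newest registrations have no follow-up period yet
        churned = sum(1 for u in users if (u, target) not in active)
        year, month0 = divmod(target, 12)
        report[(year, month0 + 1)] = churned / len(users)
    return report


# Made-up sample data, for illustration only.
registrations = {
    "a": date(2023, 1, 10), "b": date(2023, 1, 20),
    "c": date(2023, 2, 5), "d": date(2023, 2, 11),
    "e": date(2023, 3, 3),
}
activity = [
    ("a", date(2023, 1, 10)), ("a", date(2023, 2, 14)),
    ("b", date(2023, 1, 20)),
    ("c", date(2023, 2, 5)), ("c", date(2023, 3, 9)),
    ("d", date(2023, 2, 11)), ("d", date(2023, 3, 1)),
    ("e", date(2023, 3, 3)),
]

report = rolling_cohort_churn(registrations, activity)
print(report)
```

To switch from months to another period (two months, a quarter), only `month_index` would need to change so that it maps dates to consecutive period numbers; the key conversion at the end would have to be adjusted to match.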

As for the churn time itself, there's no universal perfect value, since different services imply different levels of user engagement. Some imply decades of very little engagement, such as domain name registrar accounts. Others imply tons of engagement, like social media or online games.