Tags: sql, mysql, regression, analytics

Identify usage trend for customers


I am trying to answer questions using weekly usage data about which customers are increasing, reducing or sustaining their usage of our service. This is so we can identify which customers are likely to "churn" based on their usage.

I have a MySQL 5.7.12 table called usage_data.

Each row in this table comprises the following columns, which represent key metrics captured on a weekly basis:

  • company (A unique ID for the customer)
  • start (A timestamp, in milliseconds, for when the subscription started)
  • end (A timestamp, in milliseconds, for when the subscription ended)
  • weekNumber (The number of weeks, relative to the start of the subscription, that have passed)
  • weekEnding (The date of the Sunday for the week when the data was captured)
  • numberOfVisits (one of a handful of metrics that might be used, integers from 0)

So a company may have 1 to many hundreds of rows, with each row representing a different week during their subscription. And each week, the data capture runs and adds new rows for all the active companies.

What I cannot work out is the right way to query this dataset in order to identify:

  • Customers whose usage (e.g. numberOfVisits per weekNumber) is reducing
  • Customers whose usage is increasing
  • Customers whose usage is holding steady

I am wondering whether I need to perform a "post processing" step, where I calculate the regression for each company after the data has been collected each week.

The table is created using this statement:

CREATE TABLE `usage_data` (
  `company` bigint(20) NOT NULL,
  `sub` varchar(255) NOT NULL,
  `start` bigint(20) NOT NULL,
  `end` bigint(20) NOT NULL,
  `weekNumber` int(11) NOT NULL,
  `weekEnding` varchar(45) NOT NULL,
  `numberOfVisits` int(11) DEFAULT NULL,
  `searches` int(11) DEFAULT NULL,
  `uniqueContentNames` int(11) DEFAULT NULL,
  `uniqueContentPieces` int(11) DEFAULT NULL,
  `uniqueContentSubTypes` int(11) DEFAULT NULL,
  `uniqueContentTypes` int(11) DEFAULT NULL,
  `updated` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  KEY `idx_usage_data_company_sub_start_end_weekNumber` (`company`,`sub`,`start`,`end`,`weekNumber`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

A set of example data would be:

company,start,end,weekNumber,weekEnding,numberOfVisits,searches,uniqueContentNames,uniqueContentPieces,uniqueContentSubTypes,uniqueContentTypes,updated
1,1662595200000,1694131199000,43,2023-07-02,1,2,1,1,1,1,"2023-07-03 19:26:28"
2,1661990400000,1693526399000,44,2023-07-02,5,7,7,8,4,4,"2023-07-03 19:26:28"
3,1650758400000,1745452799000,62,2023-07-02,2,0,2,2,2,2,"2023-07-03 19:26:28"
4,1664841600000,1696377599000,39,2023-07-02,1,1,1,1,1,1,"2023-07-03 19:26:28"
5,1680739200000,1712361599000,13,2023-07-02,6,12,5,5,5,5,"2023-07-03 19:26:28"

I have tried different SQL queries, such as the following for "reducing usage", but each variation I come up with has some deficiency: either too many results, including customers whose usage is largely consistent, or too few, missing customers that had high usage metrics and then dropped to zero.

SELECT ud.company, us.name, us.level, us.category, us.country, us.start, us.end
FROM usage_data ud
JOIN usage_subscriptions us ON ud.company = us.company
WHERE ud.numberOfVisits > (SELECT AVG(i.numberOfVisits) * 0.5 FROM usage_data i WHERE i.company = ud.company)
GROUP BY ud.company
HAVING MAX(ud.numberOfVisits) > (SELECT AVG(i.numberOfVisits) * 0.5 FROM usage_data i WHERE i.company = ud.company)
AND MIN(ud.numberOfVisits) < (SELECT AVG(i.numberOfVisits) * 0.75 FROM usage_data i WHERE i.company = ud.company)

I looked at other StackOverflow questions and found references to calculating linear regression for a single entity, but not how to query a dataset containing multiple entities.


Solution

  • One option could be to track usage per customer by means of a moving average, which helps smooth out inconsistencies in the data.

    For instance, a customer could usually have a high number of visits per week, but the last week might show 0 visits due to an anomaly - e.g. the customer has no internet access, is on holiday, etc.

    Here is a hypothetical example below (the figures for visits per customer are made up for illustration purposes).

    SELECT company, weekEnding, numberOfVisits,
           AVG(numberOfVisits) OVER (PARTITION BY company
                                     ORDER BY weekEnding
                                     ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS avg
    FROM usage_data
    ORDER BY company, weekEnding;
    
     company | weekending | numberofvisits |          avg           
    ---------+------------+----------------+------------------------
           1 | 2023-07-02 |              1 | 1.00000000000000000000
           1 | 2023-07-09 |              0 | 0.50000000000000000000
           1 | 2023-07-16 |              0 | 0.33333333333333333333
           2 | 2023-07-09 |              7 |     7.0000000000000000
           2 | 2023-07-16 |              9 |     8.0000000000000000
           2 | 2023-07-23 |              5 |     7.0000000000000000
           3 | 2023-07-09 |              2 |     2.0000000000000000
           3 | 2023-07-16 |              4 |     3.0000000000000000
           3 | 2023-07-23 |              6 |     4.0000000000000000
    (9 rows)
    

    On the right of the number of visits column is a moving average for each customer, computed over the current week and up to the two preceding weeks (a three-week window).

    Specifically, this is achieved with the AVG window function: the OVER clause defines the window, and the PARTITION BY clause averages visits for each customer separately. (Note that window functions require MySQL 8.0 or later.)

    This will not only tell you which customers are increasing or reducing their usage as measured by number of visits; a progressively higher moving average is also a better indication of whether such usage is being "increased" or "sustained" over the longer term, since usage could have been abnormally high in the last week alone.

    In the above hypothetical example, we see that:

    • Average usage is decreasing for customer 1

    • Average usage is remaining more or less consistent for customer 2

    • Average usage is increasing for customer 3
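Since the window-function approach above requires MySQL 8.0 or later and the question targets 5.7.12, the same three-week moving average can be emulated with a self-join: join each row to the rows for the same company in the current and two preceding weeks, then aggregate. A minimal sketch, using Python's sqlite3 purely to make the snippet runnable with made-up figures; the SQL itself uses only joins and AVG, so it should also work on MySQL 5.7:

```python
import sqlite3

# Hypothetical data: (company, weekNumber, numberOfVisits) -- made-up figures
# mirroring the shape of the usage_data table from the question.
rows = [
    (1, 1, 1), (1, 2, 0), (1, 3, 0),
    (2, 1, 7), (2, 2, 9), (2, 3, 5),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE usage_data (company INT, weekNumber INT, numberOfVisits INT)")
con.executemany("INSERT INTO usage_data VALUES (?, ?, ?)", rows)

# Three-week moving average without window functions: each row a is joined
# to the same company's rows b in the current and two preceding weeks.
query = """
SELECT a.company, a.weekNumber, a.numberOfVisits,
       AVG(b.numberOfVisits) AS moving_avg
FROM usage_data a
JOIN usage_data b
  ON b.company = a.company
 AND b.weekNumber BETWEEN a.weekNumber - 2 AND a.weekNumber
GROUP BY a.company, a.weekNumber, a.numberOfVisits
ORDER BY a.company, a.weekNumber
"""
moving_avg = {(c, w): round(m, 2) for c, w, _v, m in con.execute(query)}
print(moving_avg)
# {(1, 1): 1.0, (1, 2): 0.5, (1, 3): 0.33, (2, 1): 7.0, (2, 2): 8.0, (2, 3): 7.0}
```

This mirrors the window-function output: company 1's average falls towards zero, while company 2's hovers around 7-8. Note that gaps in weekNumber shrink the window rather than skipping it, which may or may not be what you want.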
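As for the "post processing" regression idea from the question: a per-company least-squares slope of numberOfVisits against weekNumber needs no separate processing step and uses only plain aggregates, so it works on MySQL 5.7 and handles all companies in one query. A sketch below, again using sqlite3 only to make the snippet runnable; the ±0.1 cut-offs for labelling a trend are arbitrary assumptions that would need tuning against real data:

```python
import sqlite3

# Hypothetical data: (company, weekNumber, numberOfVisits).
rows = [
    (1, 41, 5), (1, 42, 3), (1, 43, 1),   # falling usage
    (2, 42, 5), (2, 43, 5), (2, 44, 5),   # steady usage
    (3, 60, 2), (3, 61, 4), (3, 62, 6),   # rising usage
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE usage_data (company INT, weekNumber INT, numberOfVisits INT)")
con.executemany("INSERT INTO usage_data VALUES (?, ?, ?)", rows)

# Per-company least-squares slope:
#   slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx*Sx)
# where x = weekNumber and y = numberOfVisits. The "* 1.0" forces
# floating-point division. The +/-0.1 thresholds are illustrative only.
query = """
SELECT company, slope,
       CASE WHEN slope >  0.1 THEN 'increasing'
            WHEN slope < -0.1 THEN 'reducing'
            ELSE 'sustained' END AS trend
FROM (
    SELECT company,
           (COUNT(*) * SUM(weekNumber * numberOfVisits)
            - SUM(weekNumber) * SUM(numberOfVisits)) * 1.0
           / (COUNT(*) * SUM(weekNumber * weekNumber)
              - SUM(weekNumber) * SUM(weekNumber)) AS slope
    FROM usage_data
    GROUP BY company
) t
ORDER BY company
"""
trends = {c: (round(s, 2), label) for c, s, label in con.execute(query)}
print(trends)
# {1: (-2.0, 'reducing'), 2: (0.0, 'sustained'), 3: (2.0, 'increasing')}
```

A negative slope means usage is declining, a slope near zero means it is sustained, and a positive slope means it is growing; restricting the query to, say, the last 8 weeks would make it more sensitive to recent churn-like behaviour.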