Cross timeline count across subscriptions for customers

I have a subscription table with 4 fields: id, customer_id, start_date and end_date. It lists all subscriptions of my customers. No subscription has an empty end_date, and a customer can have several subscriptions at once.

For example, a customer id 37 can have the following subscriptions:

id	customer_id	start_at	end_at
44	37	2019-03-21	2019-03-21
17819	37	2020-03-23	2020-03-23
22302	37	2020-04-24	2021-07-25
42213	37	2021-04-25	2023-04-26
92013	37	2023-04-26	2024-04-26

These records mean that customer 37 was a subscriber on 2019-03-21, then on 2020-03-23, then from 2020-04-24 to 2024-04-26, for a total of 1463 days.

I am trying to write a query to get the number of days each customer has been a subscriber in a given period. Customer 37 has been a subscriber 365 days in 2023. Subscriptions can overlap, as a subscriber can have multiple subscriptions at once.

The result of the query should be something like:

customer_id	total_subscription_days
37	1463
38	526
39	426
40	365
41	325

My database is running on MySQL 8.2.12.

I tried using lag, lead, CTEs, least and greatest, to no avail. I tried chatgpt and stackoverflow.

EDIT: here below is what I tried so far:

1st try:

SELECT 
    customer_id,
    SUM(DATEDIFF(
        LEAST(end_at, '2023-12-31'), 
        GREATEST(start_at, '2023-01-01')
    ) + 1) AS total_subscription_days
FROM (
    SELECT 
        customer_id,
        start_at,
        end_at
    FROM 
        subscription
    WHERE 
        start_at <= '2023-12-31' AND end_at >= '2023-01-01'
    UNION ALL
    SELECT 
        s1.customer_id,
        LEAD(s1.end_at) OVER (PARTITION BY s1.customer_id ORDER BY s1.end_at),
        '2023-12-31'
    FROM 
        subscription s1
    LEFT JOIN 
        subscription s2 ON s1.customer_id = s2.customer_id 
                       AND s1.end_at < s2.start_at
    WHERE 
        s2.start_at IS NOT NULL
) AS merged_subscriptions
GROUP BY 
    customer_id;

Even though I wanted to know subscription days in 2023, I had results bigger than 365 days. So it seems to count duplicates because of the join.

2nd try:

WITH subscription_periods AS (
    SELECT 
        customer_id,
        start_at,
        end_at,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY start_at) AS period_number
    FROM 
        subscription
    WHERE 
        start_at <= '2023-12-31' AND end_at >= '2023-01-01' AND customer_id < 100
),
subscription_days AS (
    SELECT 
        customer_id,
        SUM(
            DATEDIFF(
                LEAST(end_at, '2023-12-31'), 
                GREATEST(start_at, '2023-01-01')
            ) + 1
        ) AS days
    FROM 
        (
            SELECT 
                customer_id,
                start_at,
                LEAD(end_at) OVER (PARTITION BY customer_id ORDER BY start_at) AS end_at
            FROM 
                subscription_periods
        ) AS overlapping_periods
    WHERE 
        end_at >= '2023-01-01'
    GROUP BY 
        customer_id
)
SELECT 
    customer_id,
    SUM(days) AS total_subscription_days
FROM 
    subscription_days
GROUP BY 
    customer_id;

I restricted to the first 100 customers, otherwise I was getting a 504 error. This query seems to not be able to take into accounts gaps between subscription. For a customer having been subscribed from 2023-01-01 to 2023-04-01 and then from 2023-05-01 to 2023-08-01, it seems to be counting days between 2023-01-01 and 2023-08-01.

Solution

It is kind of a merge overlapping intervals problem:

Find all rows that intersect the given range e.g. 2023-01-01 to 2023-12-31
Process rows ordered by start date and assign group numbers as follows:
If there is a gap between previous group and current row then start a new group
Group the result by keys and group number, clamp the dates that are outside the range and calculate the difference between min and max date of each group

set @date1 = '2023-01-01';
set @date2 = '2023-12-31';

with cte1 as (
    select customer_id, start_at, end_at
    from t
    where start_at <= @date2 and end_at >= @date1
), cte2 as (
    select *, case when start_at <= max(end_at) over (partition by customer_id order by start_at rows between unbounded preceding and 1 preceding) then 0 else 1 end as newgrp
    from cte1
), cte3 as (
    select *, sum(newgrp) over (partition by customer_id order by start_at) as grpnum
    from cte2
)
select
    customer_id,
    greatest(min(start_at), @date1) as date1,
    least(max(end_at), @date2) as date2,
    datediff(
        least(max(end_at), @date2),
        greatest(min(start_at), @date1)
    ) + 1 as diff
from cte3
group by customer_id, grpnum

DB<>Fiddle - returns total for each island
DB<>Fiddle - returns total for all islands