Search code examples
sqlpostgresqlgaps-and-islands

Count of Sequential occurrences and their lengths (advanced gaps and islands problem)


I am trying to find out the count of sequential occurrences and their lengths for a certain user action. Let's say it is monthly logins. Below is a simplified structure of my table and my expected end result.

PostgreSQL Schema Setup:

  CREATE TABLE user_actions (
        action_date VARCHAR(255),
        user_id VARCHAR(255)
    );
INSERT INTO user_actions(action_date, user_id)
VALUES('2020-03', 'alex01'),
('2020-04', 'alex01'),
('2020-05', 'alex01'),
('2020-06', 'alex01'),
('2020-12', 'alex01'),
('2021-01', 'alex01'),
('2021-02', 'alex01'),
('2021-03', 'alex01'),
('2020-04', 'jon03'),
('2020-05', 'jon03'),
('2020-06', 'jon03'),
('2020-09', 'jon03'),
('2021-11', 'jon03'),
('2021-12', 'jon03'),
('2022-01', 'jon03'),
('2022-02', 'jon03'),
('2020-05', 'mark05'),
('2020-06', 'mark05'),
('2020-07', 'mark05'),
('2020-08', 'mark05'),
('2020-09', 'mark05')
  • alex01 have 2 streak with length of 4
  • jon03 have 3 streaks with length of 1, 3 and 4
  • mark05 have 1 streak with length of 5 And we would like to get the occurrence counts of longest streaks per user
Streak Length # of occurrences
5 1
4 2

I have tried to research if this type of a problem was asked before and found this question . Although the answer was quite helpful but not exactly what I need


Solution

  • You can follow these steps to go from your varchar dates to your final output:

    • format your dates with DATE datatype, to allow for dates comparison
    • compute when each date value is not consecutive for each "user_id", flag it with 1 and that will represent the beginning of a new partition
    • compute a running sum on your flag, to generate your partition values
    • retrieve the biggest count of dates within each partition
    • count how many counts you have found in the previous step

    Each of these steps is made within a subquery:

    WITH cte AS (
        SELECT user_id, 
               TO_DATE(action_date || '-01','YYYY-MM-DD') AS date_as_date
        FROM user_actions
    ), islands AS ( 
        SELECT *, 
               CASE WHEN LAG(date_as_date) OVER(PARTITION BY user_id ORDER BY date_as_date) + INTERVAL '1 month' <> date_as_date
                    THEN 1 ELSE 0 
               END AS new_partition
        FROM cte
    ), partitioned_islands AS (
        SELECT user_id, 
               SUM(new_partition) OVER(PARTITION BY user_id ORDER BY date_as_date) AS partitions
        FROM islands
    ), count_per_island AS (
        SELECT COUNT(*) AS num_consecutive_dates
        FROM partitioned_islands
        GROUP BY user_id, partitions
        ORDER BY ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY COUNT(*) DESC)
        FETCH FIRST 1 ROWS WITH TIES
    )
    SELECT num_consecutive_dates AS streakLength, 
           COUNT(*) AS numOfOccurrences  
    FROM count_per_island
    GROUP BY streakLength
    ORDER BY streakLength DESC
    

    Check the demo here.