Search code examples
sqlpostgresqlcountdistinctaggregation

Getting counts for overlapping time periods


I have a table data in PostgreSQL with this structure:

created_at.       customer_email               status
2020-12-31        xxx@gmail.com                opened
...
2020-12-24        yyy@gmail.com                delivered
2020-12-24        xxx@gmail.com                opened
...
2020-12-17        zzz@gmail.com                opened
2020-12-10        xxx@gmail.com                opened
2020-12-03        hhh@gmail.com                enqueued
2020-11-27        xxx@gmail.com                opened
...
2020-11-20        rrr@gmail.com                opened
2020-11-13        ttt@gmail.com                opened

There are many rows for each day.

Basically I need 2021-W01 for this week with the count of unique emails with status "opened" within the last 90 days. Likewise for every week before that.

Desired output:

period    active
2021-W01  1539
2020-W53  1480
2020-W52  1630
2020-W51  1820
2020-W50  1910
2020-W49  1890
2020-W48  2000

How can I do that?


Solution

  • Window functions would come to mind. Alas, those don't allow DISTINCT aggregations.

    Instead, get distinct counts from a LATERAL subquery:

    WITH weekly_dist AS (
       SELECT DISTINCT date_trunc('week', created_at) AS wk, customer_email
       FROM   tbl
       WHERE  status = 'opened'
       )
    SELECT to_char(t.wk, 'YYYY"-W"IW') AS period, ct.active
    FROM  (
       SELECT generate_series(date_trunc('week', min(created_at) + interval '1 week')
                            , date_trunc('week', now()::timestamp)
                            , interval '1 week') AS wk   
       FROM   tbl
       ) t
    LEFT   JOIN LATERAL (
       SELECT count(DISTINCT customer_email) AS active
       FROM   weekly_dist d
       WHERE  d.wk >= t.wk - interval '91 days'
       AND    d.wk <  t.wk
       ) ct ON true;
    

    db<>fiddle here

    I operate with timestamp, not timestamptz, might make a corner case difference.

    The CTE weekly_dist reduces the set to distinct "opened" emails. This step is strictly optional, but increases performance significantly if there can be more than a few duplicates per week.

    The derived table t generates a timestamp for the begin of each week since the earliest entry in the table up to "now". This way I make sure no week is skipped,even if there are no rows for it. See:

    But I do skip the first week since I count active emails before each start of the week.

    Then LEFT JOIN LATERAL to a subquery computing the distinct count for the 90-day time-range. To be precise, I deduct 91 days, and exclude the start of the current week. This happens to fall in line with the weekly pre-aggregated data from the CTE. Be wary of that if you shift bounds.

    Finally, to_char(t.wk, 'YYYY"-W"IW') is a compact expression to get your desired format for week numbers. Details in the manual here.