Search code examples
sqlcountamazon-redshiftdistinctwindow-functions

Count distinct customers over rolling window partition


My question is similar to redshift: count distinct customers over window partition but I have a rolling window partition.

My query looks like this but distinct within COUNT in Redshift is not supported

select p_date, seconds_read, 
count(distinct customer_id) over (order by p_date rows between unbounded preceding and current row) as total_cumulative_customer
from table_x

My goal is to calculate total unique customer up to every date (hence rolling window).

I tried using the dense_rank() approach but it would simply fail since I cannot use window function like this

select p_date, max(total_cumulative_customer) over ()
(select p_date, seconds_read, 
dense_rank() over (order by customer_id rows between unbounded preceding and current row) as total_cumulative_customer -- WILL FAIL HERE
from table_x

Any workaround or different approach would be helpful!

EDIT:

INPUT DATA sample

+------+----------+--------------+
| Cust |  p_date  | seconds_read |
+------+----------+--------------+
|    1 | 1-Jan-20 |           10 |
|    2 | 1-Jan-20 |           20 |
|    4 | 1-Jan-20 |           30 |
|    5 | 1-Jan-20 |           40 |
|    6 | 5-Jan-20 |           50 |
|    3 | 5-Jan-20 |           60 |
|    2 | 5-Jan-20 |           70 |
|    1 | 5-Jan-20 |           80 |
|    1 | 5-Jan-20 |           90 |
|    1 | 7-Jan-20 |          100 |
|    3 | 7-Jan-20 |          110 |
|    4 | 7-Jan-20 |          120 |
|    7 | 7-Jan-20 |          130 |
+------+----------+--------------+

Expected Output

+----------+--------------------------+------------------+--------------------------------------------+
|  p_date  | total_distinct_cum_cust  | sum_seconds_read |                  Comment                   |
+----------+--------------------------+------------------+--------------------------------------------+
| 1-Jan-20 |                        4 |              100 | total distinct cust = 4 i.e. 1,2,4,5       |
| 5-Jan-20 |                        6 |              450 | total distinct cust = 6 i.e. 1,2,3,4,5,6   |
| 7-Jan-20 |                        7 |              910 | total distinct cust = 6 i.e. 1,2,3,4,5,6,7 |
+----------+--------------------------+------------------+--------------------------------------------+

Solution

  • For this operation:

    select p_date, seconds_read, 
           count(distinct customer_id) over (order by p_date rows between unbounded preceding and current row) as total_cumulative_customer
    from table_x;
    

    You can do pretty much what you want with two levels of aggregation:

    select min_p_date,
           sum(count(*)) over (order by min_p_date rows between unbounded preceding and current row) as running_distinct_customers
    from (select customer_id, min(p_date) as min_p_date
          from table_x
          group by customer_id
         ) c
    group by min_p_date;
    

    Summing the seconds read as well is a bit tricky, but you can use the same idea:

    select p_date,
           sum(sum(seconds_read)) over (order by p_date rows between unbounded preceding and current row) as seconds_read,
           sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by p_date rows between unbounded preceding and current row) as running_distinct_customers
    from (select customer_id, p_date, seconds_read,
                 row_number() over (partition by customer_id order by p_date) as seqnum
          from table_x
         ) c
    group by min_p_date;