Search code examples
postgresqlcountdistinctwindow-functionscumulative-sum

Distinct cumulative count in PostgreSQL window function


I've got a table with many columns some of these are: product_id, territory_id, quarter_num (it's a number of a quarter from 1 to 28 for instance). There are some other columns but they aren't necessary in this query. I need to count number of distinct products in every territory in every cumulative quarter: only 1 first, 1+2 second, 1+2+3 third and so on until from 1 till 28. Before this query was realized in QlikSence with a loop. Now I need to rewrite it in PostgreSQL in one query (even in one CTE part of a long query) using standard SQL with no loops etc.

It would be simply something like this:

select *
    ,count(distinct product_id) 
        filter(where some_condition) 
        over(partition by territory_id order by quarter_num) 
    as cum_filtered_product_count
from some_table

If I had no distinct which not realized in window functions. I've broken my head, read and tried to use many advices here but still have found no correct solution. Any help will be appreciated.

PS The solution with two subquieris where the first one counts distinctly in a single quarter in a group and the second one sums the results of the first one in a window function cumulatively doesn't work. Because the latter subquery potentially sums the same products.


Solution

  • You can use intarray to emulate the missing count(distinct x)over() using sets: db<>fiddle

    select distinct quarter_num
      ,territory_id
      ,#uniq(sort(array_agg(product_id) 
                  filter(where product_id<>13) 
                  over(partition by territory_id 
                       order by quarter_num)))
        as cum_filtered_product_count
    from some_table
    order by 1,2;
    

    By turning the aggregated array into a set, you only keep distinct elements, and # tells you how many you got. You can also use -'{}' as a trick to quietly turn the array into a set, but while shorter, the operation still involves the uniq(sort()), plus the empty subtraction.

    If product_id isn't an int and you don't want to map it, you can lean on the fact that jsonb keys are also sets, so aggregating into a jsonb object will by nature only keep unique keys:

    select distinct quarter_num
      ,territory_id
      ,jsonb_array_length(jsonb_path_query_array(cum_filtered_product,'$.*'))
       as cum_filtered_product_count
    from (
    select *,jsonb_object_agg(product_id,0) 
               filter(where product_id<>13) 
               over(partition by territory_id order by quarter_num) 
               as cum_filtered_product
    from some_table)_
    order by 1,2;
    

    The two jsonb.. functions just extract and count the keys.

    You can take a closer look at what this does in the demo. Some_condition is product_id<>13. Note how it discards duplicates from both the same as well as earlier quarters for a given territory:

    t q products cumulative filtered distinct count_distinct
    1 1 {1,1,13} {1,1,13} {1,1} {1} 1
    1 2 {3,5,7} {1,1,3,5,7,13} {1,1,3,5,7} {1,3,5,7} 4
    1 3 {8,8,11} {1,1,3,5,7,8,8,11,13} {1,1,3,5,7,8,8,11} {1,3,5,7,8,11} 6
    2 1 {1,6,11} {1,6,11} {1,6,11} {1,6,11} 3
    2 2 {4,6,7} {1,4,6,6,7,11} {1,4,6,6,7,11} {1,4,6,7,11} 5
    2 3 {12,13,14} {1,4,6,6,7,11,12,13,14} {1,4,6,6,7,11,12,14} {1,4,6,7,11,12,14} 7
    3 1 {3,11,13} {3,11,13} {3,11} {3,11} 2
    3 2 {0,3,12} {0,3,3,11,12,13} {0,3,3,11,12} {0,3,11,12} 4
    3 3 {0,9,13} {0,0,3,3,9,11,12,13,13} {0,0,3,3,9,11,12} {0,3,9,11,12} 5