Search code examples
sqlimpalacumulative-sum

Impala - How to query cumulative distinct number of product sold over time?


Just started learning sql query in Impala and I still cannot figure out how to get the desired results.

So I have a table:

enter image description here

And I would like to find out how to get the number of distinct product sold over time (from 1 Dec to 3 Dec 2020).

So the 1st day we sold 2 types of product (product_id 1 and 2), the second day we sold the same type of product comparing day before, therefore the cumulative remains as 2 and the last day we sold a product that was not sold two days before therefore the cumulative will be 3 types of products sold on the 3rd day

enter image description here

Thanks in advance!


Solution

  • Simply aggregate to get the first time something is sold and then do a simple aggregation and cumulative sum:

    select min_sold_date, sum(count(*)) over (order by min_sold_date)
    from (select product_id, min(sold_date) as min_sold_date
          from t
          where sold_date >= ? and sold_date <= ?
          group by product_id
         ) t
    group by min_sold_date;
    

    For your sample data, this will not include Dec 2nd. I am guessing that is not a problem in your real data. If it is, you can use a left join to bring in all the dates.