Search code examples
countsnowflake-cloud-data-platform

Snowflake - COUNT (DISTINCT ...) window function with ORDER BY


The goal is to find an equivalent of DISTINCT inside window COUNT for a sliding/cumulative window. According to COUNT:

When this function is called as a window function with an OVER clause that contains an ORDER BY clause:

  • Using the keyword DISTINCT inside the window function is prohibited and results in a compile-time error.

Sample:

SELECT *, COUNT(DISTINCT val) OVER(ORDER BY id)
FROM VALUES (0, NULL),(1,10),(2,20),(3,20),(4,30),(5, 20),(6, 10) AS s(id, val)
ORDER BY id;

Error: distinct cannot be used with a window frame or an order.


Expected output:

SELECT *,
  COUNT(DISTINCT val) OVER(ORDER BY id) AS CNT_CUMULATIVE,
  COUNT(DISTINCT val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) 
     AS CNT_SLIDING
FROM VALUES (0, NULL),(1,10),(2,20),(3,20),(4,30),(5, 20),(6, 10) AS s(id, val)
ORDER BY id;

+----+-----+----------------+-------------+
| ID | VAL | CNT_CUMULATIVE | CNT_SLIDING |
+----+-----+----------------+-------------+
|  0 |     |              0 |           0 |
|  1 |  10 |              1 |           1 |
|  2 |  20 |              2 |           2 |
|  3 |  20 |              2 |           2 |
|  4 |  30 |              3 |           3 |
|  5 |  20 |              3 |           2 |
|  6 |  10 |              3 |           3 |
+----+-----+----------------+-------------+

Solution

  • The idea is to utilize ARRAY_AGG to build an array and ARRAY_SIZE(ARRAY_DISTINCT...) to compute its size:

    SELECT *,
      ARRAY_AGG(val) OVER(ORDER BY id) AS AGG_HELPER1,
      ARRAY_SIZE(ARRAY_DISTINCT(ARRAY_AGG(val) OVER(ORDER BY id))) AS CNT_CUMULATIVE,
    
      ARRAY_AGG(val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS AGG_HELPER2,
      ARRAY_SIZE(ARRAY_DISTINCT(ARRAY_AGG(val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW))) 
           AS CNT_SLIDING,
    FROM VALUES (0, NULL),(1,10),(2,20),(3,20),(4,30),(5, 20), (6, 10) AS s(id, val)
    ORDER BY id;
    

    Output:

    enter image description here

    Effectively:

    COUNT(DISTINCT <col>) OVER(ORDER BY ...) <~> ARRAY_SIZE(ARRAY_DISTINCT(ARRAY_AGG(<col>) OVER(ORDER BY ...)))

    Side note: "The maximum amount of data that ARRAY_AGG can return for a single call is 16MB."


    Bonus: This pattern can be extended for SUM(DISTINCT... )OVER(ORDER BY ...) too

    SELECT *,
      SUM(DISTINCT val) OVER(ORDER BY id) AS SUM_CUMULATIVE,
      SUM(DISTINCT val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS SUM_SLIDING,
    FROM VALUES (0, NULL),(1,10),(2,20),(3,20),(4,30),(5, 20), (6, 10) AS s(id, val)
    ORDER BY id;
    -- Error: distinct cannot be used with a window frame or an order. 
    

    Becomes:

    SELECT *,
      ARRAY_AGG(val) OVER(ORDER BY id) AS AGG_HELPER1,
      REDUCE(ARRAY_DISTINCT(ARRAY_AGG(val) OVER(ORDER BY id)),
               0, (acc, x) -> acc + x) AS SUM_CUMULATIVE,
        
      ARRAY_AGG(val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS AGG_HELPER2,
      REDUCE(ARRAY_DISTINCT(ARRAY_AGG(val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)),
               0, (acc, x) -> acc + x) AS SUM_SLIDING
    FROM VALUES (0, NULL),(1,10),(2,20),(3,20),(4,30),(5, 20), (6, 10) AS s(id, val)
    ORDER BY id;
    

    Output:

    enter image description here

    Effectively:

    SUM(DISTINCT <col>) OVER(ORDER BY ...) <~> REDUCE(ARRAY_DISTINCT(ARRAY_AGG(<col>) OVER(ORDER BY ...)), 0, (acc, val) -> acc + val)