The goal is to find an equivalent of DISTINCT
inside window COUNT
for a sliding/cumulative window. According to COUNT:
When this function is called as a window function with an OVER clause that contains an ORDER BY clause:
- Using the keyword DISTINCT inside the window function is prohibited and results in a compile-time error.
Sample:
SELECT *, COUNT(DISTINCT val) OVER(ORDER BY id)
FROM VALUES (0, NULL),(1,10),(2,20),(3,20),(4,30),(5, 20),(6, 10) AS s(id, val)
ORDER BY id;
Error: distinct cannot be used with a window frame or an order.
Expected output:
SELECT *,
COUNT(DISTINCT val) OVER(ORDER BY id) AS CNT_CUMULATIVE,
COUNT(DISTINCT val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
AS CNT_SLIDING
FROM VALUES (0, NULL),(1,10),(2,20),(3,20),(4,30),(5, 20),(6, 10) AS s(id, val)
ORDER BY id;
+----+-----+----------------+-------------+
| ID | VAL | CNT_CUMULATIVE | CNT_SLIDING |
+----+-----+----------------+-------------+
| 0 | | 0 | 0 |
| 1 | 10 | 1 | 1 |
| 2 | 20 | 2 | 2 |
| 3 | 20 | 2 | 2 |
| 4 | 30 | 3 | 3 |
| 5 | 20 | 3 | 2 |
| 6 | 10 | 3 | 3 |
+----+-----+----------------+-------------+
The idea is to utilize ARRAY_AGG
to build an array and ARRAY_SIZE(ARRAY_DISTINCT...)
to compute its size:
SELECT *,
ARRAY_AGG(val) OVER(ORDER BY id) AS AGG_HELPER1,
ARRAY_SIZE(ARRAY_DISTINCT(ARRAY_AGG(val) OVER(ORDER BY id))) AS CNT_CUMULATIVE,
ARRAY_AGG(val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS AGG_HELPER2,
ARRAY_SIZE(ARRAY_DISTINCT(ARRAY_AGG(val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)))
AS CNT_SLIDING,
FROM VALUES (0, NULL),(1,10),(2,20),(3,20),(4,30),(5, 20), (6, 10) AS s(id, val)
ORDER BY id;
Output:
Effectively:
COUNT(DISTINCT <col>) OVER(ORDER BY ...)
<~> ARRAY_SIZE(ARRAY_DISTINCT(ARRAY_AGG(<col>) OVER(ORDER BY ...)))
Side note: "The maximum amount of data that ARRAY_AGG can return for a single call is 16MB."
Bonus: This pattern can be extended for SUM(DISTINCT... )OVER(ORDER BY ...)
too
SELECT *,
SUM(DISTINCT val) OVER(ORDER BY id) AS SUM_CUMULATIVE,
SUM(DISTINCT val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS SUM_SLIDING,
FROM VALUES (0, NULL),(1,10),(2,20),(3,20),(4,30),(5, 20), (6, 10) AS s(id, val)
ORDER BY id;
-- Error: distinct cannot be used with a window frame or an order.
Becomes:
SELECT *,
ARRAY_AGG(val) OVER(ORDER BY id) AS AGG_HELPER1,
REDUCE(ARRAY_DISTINCT(ARRAY_AGG(val) OVER(ORDER BY id)),
0, (acc, x) -> acc + x) AS SUM_CUMULATIVE,
ARRAY_AGG(val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS AGG_HELPER2,
REDUCE(ARRAY_DISTINCT(ARRAY_AGG(val) OVER(ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)),
0, (acc, x) -> acc + x) AS SUM_SLIDING
FROM VALUES (0, NULL),(1,10),(2,20),(3,20),(4,30),(5, 20), (6, 10) AS s(id, val)
ORDER BY id;
Output:
Effectively:
SUM(DISTINCT <col>) OVER(ORDER BY ...)
<~> REDUCE(ARRAY_DISTINCT(ARRAY_AGG(<col>) OVER(ORDER BY ...)), 0, (acc, val) -> acc + val)