Search code examples
apache-flinkflink-sql

How GROUP BY and OVER WINDOW differ in FlinkSQL?


One can use windows in Flink in two different manners

SELECT key, MAX(value)
  FROM table
 GROUP BY key, TUMBLE(ts, INTERVAL '5' MINUTE)

and

SELECT key, MAX(value) OVER w
  FROM table
 WINDOW w AS (PARTITION BY key ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

I wonder if those two mechanisms can express the same thing or what's the main difference and possible use cases for both?


Solution

  • Both queries compute different results that correspond to the semantics of regular SQL. So the difference is not Flink specific but defined by the SQL Standard.


    The first query

    SELECT key, MAX(value)
      FROM table
     GROUP BY key, TUMBLE(ts, INTERVAL '5' MINUTE)
    

    groups records per key and 5 minute buckets. Every 5 minutes the query produces one row per key value with the maximum value. For each group, multiple rows are aggregated to a single row.


    The second query

    SELECT key, MAX(value) OVER w
      FROM table
     WINDOW w AS (PARTITION BY key ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    

    produces one row for every row of the input (table). A result row has the maximum value for the key value that was observed so far (rows are ordered by ts). Note that multiple rows are not aggregated to a single row; each input row results in one output row. Moreover, the range of the maximum aggregation can be more than 5 minutes. In fact it's the whole key partition in this example.