Search code examples
hivewindowhqlclause

Hive HQL - optimizing repetitive WINDOW clause


I have following HQL

SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
       min(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) minTime,
       max(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) maxTime 
FROM t21_pam6

How can I define the 3 same WINDOW clauses into one?

The documentation (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics ) shows this example

SELECT a, SUM(b) OVER w
FROM T;
WINDOW w AS (PARTITION BY c ORDER BY d ROWS UNBOUNDED PRECEDING)

But I don't think it's working. It's not possible to define WINDOW w as... is not a HQL command.


Solution

  • This type of optimization is something that the compiler would need to do. I don't think there is a way to ensure this programmatically.

    That said, the calculation for the minimum time is totally unnecessary. Because of the order by, it should be the time in the current row. Similarly, if you can handle null values, then the expression can be simplified to:

    SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
           event.time as minTime,
           lead(event.time, 2) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time) as maxTime 
    FROM t21_pam6;
    

    Note that the maxtime calculation is slightly different because it will return NULL for the last two values matching the conditions.