Search code examples
sqlwindow-functionsimpalatrino

How to add conditional partition in a window function with trino or impala?


For example I have a dataset like the following:

time action
03:00:00 block
04:00:00 unblock
05:00:00 block
06:00:00 unblock
07:00:00 unblock
08:00:00 block

Now for each row, I want to get the last time when the column action equals to "block" before the time of current row. For example, for the fifth row whose time equals to "07:00:00" and action equals to "unblock", the last time before it when action equals to "block" should be the third row, and the expected time is "05:00:00".

My final expected result would be:

time action last_time
03:00:00 block 03:00:00
04:00:00 unblock 03:00:00
05:00:00 block 05:00:00
06:00:00 unblock 05:00:00
07:00:00 unblock 05:00:00
08:00:00 block 08:00:00

How can I get the above result by using a window function without joining by itself?

(p.s. if the above result cannot be reached, the following output is also okay:

time action last_time
03:00:00 block NULL
04:00:00 unblock 03:00:00
05:00:00 block 03:00:00
06:00:00 unblock 05:00:00
07:00:00 unblock 05:00:00
08:00:00 block 05:00:00

Solution

  • Once we determine that action = 'block' should begin a new block and we identify those blocks (block_no), we can then use a window function partitioned by that block_no to find the minimum time within each block.

    If the times aren't monotonically increasing, we can use the FIRST_VALUE window function instead if we have another way to order, or just use another case expression to grab the time only when action = 'block', which would leave the other rows with null, which is easily ignored via MAX/MIN/etc.

    Given the current data, however, I don't think we can work around that assumption that time needs to be monotonically increasing or at least increasing from block to block for all rows between each block.

    Try this:

    The fiddle

    WITH cte1 AS (
             SELECT *, SUM(CASE WHEN action = 'block' THEN 1 END) OVER (ORDER BY time) AS block_no FROM test
         )
    SELECT *, MIN(time) OVER (PARTITION BY block_no) AS block_time FROM cte1
     ORDER BY time
    ;
    

    The result:

    time action block_no block_time
    03:00:00 block 1 03:00:00
    04:00:00 unblock 1 03:00:00
    05:00:00 block 2 05:00:00
    06:00:00 unblock 2 05:00:00
    07:00:00 unblock 2 05:00:00
    08:00:00 block 3 08:00:00

    The setup:

    CREATE TABLE test (time varchar(20), action  varchar(20));
    
    INSERT INTO test VALUES
      ('03:00:00', 'block')
    , ('04:00:00', 'unblock')
    , ('05:00:00', 'block')
    , ('06:00:00', 'unblock')
    , ('07:00:00', 'unblock')
    , ('08:00:00', 'block')
    ;