Search code examples
clickhouse

When is lagInFrame behavior different from standard SQL LAG?


The clickhouse documentation states:

lagInFrame behavior differs from the standard SQL lag window function. Clickhouse window function lagInFrame respects the window frame.

What is this window frame, and how does it affect the output?

Example: I want to find rows in a time series where the difference in the time column value for two consecutive rows is greater than a given threshold.

I want to compare each row with the previous row.

The following leads me to believe that the query below is the correct approach.

To get behavior identical to the lag, use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

SELECT lag, startdate, diff
FROM  (SELECT startdate,
              lagInFrame(startdate)
                OVER(
                  ORDER BY startdate ASC ROWS BETWEEN UBOUNDED PRECEDING AND
                  UNBOUNDED FOLLOWING) AS lag,
              Date_diff('minute', lag, startdate) AS diff
       FROM   <table>
       ORDER  BY startdate ASC)
WHERE  diff > 15 

However, the following query gives me the exact same result and uses less memory.

SELECT lag, startdate, diff
FROM  (SELECT startdate,
              lagInFrame(startdate)
                OVER(
                  ORDER BY startdate ASC ROWS BETWEEN 1 PRECEDING AND
                  CURRENT ROW) AS lag,
              Date_diff('minute', lag, startdate) AS diff
       FROM   <table>
       ORDER  BY startdate ASC)
WHERE  diff > 15 

The difference is:
OVER(ORDER BY startdate ASC ROWS BETWEEN UBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
vs
OVER(ORDER BY startdate ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)

Are there cases where the output of the queries will differ?


Solution

  • I wrote this documentation in such way to reduce the number of false bug reports about the leadInFrame:

    People just don't understand how window function's frame works in case of OVER (ORDER BY somecol) https://clickhouse.com/docs/en/sql-reference/window-functions#syntax

    That the frame is bounded by the beginning of a partition and the current row. And it's an equalent of ORDER BY order ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

    And the next row is not visible for leadInFrame.

    It's more convenient for the usual people to use BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, because it's more compatible with lag/lead and works in all cases with any offset.


    Are there cases where the output of the queries will differ?

    For example if you use another offset: lagInFrame(startdate,2), then you need to use BETWEEN 2 PRECEDING AND CURRENT ROW https://fiddle.clickhouse.com/1ce6547a-109a-4552-9f1b-b9c03f972988


    BTW, you can use any() over ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING and will get the same result: https://fiddle.clickhouse.com/cfa1df1e-2571-4167-9fc1-0f3b1c8f86ac