Search code examples
sqldb2lag

DB2 SQL: fastest way to get lagged value of many columns


There are many ways to get a lagged value of a certain column in SQL, e.g.:

WITH CTE AS (
  SELECT
    rownum = ROW_NUMBER() OVER (ORDER BY columns_to_order_by),
    value
  FROM table
)
SELECT
  curr.value - prev.value
FROM CTE cur
INNER JOIN CTE prev on prev.rownum = cur.rownum - 1

,or:

select variable_of_interest 
               ,lag(variable_of_interest ,1) 
                    over(partition by
                    some_group order by variable_1,...,variable_n) 
                    as lag_variable_of_interest
from DATA

I use the second version, but my code runs very slow when "lagging" many variables, such that my code becomes:

select        variable_of_interest_1
              ,variable_of_interest_2
              ,variable_of_interest_3
                   ,lag(variable_of_interest_1 ,1) 
                        over(partition by
                        some_group order by variable_1,...,variable_n) 
                        as lag_variable_of_interest_1
                    ,lag(variable_of_interest_2 ,1) 
                        over(partition by
                        some_group order by variable_1,...,variable_n) 
                        as lag_variable_of_interest_2
                   ,lag(variable_of_interest_3 ,1) 
                        over(partition by
                        some_group order by variable_1,...,variable_n) 
                        as lag_variable_of_interest_3
    from DATA

I wonder, is this because each lag function must by its own partition and order the whole data set, even though the are using the same partition and order?


Solution

  • I am not 100% sure about how DB2 optimizes such queries. If it executes each lag independently, then there is definitely room to improve the optimizer.

    One method you could use is lag() with a join on the primary key:

    select t.*, tprev.*
    from (select t.*, lag(id) over ( . . . ) as prev_id
          from t
         ) t left join
         t tprev
         on t.id = tprev.prev_id ;
    

    From what you describe, this might be the most efficient method to do what you want.

    This should be more efficient than row_number() because the join can make use of an index.