Search code examples
sqldategoogle-bigquerywindow-functions

FIRST and LAST VALUE bringing conflicting row quantities in the same query. WHY?


I have a SQL query with the objective to extract the values of the row with the last date for each different value that exists in column_2.

I used the window function on all the columns that might have a variation in values, trying to ensure that I got the most recent ones. Column_3 and column_4 are always the same for each element in column_2.

DECLARE _timestamp TIMESTAMP;
DECLARE _timestamp_start TIMESTAMP;

SET _timestamp = TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL 1 MINUTE);
SET _timestamp_start = TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL 24 HOUR);


SELECT DISTINCT
    FIRST_VALUE(day) OVER (PARTITION BY column_2 ORDER BY day DESC) as day_last_value,
    column_2,
    column_3,
    column_4,
    FIRST_VALUE(xxx) OVER (PARTITION BY column_2 ORDER BY day DESC) as xxx_last_value,
    FIRST_VALUE(yyy) OVER (PARTITION BY column_2 ORDER BY day DESC) as yyy_last_value,
    FIRST_VALUE(zzz) OVER (PARTITION BY column_2 ORDER BY day DESC) as zzz_last_value,
  FROM 
    table1
  WHERE 
    day BETWEEN DATE(_timestamp_start) AND DATE(_timestamp)

What I'm finding very strange is that I had to set up the query with FIRST_VALUE and DESC order to get to the line with the last date.

That's because by changing to LAST_VALUE and ASC order it's resulting twice as many lines... It is not retrieving the results for only one day for each element in column_2, as it should.

My expectation was that the number of rows would be the same.


Solution

  • Using window function and distinct is a bad practice. What you really want to do is grouping by column_2 and aggregate others using this column. If column_3 and 4 are always same for column_2, you can aggregate using min / max or any_value.

    To get first and last value, you can use MIN_BY and MAX_BY functions. You can find more details about this function here

    SELECT
        column_2,
        MAX(day) as day_last_value,
        ANY_VALUE(column_3) as column_3,
        ANY_VALUE(column_4) as column_4,
        MAX_BY(xxx, day) as xxx_last_value,
        MAX_BY(yyy, day) as yyy_last_value,
        MAX_BY(zzz, day) as zzz_last_value,
    FROM 
        table1
    WHERE 
        day BETWEEN DATE(_timestamp_start) AND DATE(_timestamp)