I have a SQL query with the objective to extract the values of the row with the last date for each different value that exists in column_2.
I used the window function on all the columns that might have a variation in values, trying to ensure that I got the most recent ones. Column_3 and column_4 are always the same for each element in column_2.
DECLARE _timestamp TIMESTAMP;
DECLARE _timestamp_start TIMESTAMP;
SET _timestamp = TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL 1 MINUTE);
SET _timestamp_start = TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL 24 HOUR);
SELECT DISTINCT
FIRST_VALUE(day) OVER (PARTITION BY column_2 ORDER BY day DESC) as day_last_value,
column_2,
column_3,
column_4,
FIRST_VALUE(xxx) OVER (PARTITION BY column_2 ORDER BY day DESC) as xxx_last_value,
FIRST_VALUE(yyy) OVER (PARTITION BY column_2 ORDER BY day DESC) as yyy_last_value,
FIRST_VALUE(zzz) OVER (PARTITION BY column_2 ORDER BY day DESC) as zzz_last_value,
FROM
table1
WHERE
day BETWEEN DATE(_timestamp_start) AND DATE(_timestamp)
What I'm finding very strange is that I had to set up the query with FIRST_VALUE and DESC order to get to the line with the last date.
That's because by changing to LAST_VALUE and ASC order it's resulting twice as many lines... It is not retrieving the results for only one day for each element in column_2, as it should.
My expectation was that the number of rows would be the same.
Using window function and distinct is a bad practice. What you really want to do is grouping by column_2 and aggregate others using this column. If column_3 and 4 are always same for column_2, you can aggregate using min / max or any_value.
To get first and last value, you can use MIN_BY and MAX_BY functions. You can find more details about this function here
SELECT
column_2,
MAX(day) as day_last_value,
ANY_VALUE(column_3) as column_3,
ANY_VALUE(column_4) as column_4,
MAX_BY(xxx, day) as xxx_last_value,
MAX_BY(yyy, day) as yyy_last_value,
MAX_BY(zzz, day) as zzz_last_value,
FROM
table1
WHERE
day BETWEEN DATE(_timestamp_start) AND DATE(_timestamp)