Search code examples
sqlgoogle-bigquerytimestamp

Create table based on most recent timestamp and product id


I have a table that has the following columns:

Timestamp Product ID Column A Column B etc...
2024-12-03 id_123 AAA BBB CCC
2024-12-03 id_234 AAA BBB CCC
2024-12-02 id_123 AAA BBB CCC
2024-12-02 id_234 AAA BBB CCC

I want to build it so that it only returns the most recent occurrences for each product ID. So for the example above, it would only return:

Timestamp Product ID Column A Column B etc...
2024-12-03 id_123 AAA BBB CCC
2024-12-03 id_234 AAA BBB CCC

I've tried using MAX(Timestamp) and DISTINCT, but can't seem to get those to work.

Does anyone have any suggestions?

Thank you!

I tried using MAX(Timestamp) but it didn't seem to actually limit it to the most recent date.

DISTINCT seems to only work by comparing the values of the entire row to another row.


Solution

  • What you're looking for here is a window/analytic function.

    Here's an example that should work:

        SELECT
            timestamp,
            product_id,
            column_a,
            column_b,
          FROM
            your_dataset.your_table
       QUALIFY
            ROW_NUMBER() OVER(latest_product_record) = 1
        WINDOW
            latest_product_record AS (
              PARTITION BY product_id
              ORDER BY timestamp DESC
            )
    

    If you aren't familiar with this syntax:

    • OVER contains the conditions for defining the window over which you will calculate some quantity
    • in this case we use the ROW_NUMBER function to return a number for each row based on its position within its associated window
    • QUALIFY is an equivalent of WHERE that takes a window clause as its condition.
    • WINDOW allows you to declare the definition of a window (which goes in the OVER clause; a convenient way to keep the query tidy)

    See the official documentation for more detail.

    In words, the query does the following:

    • for each row in the original table, find all rows with the same Product ID
    • order these rows by timestamp, in descending order (first row is the last timestamp)
    • label each row according to its position in that order (1 = row with the last timestamp for that product ID)
    • now filter the output of the query so that only rows where ROW_NUMBER returns 1 are included (i.e. only rows with the last timestamp for that product ID)