Search code examples
sqlgoogle-bigquerywindow-functions

How to create select the most recent row value with a specific parameter value, when there can be repeat events in BigQuery


For each event, I want to get the reference from the most recent timestamp of the start event for that specific user_id.

I am using BigQuery, with a starting table table below.

There will be multiple events for each user most of the time.

order_no event timestamp reference user_id
null start 1 #1232 1
null view 2 #1235 1
null view 3 #1236 1
null view 4 #1237 1
null basket 5 #1238 1
123 purchase 6 #1239 1
null view 7 #1278 1
null start 8 #1280 1
null basket 8 #1280 1
125 purchase 9 #1292 1
null start 10 #1380 2
null basket 11 #1380 2
126 purchase 12 #1392 2

I would like a resulting table like this, appending a new column with the previous reference from the users most recent start event:

order_no event timestamp reference user_id previous_ref
null start 1 #1232 1 null
null basket 5 #1238 1 #1232
123 purchase 6 #1239 1 #1232
null start 8 #1280 1 null
null basket 8 #1282 1 #1280
125 purchase 9 #1292 1 #1280
null start 10 #1380 2 null
null basket 11 #1385 2 #1380
126 purchase 12 #1392 2 #1380

I have tried to use a subquery to select the data, but this was rejected by Bigquery as self referencing error. This seems to be the biggest challenge as a subquery seemed the most natural solution.

I have also tried using a lag function:

SELECT *, 
LAG(case when event = 'start' then reference end)
OVER (partition by user_id order by timestamp) AS previous_ref
 from table

But using the lag function will return null if the Start event does not directly proceed the next event. If there is any prior start event, this value should be used. Is there a way to make the query iterate until case function condition met?

I have read similar questions, but they often don't have repeating events and can use the max value at all times, this doesn't work for my table.

Ideally I would like to create a function that is usable in a materialized view within BigQuery, but making the query work is a priority.


Solution

  • Use below approach (BigQuery Standard SQL)

    SELECT * EXCEPT(grp), 
      IF(event = 'start', NULL, FIRST_VALUE(reference) OVER win) AS previous_ref
    FROM (
      SELECT *, COUNTIF(event = 'start') OVER win AS grp
      FROM your_table
      WHERE event != 'view'
      WINDOW win AS (PARTITION BY user_id ORDER BY timestamp, IF(event = 'start', 1, 2))
    )
    WINDOW win AS (PARTITION BY user_id, grp ORDER BY timestamp, IF(event = 'start', 1, 2))
    ORDER BY timestamp, IF(event = 'start', 1, 2)
    

    If applied to sample data in your question - output is

    enter image description here