For each event, I want to get the reference from the most recent timestamp of the start event for that specific user_id.
I am using BigQuery, with a starting table table below.
There will be multiple events for each user most of the time.
order_no | event | timestamp | reference | user_id |
---|---|---|---|---|
null | start | 1 | #1232 | 1 |
null | view | 2 | #1235 | 1 |
null | view | 3 | #1236 | 1 |
null | view | 4 | #1237 | 1 |
null | basket | 5 | #1238 | 1 |
123 | purchase | 6 | #1239 | 1 |
null | view | 7 | #1278 | 1 |
null | start | 8 | #1280 | 1 |
null | basket | 8 | #1280 | 1 |
125 | purchase | 9 | #1292 | 1 |
null | start | 10 | #1380 | 2 |
null | basket | 11 | #1380 | 2 |
126 | purchase | 12 | #1392 | 2 |
I would like a resulting table like this, appending a new column with the previous reference from the users most recent start event:
order_no | event | timestamp | reference | user_id | previous_ref |
---|---|---|---|---|---|
null | start | 1 | #1232 | 1 | null |
null | basket | 5 | #1238 | 1 | #1232 |
123 | purchase | 6 | #1239 | 1 | #1232 |
null | start | 8 | #1280 | 1 | null |
null | basket | 8 | #1282 | 1 | #1280 |
125 | purchase | 9 | #1292 | 1 | #1280 |
null | start | 10 | #1380 | 2 | null |
null | basket | 11 | #1385 | 2 | #1380 |
126 | purchase | 12 | #1392 | 2 | #1380 |
I have tried to use a subquery to select the data, but this was rejected by Bigquery as self referencing error. This seems to be the biggest challenge as a subquery seemed the most natural solution.
I have also tried using a lag function:
SELECT *,
LAG(case when event = 'start' then reference end)
OVER (partition by user_id order by timestamp) AS previous_ref
from table
But using the lag function will return null if the Start event does not directly proceed the next event. If there is any prior start event, this value should be used. Is there a way to make the query iterate until case function condition met?
I have read similar questions, but they often don't have repeating events and can use the max value at all times, this doesn't work for my table.
Ideally I would like to create a function that is usable in a materialized view within BigQuery, but making the query work is a priority.
Use below approach (BigQuery Standard SQL)
SELECT * EXCEPT(grp),
IF(event = 'start', NULL, FIRST_VALUE(reference) OVER win) AS previous_ref
FROM (
SELECT *, COUNTIF(event = 'start') OVER win AS grp
FROM your_table
WHERE event != 'view'
WINDOW win AS (PARTITION BY user_id ORDER BY timestamp, IF(event = 'start', 1, 2))
)
WINDOW win AS (PARTITION BY user_id, grp ORDER BY timestamp, IF(event = 'start', 1, 2))
ORDER BY timestamp, IF(event = 'start', 1, 2)
If applied to sample data in your question - output is