I want to pick latest 2
records based on the timestamp column, let's say to perform analysis.
In reality i have huge dataset where i want to pick n
records for each of my user_n.
If i could hardcode the value, it would have been okay.
SELECT *
FROM
`project-id`.huge-dataset-id.streaming-data`
WHERE timestamp_column BETWEEN '2015-06-10 14:20' AND '2015-06-10 14:30'
But it has to be formulated dynamic.I am not so strong in SQL and BigQyery.Please help.
I need Latest N
Records
Tried - (Working Solution), Need more optimised Solution.
SELECT *
FROM (
SELECT *
FROM `project-id`.`dataset`.`streaming_data` s1
WHERE (
SELECT COUNT(*)
FROM `project-id`.`dataset`.`streaming_data` s2
WHERE s1.user_n = s2.user_n
AND s1.timestamp_col <= s2.timestamp_col
) <= 2
)
Since, there are too many nested query, the operation is really slow. Can someone provide an optimised solution.
Data Table as streaming_data
-------------------------------
| user_n | timestamp_column |
|-----------------------------|
| ABC | 10-Jun-12 14.30 |
| DEF | 10-Jun-12 14.30 |
| ABC | 10-Jun-12 14.20 |
| DEF | 10-Jun-12 14.20 |
| ABC | 10-Jun-12 14.10 |
| DEF | 10-Jun-12 14.10 |
| ABC | 10-Jun-12 14.00 |
| DEF | 10-Jun-12 14.00 |
Expected Output : Latest 2 Records for all user, based on the latest timestamp
-------------------------------
| user_n | timestamp_column |
|-----------------------------|
| ABC | 10-Jun-12 14.30 |
| DEF | 10-Jun-12 14.30 |
| ABC | 10-Jun-12 14.20 |
| DEF | 10-Jun-12 14.20 |
Use row_number()
. For instance, to get the most recent records:
SELECT sd.* EXCEPT (seqnum)
FROM (SELECT sd.*,
ROW_NUMBER() OVER (PARTITION BY user_n ORDER BY timestamp_column DESC) as seqnum
FROM `project-id`.huge-dataset-id.streaming-data` sd
WHERE timestamp_column BETWEEN '2015-06-10 14:20' AND '2015-06-10 14:30'
) sd
WHERE seqnum <= 2;
If you want two random rows, use ORDER BY rand()
instead.
The timestamp range is included because that is in the question. However, to get the two most recent rows, you can remove it.