I need some help writing/optimizing a query that retrieves the latest version of each row by type and performs some calculations depending on the type. I think it would be best if I illustrate it with an example.
Given the following dataset:
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| id | event_type | event_timestamp | message_id | sent_at | status | rate |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
| 1 | create | 2016-11-25 09:17:48 | 1 | 2016-11-25 09:17:48 | 0 | 0.500000 |
| 2 | status_update | 2016-11-25 09:24:38 | 1 | 2016-11-25 09:28:49 | 1 | 0.500000 |
| 3 | create | 2016-11-25 09:47:48 | 2 | 2016-11-25 09:47:48 | 0 | 0.500000 |
| 4 | status_update | 2016-11-25 09:54:38 | 2 | 2016-11-25 09:48:49 | 1 | 0.500000 |
| 5 | rate_update | 2016-11-25 09:55:07 | 2 | 2016-11-25 09:50:07 | 0 | 1.000000 |
| 6 | create | 2016-11-26 09:17:48 | 3 | 2016-11-26 09:17:48 | 0 | 0.500000 |
| 7 | create | 2016-11-27 09:17:48 | 4 | 2016-11-27 09:17:48 | 0 | 0.500000 |
| 8 | rate_update | 2016-11-27 09:55:07 | 4 | 2016-11-27 09:50:07 | 0 | 2.000000 |
| 9 | rate_update | 2016-11-27 09:55:07 | 2 | 2016-11-25 09:55:07 | 0 | 2.000000 |
+-------+-------------------+---------------------+-------------+---------------------+--------+----------+
The expected result should be:
+------------+--------------------+--------------------+-----------------------+
| sent_at | sum(submitted_msg) | sum(delivered_msg) | sum(rate_total) |
+------------+--------------------+--------------------+-----------------------+
| 2016-11-25 | 2 | 2 | 2.500000 |
| 2016-11-26 | 1 | 0 | 0.500000 |
| 2016-11-27 | 1 | 0 | 2.000000 |
+------------+--------------------+--------------------+-----------------------+
At the end of the post is the query that is used to obtain this result. I'm willing to bet that there should be a way to optimize it, since it's using subqueries with joins, and from what I have read about BigQuery, joins should best be avoided. But first some background:
In essence, the dataset represents an append-only table to which multiple events are written. The size of the data is in the hundreds of millions and will grow to billions+. Since updates in BigQuery are not practical, and the data is being streamed to BQ, I need a way to retrieve the most recent of each event, perform some calculations based on certain conditions, and return an accurate result. The query is generated dynamically, based on user input, so more fields/calculations can be included, but they have been omitted for simplicity.
Each message has one create event, but n of any other kind.
Events other than create may not carry the rest of the information of the original, or that information may not be accurate (except for message_id and the field that the event is operating on). The dataset is simplified, but imagine there are many more columns, and more events will be added later.
rate_update may or may not have the status field set, or it may hold a value that is not the final one, so no calculation can be made on the status field from a rate_update event, and the same goes for status_update.
So I guess I have a couple of questions:
create
in their own tables, where the only fields available will be the ones relevant for the events, and needed for the joins (message_id, event_timestamp)? Will this reduce the amount of data processed?
Actually, any advice on how to query this dataset efficiently and friendly is more than welcome! Thank you! :)
The monstrosity I've come up with is the following. The INNER JOINs are used to retrieve the latest version of each row, as per this resource:
SELECT
  sent_at AS sent_at,
  SUM(submitted_msg) AS submitted,
  SUM(delivered_msg) AS delivered,
  SUM(sales_rate_total) AS sales_rate_total
FROM (
  #DELIVERED
  SELECT
    d.message_id,
    FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
    0 AS submitted_msg,
    SUM(IF(status=1,1,0)) AS delivered_msg,
    0 AS sales_rate_total
  FROM `events` d
  INNER JOIN (
    SELECT message_id, MAX(event_timestamp) AS ts
    FROM `events`
    WHERE event_type = "status_update"
    GROUP BY 1
  ) g ON d.message_id = g.message_id AND d.event_timestamp = g.ts
  GROUP BY 1, 2
  UNION ALL
  #SALES RATE
  SELECT
    s.message_id,
    FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
    0 AS submitted_msg,
    0 AS delivered_msg,
    SUM(rate) AS sales_rate_total
  FROM `events` s
  INNER JOIN (
    SELECT message_id, MAX(event_timestamp) AS ts
    FROM `events`
    WHERE event_type IN ("rate_update", "create")
    GROUP BY 1
  ) f ON s.message_id = f.message_id AND s.event_timestamp = f.ts
  GROUP BY 1, 2
  UNION ALL
  #SUBMITTED & REST
  SELECT
    r.message_id,
    FORMAT_TIMESTAMP('%Y-%m-%d 00:00:00', sent_at) AS sent_at,
    SUM(IF(status=0,1,0)) AS submitted_msg,
    0 AS delivered_msg,
    0 AS sales_rate_total
  FROM `events` r
  INNER JOIN (
    SELECT message_id, MAX(event_timestamp) AS ts
    FROM `events`
    WHERE event_type = "create"
    GROUP BY 1
  ) e ON r.message_id = e.message_id AND r.event_timestamp = e.ts
  GROUP BY 1, 2
) k
GROUP BY 1
How can this query be optimized?
Try the version below:
#standardSQL
WITH types AS (
  SELECT
    FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at,
    message_id,
    FIRST_VALUE(status) OVER(
      PARTITION BY message_id
      ORDER BY (event_type = "create") DESC, event_timestamp DESC
    ) AS submitted_status,
    FIRST_VALUE(status) OVER(
      PARTITION BY message_id
      ORDER BY (event_type = "status_update") DESC, event_timestamp DESC
    ) AS delivered_status,
    FIRST_VALUE(rate) OVER(
      PARTITION BY message_id
      ORDER BY (event_type IN ("rate_update", "create")) DESC, event_timestamp DESC
    ) AS sales_rate
  FROM events
), latest AS (
  SELECT
    sent_at,
    message_id,
    ANY_VALUE(IF(submitted_status=0,1,0)) AS submitted,
    ANY_VALUE(IF(delivered_status=1,1,0)) AS delivered,
    ANY_VALUE(sales_rate) AS sales_rate
  FROM types
  GROUP BY 1, 2
)
SELECT
  sent_at,
  SUM(submitted) AS submitted,
  SUM(delivered) AS delivered,
  SUM(sales_rate) AS sales_rate_total
FROM latest
GROUP BY 1
It's compact enough to manage easily, has no redundancy, and uses no joins at all.
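To see why the ORDER BY inside each FIRST_VALUE picks the right row, here is a minimal sketch on three events of a single message (the values are taken from message_id 2 in the question). The row_position column is only added for illustration and is not part of the query above.

```sql
#standardSQL
WITH events AS (
  SELECT 'create' AS event_type, TIMESTAMP '2016-11-25 09:47:48' AS event_timestamp, 0 AS status UNION ALL
  SELECT 'status_update', TIMESTAMP '2016-11-25 09:54:38', 1 UNION ALL
  SELECT 'rate_update', TIMESTAMP '2016-11-27 09:55:07', 0
)
SELECT
  event_type,
  event_timestamp,
  status,
  -- The boolean sorts TRUE before FALSE under DESC, so status_update rows
  -- lead the window; event_timestamp DESC then orders rows within each group.
  ROW_NUMBER() OVER (
    ORDER BY (event_type = 'status_update') DESC, event_timestamp DESC
  ) AS row_position,
  -- Every row sees the status of the row at row_position 1.
  FIRST_VALUE(status) OVER (
    ORDER BY (event_type = 'status_update') DESC, event_timestamp DESC
  ) AS delivered_status
FROM events
ORDER BY row_position
```

Here the status_update row should come first even though the rate_update event is more recent. Keep in mind that when a message has no status_update at all, the window falls back to the latest event of any type, so delivered_status is then simply that event's status.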
If your table is partitioned, you can easily take advantage of that by adjusting the query in just one place.
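As a rough sketch of that one-place change, assuming (hypothetically) that the table is partitioned by day on sent_at: a range filter on sent_at in the innermost SELECT is the only addition. The inline events CTE just stands in for the real table so the snippet runs on its own, and only the rate column is carried through to keep it short. For an ingestion-time partitioned table the filter would go on the _PARTITIONTIME pseudo-column instead.

```sql
#standardSQL
WITH events AS (
  -- Stand-in for the real (assumed partitioned) table.
  SELECT 'create' AS event_type, TIMESTAMP '2016-11-25 09:17:48' AS event_timestamp, 1 AS message_id,
         TIMESTAMP '2016-11-25 09:17:48' AS sent_at, 0.5 AS rate UNION ALL
  SELECT 'create', TIMESTAMP '2016-11-26 09:17:48', 3, TIMESTAMP '2016-11-26 09:17:48', 0.5 UNION ALL
  SELECT 'create', TIMESTAMP '2016-11-27 09:17:48', 4, TIMESTAMP '2016-11-27 09:17:48', 0.5 UNION ALL
  SELECT 'rate_update', TIMESTAMP '2016-11-27 09:55:07', 4, TIMESTAMP '2016-11-27 09:50:07', 2.0
), types AS (
  SELECT
    FORMAT_TIMESTAMP('%Y-%m-%d', sent_at) AS sent_at,
    message_id,
    FIRST_VALUE(rate) OVER (
      PARTITION BY message_id
      ORDER BY (event_type IN ('rate_update', 'create')) DESC, event_timestamp DESC
    ) AS sales_rate
  FROM events
  -- The only change: restrict the scan to the days of interest.
  WHERE sent_at >= TIMESTAMP '2016-11-26 00:00:00'
    AND sent_at < TIMESTAMP '2016-11-28 00:00:00'
)
SELECT sent_at, message_id, ANY_VALUE(sales_rate) AS sales_rate
FROM types
GROUP BY 1, 2
```

One caveat with any such filter: it has to keep every event of a message, otherwise the "latest" values are computed from an incomplete history.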
You can use the dummy data below if you want to check the query above on a low volume first:
WITH events AS (
SELECT 1 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:17:48' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 2 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:24:38' AS event_timestamp, 1 AS message_id, TIMESTAMP '2016-11-25 09:28:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
SELECT 3 AS id, 'create' AS event_type, TIMESTAMP '2016-11-25 09:47:48' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:47:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 4 AS id, 'status_update' AS event_type, TIMESTAMP '2016-11-25 09:54:38' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:48:49' AS sent_at, 1 AS status, 0.500000 AS rate UNION ALL
SELECT 5 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-25 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:50:07' AS sent_at, 0 AS status, 1.000000 AS rate UNION ALL
SELECT 6 AS id, 'create' AS event_type, TIMESTAMP '2016-11-26 09:17:48' AS event_timestamp, 3 AS message_id, TIMESTAMP '2016-11-26 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 7 AS id, 'create' AS event_type, TIMESTAMP '2016-11-27 09:17:48' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:17:48' AS sent_at, 0 AS status, 0.500000 AS rate UNION ALL
SELECT 8 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 4 AS message_id, TIMESTAMP '2016-11-27 09:50:07' AS sent_at, 0 AS status, 2.000000 AS rate UNION ALL
SELECT 9 AS id, 'rate_update' AS event_type, TIMESTAMP '2016-11-27 09:55:07' AS event_timestamp, 2 AS message_id, TIMESTAMP '2016-11-25 09:55:07' AS sent_at, 0 AS status, 2.000000 AS rate
)
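Another way to get the latest value per message, without window functions, is a conditional ARRAY_AGG with ORDER BY ... LIMIT 1. This is a different technique from the FIRST_VALUE query above and is shown only as a sketch: it assumes each message_id has exactly one create event, and it takes the day from that create row. The inline events CTE is a small subset of the question's data so the snippet runs on its own.

```sql
#standardSQL
WITH events AS (
  SELECT 'create' AS event_type, TIMESTAMP '2016-11-25 09:47:48' AS event_timestamp, 2 AS message_id,
         TIMESTAMP '2016-11-25 09:47:48' AS sent_at, 0 AS status, 0.5 AS rate UNION ALL
  SELECT 'status_update', TIMESTAMP '2016-11-25 09:54:38', 2, TIMESTAMP '2016-11-25 09:48:49', 1, 0.5 UNION ALL
  SELECT 'rate_update', TIMESTAMP '2016-11-27 09:55:07', 2, TIMESTAMP '2016-11-25 09:55:07', 0, 2.0 UNION ALL
  SELECT 'create', TIMESTAMP '2016-11-26 09:17:48', 3, TIMESTAMP '2016-11-26 09:17:48', 0, 0.5
), latest AS (
  SELECT
    message_id,
    -- Day of the create event (assumes one create per message_id).
    MAX(IF(event_type = 'create', FORMAT_TIMESTAMP('%Y-%m-%d', sent_at), NULL)) AS sent_at,
    -- Latest status among create events only.
    ARRAY_AGG(IF(event_type = 'create', status, NULL) IGNORE NULLS
              ORDER BY event_timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] AS submitted_status,
    -- Latest status among status_update events only; NULL if there are none.
    ARRAY_AGG(IF(event_type = 'status_update', status, NULL) IGNORE NULLS
              ORDER BY event_timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] AS delivered_status,
    -- Latest rate among rate_update and create events.
    ARRAY_AGG(IF(event_type IN ('rate_update', 'create'), rate, NULL) IGNORE NULLS
              ORDER BY event_timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] AS sales_rate
  FROM events
  GROUP BY message_id
)
SELECT
  sent_at,
  COUNTIF(submitted_status = 0) AS submitted,
  COUNTIF(delivered_status = 1) AS delivered,
  SUM(sales_rate) AS sales_rate_total
FROM latest
GROUP BY sent_at
ORDER BY sent_at
```

One behavioural difference to be aware of: when a message has no status_update at all, delivered_status is NULL here rather than falling back to the latest event of another type, so such a message is never counted as delivered.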