I have a table with a structure like this
CREATE TABLE activities (
id LONG,
timestamp TIMESTAMP,
sourceId LONG,
sourceParentId LONG,
conversationId LONG,
deletedAt TIMESTAMP,
activity_type SYMBOL CAPACITY 256 CACHE
) TIMESTAMP(timestamp) PARTITION BY DAY WAL;
I have some parent activities (they have a sourceId but not a sourceParentId) and some child activities.
I want to get the children activities which meet a few conditions, together with a column from the parent row. My current working query is this:
WITH
activities_to_check_for_parentId AS (
SELECT *
FROM activities child
WHERE deletedAt IS NULL
AND sourceParentId IS NOT NULL
AND conversationId IS NULL
)
SELECT
parent.id,
child.*
FROM activities parent
JOIN activities_to_check_for_parentId child
ON parent.sourceId = child.sourceParentId
WHERE parent.sourceParentId is NULL
This works and gives me what I need, but it is taking over 20 seconds in a 100 million rows dataset, using an EC2 instance with 16 CPUs and maxed out gp3 volume. I am looking for any suggestions to make the query faster, other than filtering by timestamp.
This type of queries can vary a lot depending on the data distribution, but very likely we can improve the query by using in this case a window function rather than a join. We can simulate some data to populate the table using row generators.
INSERT BATCH 50000000 INTO activities
SELECT
x AS id,
timestamp_sequence(
to_timestamp('2024-01-01', 'yyyy-MM-dd'),
100000L -- 100 milliseconds per row for incremental timestamps
) AS timestamp,
x AS sourceId, -- Unique `sourceId` for up to 1 billion rows
CASE WHEN x > 100000 THEN x - rnd_int(1, 99999, 0) ELSE NULL END AS sourceParentId, -- Ensures parent exists before child
CASE WHEN rnd_double() > 0.5 THEN rnd_long(1, 1000, 0) ELSE NULL END AS conversationId, -- Randomized with a 50% chance
CASE WHEN rnd_double() > 0.9 THEN timestamp_sequence(to_timestamp('2025-01-01', 'yyyy-MM-dd'), 86400000000000) ELSE NULL END AS deletedAt, -- 10% chance for deletion
rnd_symbol('message', 'comment', 'reaction', 'update', 'task') AS activity_type -- Random activity type
FROM long_sequence(1000000000, 123456789, 987654321) x;
In this case I am generating a dataset with 1,000,000,000 rows, and with 100K different parentIds. Note this is pseudo random, so real life distribution might change quite a bit.
After a few minutes the table is created and ready to be queried.
WITH act AS (
select timestamp, id, sourceId,
coalesce(sourceParentId, sourceId) as sourceParentId,
conversationId, deletedAt, activity_type
FROM activities
WHERE sourceParentId is NULL OR
( deletedAt IS NULL
AND sourceParentId IS NOT NULL
AND conversationId IS NULL
)
)
select first_value(id) over(partition by sourceParentId order by timestamp),
* from act
ORDER BY timestamp ASC;
What I am doing here is to use a window function to PARTITION rows by parentId. I need to include the parent in the group (so I can get the extra column we want), so I am doing coalesce(sourceParentId, sourceId) as sourceParentId
. Now I can already use the first_value
window function to get the id
column from the first row in each group, which should be the parent (as timestamp will always be earlier than any children).
On a 16 cpu EC2 instance this is taking ~5.8 seconds in the cold run, and 800ms in the hot run after data has been lifted from disk in subsequent queries.