Speed up nested table query

I have a table with a structure like this

CREATE TABLE activities (
    id LONG,
    timestamp TIMESTAMP,
    sourceId LONG,
    sourceParentId LONG,
    conversationId LONG,
    deletedAt TIMESTAMP,
    activity_type SYMBOL CAPACITY 256 CACHE
) TIMESTAMP(timestamp) PARTITION BY DAY WAL;

I have some parent activities (they have a sourceId but not a sourceParentId) and some child activities.

I want to get the children activities which meet a few conditions, together with a column from the parent row. My current working query is this:

WITH
  activities_to_check_for_parentId AS (
    SELECT *
    FROM activities child
    WHERE deletedAt IS NULL 
    AND sourceParentId IS NOT NULL
    AND conversationId IS NULL 
  )
SELECT 
  parent.id,
  child.*
FROM activities parent
JOIN activities_to_check_for_parentId child 
ON parent.sourceId = child.sourceParentId 
WHERE parent.sourceParentId is NULL

This works and gives me what I need, but it is taking over 20 seconds in a 100 million rows dataset, using an EC2 instance with 16 CPUs and maxed out gp3 volume. I am looking for any suggestions to make the query faster, other than filtering by timestamp.

Solution

This type of queries can vary a lot depending on the data distribution, but very likely we can improve the query by using in this case a window function rather than a join. We can simulate some data to populate the table using row generators.

INSERT BATCH 50000000 INTO activities
SELECT
    x AS id,
    timestamp_sequence(
        to_timestamp('2024-01-01', 'yyyy-MM-dd'),
        100000L  -- 100 milliseconds per row for incremental timestamps
    ) AS timestamp,
    x AS sourceId,  -- Unique `sourceId` for up to 1 billion rows
    CASE WHEN x > 100000 THEN x - rnd_int(1, 99999, 0) ELSE NULL END AS sourceParentId, -- Ensures parent exists before child
    CASE WHEN rnd_double() > 0.5 THEN rnd_long(1, 1000, 0) ELSE NULL END AS conversationId, -- Randomized with a 50% chance
    CASE WHEN rnd_double() > 0.9 THEN timestamp_sequence(to_timestamp('2025-01-01', 'yyyy-MM-dd'), 86400000000000) ELSE NULL END AS deletedAt, -- 10% chance for deletion
    rnd_symbol('message', 'comment', 'reaction', 'update', 'task') AS activity_type -- Random activity type
FROM long_sequence(1000000000, 123456789, 987654321) x;

In this case I am generating a dataset with 1,000,000,000 rows, and with 100K different parentIds. Note this is pseudo random, so real life distribution might change quite a bit.

After a few minutes the table is created and ready to be queried.


WITH act AS (
select timestamp, id, sourceId, 
coalesce(sourceParentId, sourceId) as sourceParentId, 
conversationId, deletedAt, activity_type
FROM activities
WHERE sourceParentId is NULL OR
(  deletedAt IS NULL 
    AND sourceParentId IS NOT NULL
    AND conversationId IS NULL
) 
)
select first_value(id) over(partition by sourceParentId order by timestamp), 
* from act
ORDER BY timestamp ASC;

What I am doing here is to use a window function to PARTITION rows by parentId. I need to include the parent in the group (so I can get the extra column we want), so I am doing coalesce(sourceParentId, sourceId) as sourceParentId. Now I can already use the first_value window function to get the id column from the first row in each group, which should be the parent (as timestamp will always be earlier than any children).

On a 16 cpu EC2 instance this is taking ~5.8 seconds in the cold run, and 800ms in the hot run after data has been lifted from disk in subsequent queries.