Search code examples
databasetime-seriesquestdb

Speed up nested table query


I have a table with a structure like this

CREATE TABLE activities (
    id LONG,
    timestamp TIMESTAMP,
    sourceId LONG,
    sourceParentId LONG,
    conversationId LONG,
    deletedAt TIMESTAMP,
    activity_type SYMBOL CAPACITY 256 CACHE
) TIMESTAMP(timestamp) PARTITION BY DAY WAL;

I have some parent activities (they have a sourceId but not a sourceParentId) and some child activities.

I want to get the children activities which meet a few conditions, together with a column from the parent row. My current working query is this:

WITH
  activities_to_check_for_parentId AS (
    SELECT *
    FROM activities child
    WHERE deletedAt IS NULL 
    AND sourceParentId IS NOT NULL
    AND conversationId IS NULL 
  )
SELECT 
  parent.id,
  child.*
FROM activities parent
JOIN activities_to_check_for_parentId child 
ON parent.sourceId = child.sourceParentId 
WHERE parent.sourceParentId is NULL 

This works and gives me what I need, but it is taking over 20 seconds in a 100 million rows dataset, using an EC2 instance with 16 CPUs and maxed out gp3 volume. I am looking for any suggestions to make the query faster, other than filtering by timestamp.


Solution

  • This type of queries can vary a lot depending on the data distribution, but very likely we can improve the query by using in this case a window function rather than a join. We can simulate some data to populate the table using row generators.

    INSERT BATCH 50000000 INTO activities
    SELECT
        x AS id,
        timestamp_sequence(
            to_timestamp('2024-01-01', 'yyyy-MM-dd'),
            100000L  -- 100 milliseconds per row for incremental timestamps
        ) AS timestamp,
        x AS sourceId,  -- Unique `sourceId` for up to 1 billion rows
        CASE WHEN x > 100000 THEN x - rnd_int(1, 99999, 0) ELSE NULL END AS sourceParentId, -- Ensures parent exists before child
        CASE WHEN rnd_double() > 0.5 THEN rnd_long(1, 1000, 0) ELSE NULL END AS conversationId, -- Randomized with a 50% chance
        CASE WHEN rnd_double() > 0.9 THEN timestamp_sequence(to_timestamp('2025-01-01', 'yyyy-MM-dd'), 86400000000000) ELSE NULL END AS deletedAt, -- 10% chance for deletion
        rnd_symbol('message', 'comment', 'reaction', 'update', 'task') AS activity_type -- Random activity type
    FROM long_sequence(1000000000, 123456789, 987654321) x;
    

    In this case I am generating a dataset with 1,000,000,000 rows, and with 100K different parentIds. Note this is pseudo random, so real life distribution might change quite a bit.

    After a few minutes the table is created and ready to be queried.

    
    WITH act AS (
    select timestamp, id, sourceId, 
    coalesce(sourceParentId, sourceId) as sourceParentId, 
    conversationId, deletedAt, activity_type
    FROM activities
    WHERE sourceParentId is NULL OR
    (  deletedAt IS NULL 
        AND sourceParentId IS NOT NULL
        AND conversationId IS NULL
    ) 
    )
    select first_value(id) over(partition by sourceParentId order by timestamp), 
    * from act
    ORDER BY timestamp ASC;
    

    What I am doing here is to use a window function to PARTITION rows by parentId. I need to include the parent in the group (so I can get the extra column we want), so I am doing coalesce(sourceParentId, sourceId) as sourceParentId. Now I can already use the first_value window function to get the id column from the first row in each group, which should be the parent (as timestamp will always be earlier than any children).

    On a 16 cpu EC2 instance this is taking ~5.8 seconds in the cold run, and 800ms in the hot run after data has been lifted from disk in subsequent queries.