sql postgresql query-optimization

SQL to group time intervals by arbitrary time period


I need help with this SQL query. I have a big table with the following schema:

  • time_start (timestamp) - start time of the measurement,
  • duration (double) - duration of the measurement in seconds,
  • count_event1 (int) - number of measured events of type 1,
  • count_event2 (int) - number of measured events of type 2

I am guaranteed that no rows will overlap. In SQL terms, there are no two rows such that time_start1 < time_start2 AND time_start1 + duration1 > time_start2.

I would like to design an efficient SQL query which would group the measurements by some arbitrary time period (I call it the group_period), for instance 3 hours. I have already tried something like this:

SELECT
    ROUND(time_start/group_period,0) AS time_period,
    SUM(count_event1) AS sum_event1,
    SUM(count_event2) AS sum_event2 
FROM measurements
GROUP BY time_period;

However, there is a problem. If a measurement has a duration greater than the group_period, I would expect it to be grouped into every time period it overlaps, but since the duration is never taken into account, it is grouped only into the first one. Is there a way to fix this?

Performance matters to me because I expect the table to grow considerably over time, reaching millions and possibly tens or hundreds of millions of rows. Do you have any suggestions for indexes or other optimizations to speed up this query?
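
The requirement that a long measurement be grouped into every period it overlaps can be written as an interval-overlap join against a table of period starts. The following is only a minimal sketch of that condition, not a tuned PostgreSQL query: it runs in SQLite through Python's sqlite3 module so that it is self-contained, it stores time_start as epoch seconds instead of a timestamp, and the periods table and the sample rows are assumptions made up for illustration.

```python
import sqlite3

GROUP_PERIOD = 3 * 3600  # hypothetical group_period: 3 hours, in seconds

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measurements (
        time_start   INTEGER PRIMARY KEY,  -- epoch seconds (simplification)
        duration     REAL,                 -- seconds
        count_event1 INTEGER,
        count_event2 INTEGER
    );
    -- Hypothetical helper table: one row per period start.
    CREATE TABLE periods (period_start INTEGER PRIMARY KEY);
""")

# Made-up data: one short measurement and one spanning three periods.
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [(600, 60.0, 1, 2), (9000, 18000.0, 10, 20)],
)
conn.executemany(
    "INSERT INTO periods VALUES (?)",
    [(i * GROUP_PERIOD,) for i in range(4)],
)

# Two half-open intervals [a, b) and [c, d) overlap when a < d AND b > c.
rows = conn.execute(
    """
    SELECT p.period_start,
           COUNT(m.time_start)              AS count_measurements,
           COALESCE(SUM(m.count_event1), 0) AS sum_event1,
           COALESCE(SUM(m.count_event2), 0) AS sum_event2
    FROM periods AS p
    LEFT JOIN measurements AS m
           ON m.time_start < p.period_start + :gp
          AND m.time_start + m.duration > p.period_start
    GROUP BY p.period_start
    ORDER BY p.period_start
    """,
    {"gp": GROUP_PERIOD},
).fetchall()

for row in rows:
    print(row)
```

Note that with this condition the full counts of a long measurement are added to every period it touches, which matches the behaviour asked for above but means the per-period sums no longer add up to the table total.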


Solution

  • Based on Timekiller's advice, I have come up with the following query:

    -- Plain SQL in PostgreSQL has no variable declarations, so the names
    -- below are placeholders for the arguments required by the script.
    
    -- First some configuration:
    --   group_period = 3600   -- group by 1 hour (= 3600 seconds)
    --   min_time = 1440226301 -- Sat, 22 Aug 2015 06:51:41 GMT
    --   max_time = 1450926301 -- Thu, 24 Dec 2015 03:05:01 GMT
    
    -- Calculate the number of started periods in the given interval in advance.
    --   period_count = CEIL((max_time - min_time) / group_period)
    
    SET TIME ZONE UTC;
    BEGIN TRANSACTION;
    
    -- Create a temporary table and fill it with all time periods.
    CREATE TEMP TABLE periods (period_start TIMESTAMP)
        ON COMMIT DROP;
    INSERT INTO periods (period_start)
        SELECT to_timestamp(min_time + group_period * coefficient)
        FROM generate_series(0, period_count) as coefficient;
    
    -- Group data by the time periods.
    -- Note that we don't require exact overlap of intervals:
    --   A. [period_start, period_start + group_period)
    --   B. [time_start, time_start + duration)
    -- That would yield the best possible result but it would also slow
    -- down the query significantly because of part B.
    -- We require only: period_start <= time_start < period_start + group_period
    -- The interval is half-open so that a measurement starting exactly on a
    -- period boundary is counted once, not twice (BETWEEN is inclusive).
    SELECT
        period_start,
        COUNT(measurements.*) AS count_measurements,
        SUM(count_event1) AS sum_event1,
        SUM(count_event2) AS sum_event2
    FROM periods
    LEFT JOIN measurements
    -- group_period is in seconds; a TIMESTAMP needs an INTERVAL added to it.
    ON time_start >= period_start
        AND time_start < period_start + group_period * INTERVAL '1 second'
    GROUP BY period_start;
    
    COMMIT TRANSACTION;
    

    It does exactly what I was going for, so mission accomplished. However, I would still appreciate feedback on the performance of this query under the following conditions:

    • I expect the measurements table to have about 500-800 million rows.
    • The time_start column is the primary key and has a unique B-tree index on it.
    • I have no guarantees about min_time and max_time. I only know that group period will be chosen so that 500 <= period_count <= 2000.
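
Because the query assigns each row by time_start alone, the period a row falls into can also be computed arithmetically instead of being looked up through a join. The sketch below mirrors that logic in plain Python to make the period arithmetic concrete; the function name group_by_period and the sample rows are hypothetical, and times are epoch seconds. Two details of the script are worth checking against it: in PostgreSQL, `/` between two integers truncates, so CEIL around an integer division has no effect unless one operand is cast to numeric, and generate_series(0, period_count) includes both endpoints, so it produces period_count + 1 rows.

```python
def group_by_period(measurements, min_time, max_time, group_period):
    """Sum counts per period, assigning each row by its time_start only.

    measurements: iterable of (time_start, count_event1, count_event2),
    with time_start in epoch seconds. Returns one tuple per period:
    (period_start, count_measurements, sum_event1, sum_event2).
    """
    span = max_time - min_time
    # Integer ceiling division: number of periods started in [min_time, max_time).
    period_count = -(-span // group_period)

    # One accumulator per period, so empty periods are reported as zeros
    # (the SQL query would report NULL sums for them instead).
    totals = [[0, 0, 0] for _ in range(period_count)]

    for time_start, count_event1, count_event2 in measurements:
        if not (min_time <= time_start < max_time):
            continue  # outside the requested range
        # Half-open periods: a row exactly on a boundary opens the next period.
        index = (time_start - min_time) // group_period
        totals[index][0] += 1
        totals[index][1] += count_event1
        totals[index][2] += count_event2

    return [
        (min_time + i * group_period, *total)
        for i, total in enumerate(totals)
    ]


# Made-up rows around the min_time used in the script above.
MIN_TIME = 1440226301
sample = [
    (MIN_TIME, 1, 2),
    (MIN_TIME + 3599, 3, 4),  # last second of the first period
    (MIN_TIME + 3600, 5, 6),  # first second of the second period
]
result = group_by_period(sample, MIN_TIME, MIN_TIME + 10700, 3600)
for row in result:
    print(row)
```

Whether an equivalent arithmetic GROUP BY in PostgreSQL would outperform the join on a table of this size is something to measure with EXPLAIN ANALYZE; the sketch only shows that the two formulations select the same period for each row.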