Search code examples
sqlmysqlquery-optimizationrecursive-querydate-arithmetic

Round timestamps to the nearest half hour without omitting missing data


Situation:

Imagine some script that, upon completion, inserts the current timestamp and some other data into a MySQL table. It's executed every thirty minutes, sometimes not at all, leaving gaps in the data.

Goal:

Have a query that fetches all data with the timestamp rounded to the closest half hour and empty rows (all fields except for the timestamp should be null) when there's no data.

Restrictions:

Neither the table structure, the data itself nor the script can be altered.

Problem:

The only solution I could come up with that yields the desired result doesn't scale. Currently the actual table counts about 50'000 rows and it already takes more than fifteen minutes to complete the query.

Example:

CREATE TABLE IF NOT EXISTS `statuses` (
    `timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    `status` INT NOT NULL DEFAULT '0',
    PRIMARY KEY (`timestamp`)
);
INSERT
    IGNORE INTO `statuses` (`timestamp`, `status`)
VALUES
    ('2023-01-01 00:03:34', '164850'),
    ('2023-01-01 00:31:23', '794088'),
    ('2023-01-01 03:31:28', '686754'),
    ('2023-01-01 04:01:15', '684711'),
    ('2023-01-01 05:31:35', '116777'),
    ('2023-01-01 06:01:52', '469332'),
    ('2023-01-01 06:31:55', '816300'),
    ('2023-01-01 08:33:53', '309583'),
    ('2023-01-01 09:03:54', '847976'),
    ('2023-01-01 09:31:33', '812517');
WITH RECURSIVE `timestamps`(`timestamp`) AS (
    SELECT
        (
            SELECT
                FROM_UNIXTIME(
                    UNIX_TIMESTAMP(MIN(`timestamp`)) - MOD(UNIX_TIMESTAMP(MIN(`timestamp`)), 1800)
                )
            FROM
                `statuses`
        )
    UNION
    ALL
    SELECT
        DATE_ADD(`timestamp`, INTERVAL 30 MINUTE)
    FROM
        `timestamps`
    WHERE
        `timestamp` < (
            SELECT
                FROM_UNIXTIME(
                    UNIX_TIMESTAMP(MAX(`timestamp`)) - MOD(UNIX_TIMESTAMP(MAX(`timestamp`)), 1800)
                )
            FROM
                `statuses`
        )
)
SELECT
    `t`.`timestamp`,
    `s`.`status`
FROM
    `timestamps` AS `t`
    LEFT OUTER JOIN `statuses` AS `s` ON `t`.`timestamp` = FROM_UNIXTIME(
        UNIX_TIMESTAMP(`s`.`timestamp`) - MOD(UNIX_TIMESTAMP(`s`.`timestamp`), 1800)
    )
ORDER BY
    `t`.`timestamp` ASC;

Solution

  • We can simplify the recursive part a little. As Thorsten Kettner mentions, there is no need to reselect the max status date on each iteration, we can do that in the anchor - and I would also argue that we don't even need to round the max date (although that's micro-optimization).

    When it comes to the outer query, I would not recommend applying functions on the status timestamp; this is the primary key of the table, against which we do want to run a SARGEable predicate: let's use a half-open interval instead (and we don't need to perform the unixtime conversion again here):

    with recursive timestamps (ts, max_ts) as (
        select from_unixtime(floor(unix_timestamp(min(ts)) / 1800) * 1800) ts, max(ts) max_ts
        from statuses
        union all
        select ts + interval 30 minute, max_ts from timestamps where ts + interval 30 minute <= max_ts
    )
    select t.ts, s.status
    from timestamps t
    left join statuses s on s.ts >= t.ts and s.ts < t.ts + interval 30 minute
    order by t.ts
    

    Note that I renamed column timestamp to ts, to avoid clashing with the corresponding SQL keyword.


    If that's not good enough, then one alternative would be to materialize the results of the recursive query in a calendar table.

    You would typically cover a large period of time with the table. You can use the recursive query to create it:

    create table timestamp_calendar as 
    with recursive timestamps as (
        select '2022-01-01 00:00:00' ts, '2022-01-02 12:00:00' max_ts -- short period for testing
        union all
        select ts + interval 30 minute, max_ts 
        from timestamps
        where ts < max_ts
    )
    select * from timestamps;
    

    We can declare a primary key to benefit the underlying index:

    alter table timestamp_calendar add primary key (ts);
    

    Then we can use the table in our query. Ideally, you know in advance which date range you are looking for. But if you don't, we can bring the min/max status dates and use them to pre-filter the calendar table.

    select t.ts, s.status
    from timestamp_calendar t
    inner join (select min(ts) min_ts, max(ts) max_ts from statuses) x 
        on  t.ts >  x.min_ts - interval 30 minute 
        and t.ts <= x.max_ts
    left join statuses s 
        on  s.ts >= t.ts 
        and s.ts <  t.ts + interval 30 minute
    order by t.ts
    

    Here is a demo on DB Fiddle