Search code examples
mysqlgroup-bytimestamp-with-timezone

Counting rows in event table, grouped by time range, a lot


Imagine I have a table like this:

CREATE TABLE `Alarms` (
    `AlarmId` INT UNSIGNED NOT NULL AUTO_INCREMENT
        COMMENT "32-bit ID",

    `Ended` BOOLEAN NOT NULL DEFAULT FALSE
        COMMENT "Whether the alarm has ended",

    `StartedAt` TIMESTAMP NOT NULL DEFAULT 0
        COMMENT "Time at which the alarm was raised",

    `EndedAt` TIMESTAMP NULL
        COMMENT "Time at which the alarm ended (NULL iff Ended=false)",

    PRIMARY KEY (`AlarmId`),

    KEY `Key4` (`StartedAt`),
    KEY `Key5` (`Ended`, `EndedAt`)
) ENGINE=InnoDB;

Now, for a GUI, I want to produce:

  • a list of days during which at least one alarm were "active"
  • for each day, how many alarms started
  • for each day, how many alarms ended

The intent is to present users with a dropdown box from which they can choose a date to see any alarms active (started before or during, and ended during or after) on that day. So something like this:

+-----------------------------------+
| Choose day                      ▼ |
+-----------------------------------+
|   2017-12-03 (3 started)          |
|   2017-12-04 (1 started, 2 ended) |
|   2017-12-05 (2 ended)            |
|   2017-12-16 (1 started, 1 ended) |
|   2017-12-17 (1 started)          |
|   2017-12-18                      |
|   2017-12-19                      |
|   2017-12-20                      |
|   2017-12-21 (1 ended)            |
+-----------------------------------+

I will probably force an age limit on alarms so that they are archived/removed after, say, a year. So that's the scale we're working with.

I expect anywhere from zero to tens of thousands of alarms per day.

My first thought was a reasonably simple:

(
    SELECT
        COUNT(`AlarmId`) AS `NumStarted`,
        NULL AS `NumEnded`,
        DATE(`StartedAt`) AS `Date`
    FROM `Alarms`
    GROUP BY `Date`
)
UNION
(
    SELECT
        NULL AS `NumStarted`,
        COUNT(`AlarmId`) AS `NumEnded`,
        DATE(`EndedAt`) AS `Date`
    FROM `Alarms`
    WHERE `Ended` = TRUE
    GROUP BY `Date`
);

This uses both of my indexes, with join type ref and ref type const, which I'm happy with. I can iterate over the resultset, dumping the non-NULL values found into a C++ std::map<boost::gregorian::date, std::pair<size_t, size_t>> (then "filling the gaps" for days on which no alarms started or ended, but were active from previous days).

The spanner I'm throwing in the works is that the list should take into account location-based timezones, but only my application knows about timezones. For logistical reasons, the MySQL session is deliberately SET time_zone = '+00:00' so that timestamps are all kicked out in UTC. (Various other tools are then used to perform any necessary location-specific corrections for historical timezones, taking into account DST and whatnot.) For the rest of the application this is great, but for this particular query it breaks the date GROUPing.

Maybe I could pre-calculate (in my application) a list of time ranges, and generate a huge query of 2n UNIONed queries (where n = number of "days" to check) and get the NumStarted and NumEnded counts that way:

-- Example assuming desired timezone is -05:00
-- 
-- 3rd December
(
    SELECT
        COUNT(`AlarmId`) AS `NumStarted`,
        NULL AS `NumEnded`,
        '2017-12-03' AS `Date`
    FROM `Alarms`
    -- Alarm started during 3rd December UTC-5
    WHERE `StartedAt` >= '2017-12-02 19:00:00'
      AND `StartedAt` <  '2017-12-03 19:00:00'
    GROUP BY `Date`
)
UNION
(
    SELECT
        NULL AS `NumStarted`,
        COUNT(`AlarmId`) AS `NumEnded`,
        '2017-12-03' AS `Date`
    FROM `Alarms`
    -- Alarm ended during 3rd December UTC-5
    WHERE `EndedAt` >= '2017-12-02 19:00:00'
      AND `EndedAt` <  '2017-12-03 19:00:00'
    GROUP BY `Date`
)
UNION

-- 4th December
(
    SELECT
        COUNT(`AlarmId`) AS `NumStarted`,
        NULL AS `NumEnded`,
        '2017-12-04' AS `Date`
    FROM `Alarms`
    -- Alarm started during 4th December UTC-5
    WHERE `StartedAt` >= '2017-12-03 19:00:00'
      AND `StartedAt` <  '2017-12-04 19:00:00'
    GROUP BY `Date`
)
UNION
(
    SELECT
        NULL AS `NumStarted`,
        COUNT(`AlarmId`) AS `NumEnded`,
        '2017-12-04' AS `Date`
    FROM `Alarms`
    -- Alarm ended during 4th December UTC-5
    WHERE `EndedAt` >= '2017-12-03 19:00:00'
      AND `EndedAt` <  '2017-12-04 19:00:00'
    GROUP BY `Date`
)
UNION

-- 5th December
-- [..]

But, of course, even if I'm restricting the database to a year's worth of historical alarms, that's up to like 730 UNIONd SELECTs. My spidey senses tell me that this is a very bad idea.

How else can I generate these sort of time-grouped statistics? Or is this really silly and I should look at resolving the problems preventing me from using tzinfo with MySQL?

Must work on MySQL 5.1.73 (CentOS 6) and MariaDB 5.5.50 (CentOS 7).


Solution

  • The UNION approach is actually not far off a viable solution; you can achieve the same thing, without a catastrophically large query, by recruiting a temporary table:

    CREATE TEMPORARY TABLE `_ranges` (
       `Start` TIMESTAMP NOT NULL DEFAULT 0,
       `End`   TIMESTAMP NOT NULL DEFAULT 0,
       PRIMARY KEY (`Start`, `End`)
    );
    
    INSERT INTO `_ranges` VALUES
       -- 3rd December UTC-5
       ('2017-12-02 19:00:00', '2017-12-03 19:00:00'),
       -- 4th December UTC-5
       ('2017-12-03 19:00:00', '2017-12-04 19:00:00'),
       -- 5th December UTC-5
       ('2017-12-04 19:00:00', '2017-12-05 19:00:00'),
       -- etc.
    ;
    
    -- Now the queries needed are simple and also quick:
    
    SELECT
       `_ranges`.`Start`,
       COUNT(`AlarmId`) AS `NumStarted`
    FROM `_ranges` LEFT JOIN `Alarms`
      ON `Alarms`.`StartedAt` >= `_ranges`.`Start`
      ON `Alarms`.`StartedAt` <  `_ranges`.`End`
    GROUP BY `_ranges`.`Start`;
    
    SELECT
       `_ranges`.`Start`,
       COUNT(`AlarmId`) AS `NumEnded`
    FROM `_ranges` LEFT JOIN `Alarms`
      ON `Alarms`.`EndedAt` >= `_ranges`.`Start`
      ON `Alarms`.`EndedAt` <  `_ranges`.`End`
    GROUP BY `_ranges`.`Start`;
    
    DROP TABLE `_ranges`;
    

    (This approach was inspired by a DBA.SE post.)

    Notice that there are two SELECTs — the original UNION is no longer possible, because temporary tables cannot be accessed twice in the same query. However, since we've already introduced additional statements anyway (the CREATE, INSERT and DROP), this seems to be a moot problem in the circumstances.

    In both cases, each row represents one of our requested periods, and the first column equals the "start" part of the period (so that we can identify it in the resultset).

    Be sure to use exception handling in your code as needed to ensure that _ranges is DROPped before your routine returns; although the temporary table is local to the MySQL session, if you're continuing to use that session afterwards then you probably want a clean state, particularly if this function is going to be used again.

    If this is still too heavy, for example because you have many time periods and the CREATE TEMPORARY TABLE itself will therefore become too large, or because multiple statements doesn't fit in your calling code, or because your user doesn't have permission to create and drop temporary tables, you'll have to fall back on a simple GROUP BY over DAY(Date), and ensure that your users run mysql_tzinfo_to_sql whenever the system's tzdata is updated.