Imagine I have a table like this:
CREATE TABLE `Alarms` (
`AlarmId` INT UNSIGNED NOT NULL AUTO_INCREMENT
COMMENT "32-bit ID",
`Ended` BOOLEAN NOT NULL DEFAULT FALSE
COMMENT "Whether the alarm has ended",
`StartedAt` TIMESTAMP NOT NULL DEFAULT 0
COMMENT "Time at which the alarm was raised",
`EndedAt` TIMESTAMP NULL
COMMENT "Time at which the alarm ended (NULL iff Ended=false)",
PRIMARY KEY (`AlarmId`),
KEY `Key4` (`StartedAt`),
KEY `Key5` (`Ended`, `EndedAt`)
) ENGINE=InnoDB;
Now, for a GUI, I want to produce:
The intent is to present users with a dropdown box from which they can choose a date to see any alarms active (started before or during, and ended during or after) on that day. So something like this:
+-----------------------------------+
| Choose day ▼ |
+-----------------------------------+
| 2017-12-03 (3 started) |
| 2017-12-04 (1 started, 2 ended) |
| 2017-12-05 (2 ended) |
| 2017-12-16 (1 started, 1 ended) |
| 2017-12-17 (1 started) |
| 2017-12-18 |
| 2017-12-19 |
| 2017-12-20 |
| 2017-12-21 (1 ended) |
+-----------------------------------+
I will probably force an age limit on alarms so that they are archived/removed after, say, a year. So that's the scale we're working with.
I expect anywhere from zero to tens of thousands of alarms per day.
My first thought was a reasonably simple:
(
SELECT
COUNT(`AlarmId`) AS `NumStarted`,
NULL AS `NumEnded`,
DATE(`StartedAt`) AS `Date`
FROM `Alarms`
GROUP BY `Date`
)
UNION
(
SELECT
NULL AS `NumStarted`,
COUNT(`AlarmId`) AS `NumEnded`,
DATE(`EndedAt`) AS `Date`
FROM `Alarms`
WHERE `Ended` = TRUE
GROUP BY `Date`
);
This uses both of my indexes, with join type ref
and ref type const
, which I'm happy with. I can iterate over the resultset, dumping the non-NULL
values found into a C++ std::map<boost::gregorian::date, std::pair<size_t, size_t>>
(then "filling the gaps" for days on which no alarms started or ended, but were active from previous days).
The spanner I'm throwing in the works is that the list should take into account location-based timezones, but only my application knows about timezones. For logistical reasons, the MySQL session is deliberately SET time_zone = '+00:00'
so that timestamps are all kicked out in UTC. (Various other tools are then used to perform any necessary location-specific corrections for historical timezones, taking into account DST and whatnot.) For the rest of the application this is great, but for this particular query it breaks the date GROUP
ing.
Maybe I could pre-calculate (in my application) a list of time ranges, and generate a huge query of 2n UNION
ed queries (where n = number of "days" to check) and get the NumStarted
and NumEnded
counts that way:
-- Example assuming desired timezone is -05:00
--
-- 3rd December
(
SELECT
COUNT(`AlarmId`) AS `NumStarted`,
NULL AS `NumEnded`,
'2017-12-03' AS `Date`
FROM `Alarms`
-- Alarm started during 3rd December UTC-5
WHERE `StartedAt` >= '2017-12-02 19:00:00'
AND `StartedAt` < '2017-12-03 19:00:00'
GROUP BY `Date`
)
UNION
(
SELECT
NULL AS `NumStarted`,
COUNT(`AlarmId`) AS `NumEnded`,
'2017-12-03' AS `Date`
FROM `Alarms`
-- Alarm ended during 3rd December UTC-5
WHERE `EndedAt` >= '2017-12-02 19:00:00'
AND `EndedAt` < '2017-12-03 19:00:00'
GROUP BY `Date`
)
UNION
-- 4th December
(
SELECT
COUNT(`AlarmId`) AS `NumStarted`,
NULL AS `NumEnded`,
'2017-12-04' AS `Date`
FROM `Alarms`
-- Alarm started during 4th December UTC-5
WHERE `StartedAt` >= '2017-12-03 19:00:00'
AND `StartedAt` < '2017-12-04 19:00:00'
GROUP BY `Date`
)
UNION
(
SELECT
NULL AS `NumStarted`,
COUNT(`AlarmId`) AS `NumEnded`,
'2017-12-04' AS `Date`
FROM `Alarms`
-- Alarm ended during 4th December UTC-5
WHERE `EndedAt` >= '2017-12-03 19:00:00'
AND `EndedAt` < '2017-12-04 19:00:00'
GROUP BY `Date`
)
UNION
-- 5th December
-- [..]
But, of course, even if I'm restricting the database to a year's worth of historical alarms, that's up to like 730 UNION
d SELECT
s. My spidey senses tell me that this is a very bad idea.
How else can I generate these sort of time-grouped statistics? Or is this really silly and I should look at resolving the problems preventing me from using tzinfo with MySQL?
Must work on MySQL 5.1.73 (CentOS 6) and MariaDB 5.5.50 (CentOS 7).
The UNION
approach is actually not far off a viable solution; you can achieve the same thing, without a catastrophically large query, by recruiting a temporary table:
CREATE TEMPORARY TABLE `_ranges` (
`Start` TIMESTAMP NOT NULL DEFAULT 0,
`End` TIMESTAMP NOT NULL DEFAULT 0,
PRIMARY KEY (`Start`, `End`)
);
INSERT INTO `_ranges` VALUES
-- 3rd December UTC-5
('2017-12-02 19:00:00', '2017-12-03 19:00:00'),
-- 4th December UTC-5
('2017-12-03 19:00:00', '2017-12-04 19:00:00'),
-- 5th December UTC-5
('2017-12-04 19:00:00', '2017-12-05 19:00:00'),
-- etc.
;
-- Now the queries needed are simple and also quick:
SELECT
`_ranges`.`Start`,
COUNT(`AlarmId`) AS `NumStarted`
FROM `_ranges` LEFT JOIN `Alarms`
ON `Alarms`.`StartedAt` >= `_ranges`.`Start`
ON `Alarms`.`StartedAt` < `_ranges`.`End`
GROUP BY `_ranges`.`Start`;
SELECT
`_ranges`.`Start`,
COUNT(`AlarmId`) AS `NumEnded`
FROM `_ranges` LEFT JOIN `Alarms`
ON `Alarms`.`EndedAt` >= `_ranges`.`Start`
ON `Alarms`.`EndedAt` < `_ranges`.`End`
GROUP BY `_ranges`.`Start`;
DROP TABLE `_ranges`;
(This approach was inspired by a DBA.SE post.)
Notice that there are two SELECT
s — the original UNION
is no longer possible, because temporary tables cannot be accessed twice in the same query. However, since we've already introduced additional statements anyway (the CREATE
, INSERT
and DROP
), this seems to be a moot problem in the circumstances.
In both cases, each row represents one of our requested periods, and the first column equals the "start" part of the period (so that we can identify it in the resultset).
Be sure to use exception handling in your code as needed to ensure that _ranges
is DROP
ped before your routine returns; although the temporary table is local to the MySQL session, if you're continuing to use that session afterwards then you probably want a clean state, particularly if this function is going to be used again.
If this is still too heavy, for example because you have many time periods and the CREATE TEMPORARY TABLE
itself will therefore become too large, or because multiple statements doesn't fit in your calling code, or because your user doesn't have permission to create and drop temporary tables, you'll have to fall back on a simple GROUP BY
over DAY(Date)
, and ensure that your users run mysql_tzinfo_to_sql
whenever the system's tzdata is updated.