I have the following query that's running pretty slow when executing it on thousands of records.
SELECT
name,
id
FROM
meetings
WHERE
meeting_date < '2014-09-20 11:00:00' AND (
meeting_date >= '2014-09-20 09:00:00' OR
DATE_ADD(meeting_date, INTERVAL meeting_length SECOND) > '2014-09-20 09:00:00'
)
The query checks if meeting_date
overlaps in anyway between 2014-09-20 09:00:00
and 2014-09-20 11:00:00
. The above query covers all the possible overlapping cases. However, DATE_ADD
adds a lot of overhead.
Anyway to optimize DATE_ADD
? Removing DATE_ADD greatly boosts the performance but it won't cover all overlapping cases.
I recommend you eliminate the OR
.
MySQL won't (can't) perform a range scan operation on an index on column meeting_date
when that column is wrapped in a function.
When the comparison is against the bare column, MySQL can do a range scan. But with the comparison to an expression, MySQL has to evaluate that expression for every row in the table, and then comapare.
For a large table, we'd get optimal performance with an index with leading column of meeting_date
.
I think the "trick" to getting better performance is to rewrite the query to introduce some additional domain knowledge. Specifically, what are the MINIMUM and MAXIMUM values for meeting_length
?
I think it's pretty safe to assume it won't be negative. And we probably don't expect it to be zero. But even if the minimum length is greater than zero, we can use zero as our "known" minimum. (It's going to turn out to be more convenient than some other non-zero value.)
What we really need to know is the MAXIMUM value for meeting_length
. If that's a known constant value, that would be great, because we're going to include that value in the query. let's assume the maximum value of meeting_length
is the number of seconds in 7 days.
As a demonstration of what I'm thinking:
SELECT m.name
, m.id
FROM meetings m
WHERE m.meeting_date < '2014-09-20 11:00:00'
AND m.meeting_date > '2014-09-20 09:00:00' + INTERVAL -7 DAY
HAVING m.meeting_date + INTERVAL meeting_length SECOND
> '2014-09-20 09:00:00'
Let's unwrap that a bit.
The first predicate is the same as in your original query... the "start" time of the meeting is before the "end" of the specified period.
The third predicate is the same as in your query too... the "end" of the meeting is after the beginning of the specified period. (My personal preference is to use the + INTERVAL
form to add a duration to datetime.)
So, just like the original query we're looking for overlap.
I'm suggesting that we include another sargable predicate. The addition of this predicate doesn't really change the check for the overlap, given that we have a known minimum of 0 for meeting_length. What it does do is add a fixed lower bound that we can check against.
To explain that a little bit... if a meeting row that satisfies the condition "meeting end is after the period start", then we also know, for that row, that "meeting start is after (period start MINUS meeting length)". And we also know that "meeting start is after (period start MINUS the MAXIMUM possible value of meeting length.
And for most rows, that's going to be a bigger range... but the "trick" is the the predicate that checks that can compare a "bare" column against a constant.
And that means MySQL will be able to use an index range scan operation to satisfy that. The query is of the form:
WHERE meeting_date > const
AND meeting_date < const
And that's perfect for an index range scan. That should benefit performance... assuming there's a suitable index and that significantly limits the number of rows that need to be checked.
But by itself, that returns more rows than we need, we're going to get some meetings that start and end before the start of the period.
So we still need the additional check, to further filter down the rows. But that won't have to be evaluated for every row, only the rows that are pass through the first two predicates.
AND meeting_date + length > const
We just need to MySQL to recognize that it length
won't ever be negative; to recognize that this is actually a "stricter" range, not a broader range. It might work with the AND
, but we can force MySQL to evaluate that condition later, by including it in the HAVING
clause.
HAVING meeting_date + length > const
But, all of that is really just a guess.
We'd really need to take a look at the EXPLAIN output.
If that index with the leading column of meeting_date also includes the id and name columns, then MySQL could satisfy the query entirely from the index, without any need to reference pages in the underlying table. (If that happens, we'll see "Using index" in the EXPLAIN output.)
Earlier, I said it would be convenient if we had a known constant for maximum meeting_length
.
We could also use a query to determine that from the data:
SELECT MAX(meeting_length) FROM meetings
(And index with meeting_length as the leading column will avoid having to do an expensive full scan of the table)
We use that value to derive the "constant" value in the predicate.
We could include that query (as an inline view or a subquery), but that might impact performance. (We'd need to test how "smart" MySQL optimizer is...
We could try it as a subquery:
SELECT m.name
, m.id
FROM meetings m
WHERE m.meeting_date < '2014-09-20 11:00:00'
AND m.meeting_date > '2014-09-20 09:00:00'
- INTERVAL (SELECT MAX(l.meeting_length) FROM meetings l) DAY
HAVING m.meeting_date + INTERVAL meeting_length SECOND
> '2014-09-20 09:00:00'
Or try it as an inline view:
SELECT m.name
, m.id
FROM ( SELECT MAX(l.meeting_length) AS max_seconds
FROM meetings l
) d
CROSS
JOIN meetings m
WHERE m.meeting_date < '2014-09-20 11:00:00'
AND m.meeting_date > '2014-09-20 09:00:00'
- INTERVAL d.max_seconds SECOND
HAVING m.meeting_date + INTERVAL meeting_length SECOND
> '2014-09-20 09:00:00'