Search code examples
mysqlquery-optimization

Optimising MySQL query with date range and join


I have the following query:

SELECT COUNT(*)
  FROM datum d

  JOIN datum_type dt
    ON dt.datum_id = d.id
   AND dt.type_id = '3' 

 WHERE d.added_time >=  DATE_FORMAT(CURDATE(), '%Y-%m')
   AND d.added_time <   DATE_FORMAT(CURDATE() + INTERVAL 1 MONTH, '%Y-%m')

There are indexes on d.id (PRIMARY), d.added_time, dt.datum_id and dt.type_id

The current explain plan is:

+----+-------------+-------+--------+--------------------+---------+---------+-------------+--------+-------------+
| id | select_type | table |  type  |   possible_keys    |   key   | key_len |     ref     |  rows  |    Extra    |
+----+-------------+-------+--------+--------------------+---------+---------+-------------+--------+-------------+
|  1 | SIMPLE      | dt    | ref    | type_id,datum_id   | type_id |       1 | const       | 602628 |             |
|  1 | SIMPLE      | d     | eq_ref | PRIMARY,added_time | PRIMARY |       8 | dt.datum_id |      1 | Using where |
+----+-------------+-------+--------+--------------------+---------+---------+-------------+--------+-------------+

As we have datum records for quite some time, it appears to be joining the type in first using the datum.id PRIMARY and then scanning each joined row to see if the datum.added_time is within the range.

I tried using the added_time index but the explain plan was:

+----+-------------+-------+-------+------------------+------------+---------+------+---------+--------------------------+
| id | select_type | table | type  |  possible_keys   |    key     | key_len | ref  |  rows   |          Extra           |
+----+-------------+-------+-------+------------------+------------+---------+------+---------+--------------------------+
|  1 | SIMPLE      | d     | index | added_time       | added_time |       4 | NULL | 6195194 | Using where; Using index |
|  1 | SIMPLE      | dt    | ref   | type_id,datum_id | datum_id   |       8 | d.id |       1 | Using where              |
+----+-------------+-------+-------+------------------+------------+---------+------+---------+--------------------------+  

Which takes almost as long as there are so many datum_types of different datum_type.type_id within the datum.added_time range.

Is there some combination of index that may speed this up?


Solution

  • Let me assume that added_time is datetime or date. Then, you should express the conditions as strings. Instead, use date constants:

    SELECT COUNT(*)
    FROM datum d JOIN
         datum_type dt
         ON dt.datum_id = d.id AND
            dt.type_id = '3' 
    WHERE d.added_time >= DATE_SUB(CURDATE(), INTERVAL DAY(CURDATE()) - 1 DAY) AND
          d.added_time < DATE_ADD(DATE_SUB(CURDATE(), INTERVAL DAY(CURDATE()) - 1 DAY), INTERVAL 1 MONTH);
    

    This can take advantage of an index on datum(added_time, id) and datum_type(datum_id, type_id).

    If there are no duplicate records (for the count) coming from datum_type, I would suggest that you rewrite the query as:

    SELECT COUNT(*)
    FROM datum d
    WHERE d.added_time >= DATE_SUB(CURDATE(), INTERVAL DAY(CURDATE()) - 1 DAY) AND
          d.added_time < DATE_ADD(DATE_SUB(CURDATE(), INTERVAL DAY(CURDATE()) - 1 DAY), INTERVAL 1 MONTH) AND
          EXISTS (SELECT 1
                  FROM datum_type dt
                  WHERE dt.datum_id = d.id AND dt.type_id = '3'
                 );
    

    If type_id is an integer, then you should drop the single quotes. Mixing different data types in SQL can confuse the optimize and prevent the use of indexes.