My MySQL database has over 350 million rows and is growing. It's 32 GB in size right now. I am using SSDs and lots of RAM, but I would like some advice to make sure I am using appropriate indexes.
CREATE TABLE `qcollector` (
`key` bigint(20) NOT NULL AUTO_INCREMENT,
`instrument` char(4) DEFAULT NULL,
`datetime` datetime DEFAULT NULL,
`last` double DEFAULT NULL,
`lastsize` int(10) DEFAULT NULL,
`totvol` int(10) DEFAULT NULL,
`bid` double DEFAULT NULL,
`ask` double DEFAULT NULL,
PRIMARY KEY (`key`),
KEY `datetime_index` (`datetime`)
) ENGINE=InnoDB;
show index from qcollector;
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| qcollector | 0 | PRIMARY | 1 | key | A | 378866659 | NULL | NULL | | BTREE | | |
| qcollector | 1 | datetime_index | 1 | datetime | A | 63144443 | NULL | NULL | YES | BTREE | | |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.03 sec)
select * from qcollector order by datetime desc limit 1;
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
| key | instrument | datetime | last | lastsize | totvol | bid | ask |
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
| 389054487 | ES | 2012-06-29 15:14:59 | 1358.25 | 2 | 2484771 | 1358.25 | 1358.5 |
+-----------+------------+---------------------+---------+----------+---------+---------+--------+
1 row in set (0.09 sec)
A typical query that is slow (it does a full table scan and takes 3-4 minutes):
explain select date(datetime), count(lastsize) from qcollector where instrument = 'ES' and datetime > '2011-01-01' and time(datetime) between '15:16:00' and '15:29:00' group by date(datetime) order by date(datetime) desc;
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+
| 1 | SIMPLE | qcollector | ALL | datetime_index | NULL | NULL | NULL | 378866659 | Using where; Using temporary; Using filesort |
+------+-------------+------------+------+----------------+------+---------+------+-----------+----------------------------------------------+
A couple of ideas for you to consider:
A covering index (that is, an index that includes ALL of the columns referenced in the query) may help some. Such an index is going to require more disk (SSD?) space, but it will remove the need for MySQL to visit the data pages to look up the values of the columns that aren't in the index.
ON qcollector (datetime,instrument,lastsize)
or
ON qcollector (instrument,datetime,lastsize)
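For example, the DDL for either of those might look like this (the index names here are just placeholders):

CREATE INDEX qcollector_dt_instr_ix ON qcollector (datetime, instrument, lastsize);
-- or
CREATE INDEX qcollector_instr_dt_ix ON qcollector (instrument, datetime, lastsize);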
Do you really need to exclude rows that have a NULL value for lastsize from the count? Could you return a count of all rows instead? If you could instead return COUNT(1) or SUM(1), then the query wouldn't need to reference the lastsize column, so it wouldn't be needed in an index to make it a covering index. The COUNT(lastsize) expression is equivalent to SUM(IF(lastsize IS NULL,0,1)).

Do you need to return dates when there are only NULL lastsize values for the datetime range, or could all of the rows with a NULL lastsize be excluded? That is, could you include a predicate like AND lastsize IS NOT NULL in your query?
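As a sketch, the query rewritten with both of those changes would look like this (note that dates where every row in the time window has a NULL lastsize would drop out of the result entirely):

select date(datetime), count(1)
  from qcollector
 where instrument = 'ES'
   and datetime > '2011-01-01'
   and time(datetime) between '15:16:00' and '15:29:00'
   and lastsize is not null
 group by date(datetime)
 order by date(datetime) desc;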
Those may help some.
I think the big problem is that the predicates on the TIME(datetime) expression are not sargable. That is, MySQL won't use an index range scan operation for those. The predicate on the bare datetime column is sargable... that's why the EXPLAIN is showing the datetime_index as a possible key.
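To illustrate, these are the two datetime predicates from your query side by side:

-- sargable: MySQL can use a range scan on datetime_index for this
datetime > '2011-01-01'

-- not sargable: wrapping the column in a function hides it from the index
time(datetime) between '15:16:00' and '15:29:00'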
And the other big problem is that the query is doing GROUP BY and ORDER BY operations on a derived expression, which is going to require MySQL to generate an intermediate result set (as a temporary MyISAM table), and then process that result set. And that can be a lot of heavy lifting when there are lots of rows to process.
As far as table changes, I would consider using separate DATE and TIME columns, and using a TIMESTAMP datatype in place of DATETIME (if you need to store the date and time together). I would rewrite the query to reference the bare DATE and bare TIME columns, and consider adding a covering index that includes all columns referenced in the rewritten query, with the leading columns being the columns with the highest cardinality (and having the most selective predicates in the query).
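As a rough sketch of that approach (the column and index names here are placeholders, the step to backfill the new columns from the existing datetime values is omitted, and an ALTER like this will be expensive on a table this size):

ALTER TABLE qcollector
  ADD COLUMN trade_date DATE,
  ADD COLUMN trade_time TIME;

CREATE INDEX qcollector_instr_date_time_ix
  ON qcollector (instrument, trade_date, trade_time, lastsize);

-- GROUP BY and ORDER BY now reference a bare column rather than a derived expression
select trade_date, count(lastsize)
  from qcollector
 where instrument = 'ES'
   and trade_date >= '2011-01-01'   -- >= so rows from 2011-01-01 itself are still included
   and trade_time between '15:16:00' and '15:29:00'
 group by trade_date
 order by trade_date desc;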