Tags: mysql, indexing, innodb, entity-attribute-value, mysql-5.6

MySQL reduce index size for huge table


For my online shop I have a table that I use for search:

CREATE TABLE `store_search` (
  `term` varchar(50) NOT NULL DEFAULT '',
  `content_id` int(10) unsigned NOT NULL,
  `type` enum('keyword','tag') NOT NULL DEFAULT 'keyword',
  `random` int(10) unsigned NOT NULL,
  `saving` int(10) unsigned NOT NULL,
  PRIMARY KEY (`content_id`,`term`,`type`),
  UNIQUE KEY `saving` (`term`,`saving`,`random`,`content_id`,`type`),
  UNIQUE KEY `random` (`term`,`random`,`content_id`,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED

Products can be listed in two ways: in a random order (based on the column random) or by discount (based on the column saving). Tests in the past showed that using UNIQUE constraints for ordering is much more performant than using standard indexes in conjunction with ORDER BY. A query can look like:

mysql> EXPLAIN SELECT content_id FROM store_search USE INDEX (random) WHERE term LIKE 'shirt%' AND type='keyword' LIMIT 2000,100;
+----+-------------+--------------+-------+---------------+--------+---------+------+---------+--------------------------+
| id | select_type | table        | type  | possible_keys | key    | key_len | ref  | rows    | Extra                    |
+----+-------------+--------------+-------+---------------+--------+---------+------+---------+--------------------------+
|  1 | SIMPLE      | store_search | range | random        | random | 152     | NULL | 9870580 | Using where; Using index |
+----+-------------+--------------+-------+---------------+--------+---------+------+---------+--------------------------+

So I can avoid an ORDER BY clause (no filesort is done with this approach). The PRIMARY KEY is used for self joins when searching for multiple terms:

mysql> EXPLAIN SELECT DISTINCT x.content_id
    -> FROM store_search x USE INDEX (saving)
    -> INNER JOIN store_search y ON x.content_id=y.content_id
    -> WHERE x.term LIKE 'shirt%' AND x.type='keyword' AND y.term LIKE 'blue%' AND y.type='keyword'
    -> LIMIT 0,100;
+----+-------------+-------+-------+-----------------------+---------+---------+--------------+----------+-------------------------------------------+
| id | select_type | table | type  | possible_keys         | key     | key_len | ref          | rows     | Extra                                     |
+----+-------------+-------+-------+-----------------------+---------+---------+--------------+----------+-------------------------------------------+
|  1 | SIMPLE      | x     | range | PRIMARY,saving,random | saving  | 152     | NULL         | 11449970 | Using where; Using index; Using temporary |
|  1 | SIMPLE      | y     | ref   | PRIMARY,saving,random | PRIMARY | 4       | x.content_id |       20 | Using where; Using index; Distinct        |
+----+-------------+-------+-------+-----------------------+---------+---------+--------------+----------+-------------------------------------------+

As I said, this solution works fine so far. My problem now: the table has grown so huge (~500 million rows) that the indexes no longer fit in memory, and as a result INSERT and UPDATE statements are terribly slow. The data takes 23GB and the indexes consume 32GB, so 55GB in total for this table. Testing is possible but takes a lot of time because the table has to be copied first. Does anyone have an approach to reduce the index size? I'd like to convert the string columns to latin1, but can I also consolidate some of the indexes?
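
For reference, the character-set change I have in mind is roughly the following (a sketch only; CONVERT TO CHARACTER SET rebuilds the whole table, so I would test it on a copy first):

ALTER TABLE store_search CONVERT TO CHARACTER SET latin1;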


Solution

  • term LIKE 'shirt%' is a range lookup. INDEX(term, ...) will not get past term for filtering in order to get to type or other columns.

    This, and other basic indexing principles, are discussed in my Index Cookbook.

    So... WHERE term LIKE 'shirt%' AND type='keyword' begs for INDEX(type, term). And adding any further columns won't help with filtering.

    However... What you are depending on is covering. This is where all the desired columns are in a single index. In this case, the query can be performed in the index BTree without touching the Data BTree. That is, adding extra columns can be beneficial.

    There are multiple things going on in

    SELECT  content_id
        FROM  store_search USE INDEX (random)
        WHERE  term LIKE 'shirt%'
          AND  type='keyword'
        LIMIT  2000,100; 
    UNIQUE KEY `random` (`term`,`random`,`content_id`,`type`)
    

    Here are some:

    • The index is "covering".
    • There is no ORDER BY, so the output is probably ordered first by term (assuming there might be more than one phrase beginning with 'shirt'), and only secondarily by random. This is not quite what you wanted, but may happen to work (see the sketch after this list).
    • The LIMIT requires it to scan 2000+100 rows of the index, then quit. It will stop short if there are not enough shirts. This probably seemed "fast".
    • UNIQUE is probably irrelevant, and wasteful for inserts.
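
    If an explicit order is wanted instead of relying on the index layout, the query can say so. A sketch against the existing table (not your original query; it may reintroduce a sort, so time it against the current approach):

    SELECT  content_id
        FROM  store_search
        WHERE  term LIKE 'shirt%'
          AND  type='keyword'
        ORDER BY  random
        LIMIT  2000,100;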

    Next query: let's dissect SELECT DISTINCT x.content_id ....

    • You have replaced "filesort" with similar (possibly faster) code for DISTINCT. There may not be a net gain; time it.
    • If there are 999 blue shirts, it will find all 999, then DISTINCTify them, then deliver 100 of them.
    • Without an ORDER BY, you can't predict which 100 will be delivered.
    • Since you are already collecting all 999, adding ORDER BY RAND() won't add much overhead (sketched after this list).
    • Do you really want 'blue-green' shirts to be returned, but not 'light blue'? And what about 'dress%' picking up 'dress pants'? Kinky.
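
    A sketch of that rewrite against the existing table (not your original query; the extra cost is small precisely because all matching rows are collected anyway):

    SELECT DISTINCT  x.content_id
        FROM  store_search x
        INNER JOIN  store_search y  ON x.content_id = y.content_id
        WHERE  x.term LIKE 'shirt%' AND x.type='keyword'
          AND  y.term LIKE 'blue%'  AND y.type='keyword'
        ORDER BY  RAND()
        LIMIT  100;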

    Bottom line

    • Replace the 3 indexes with just PRIMARY KEY(type, term, content_id). By reaching in via the PK, you effectively get "covering" (a sketch of the revised table follows this list).
    • Use ORDER BY random or ORDER BY RAND() -- see which works better for you. (The latter is more random!)
    • Rethink the wild card in LIKE 'shirt%'
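
    A sketch of what that revised table might look like (assuming nothing else depends on the two UNIQUE keys; since InnoDB stores all columns in the clustered index, every query against this table is covered automatically):

    CREATE TABLE `store_search` (
      `term` varchar(50) NOT NULL DEFAULT '',
      `content_id` int(10) unsigned NOT NULL,
      `type` enum('keyword','tag') NOT NULL DEFAULT 'keyword',
      `random` int(10) unsigned NOT NULL,
      `saving` int(10) unsigned NOT NULL,
      PRIMARY KEY (`type`,`term`,`content_id`)   -- the only index; lookups reach in via the clustered PK
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED;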

    The bottom-bottom-line is that EAV schema design sucks. I discuss this further.