For my online shop I have a table that I use for search:
CREATE TABLE `store_search` (
`term` varchar(50) NOT NULL DEFAULT '',
`content_id` int(10) unsigned NOT NULL,
`type` enum('keyword','tag') NOT NULL DEFAULT 'keyword',
`random` int(10) unsigned NOT NULL,
`saving` int(10) unsigned NOT NULL,
PRIMARY KEY (`content_id`,`term`,`type`),
UNIQUE KEY `saving` (`term`,`saving`,`random`,`content_id`,`type`),
UNIQUE KEY `random` (`term`,`random`,`content_id`,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED
Products can be listed in two ways: in a random order (based on the column random) or by discount (based on the column saving). Tests in the past showed that using the UNIQUE indexes for ordering is much more performant than using standard indexes in conjunction with ORDER BY. A query can look like:
mysql> EXPLAIN SELECT content_id FROM store_search USE INDEX (random) WHERE term LIKE 'shirt%' AND type='keyword' LIMIT 2000,100;
+----+-------------+--------------+-------+---------------+--------+---------+------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+---------------+--------+---------+------+---------+--------------------------+
| 1 | SIMPLE | store_search | range | random | random | 152 | NULL | 9870580 | Using where; Using index |
+----+-------------+--------------+-------+---------------+--------+---------+------+---------+--------------------------+
So I can avoid an ORDER BY clause (no filesort is done with this approach). The PRIMARY KEY is used for self joins when searching for multiple terms:
mysql> EXPLAIN SELECT DISTINCT x.content_id
-> FROM store_search x USE INDEX (saving)
-> INNER JOIN store_search y ON x.content_id=y.content_id
-> WHERE x.term LIKE 'shirt%' AND x.type='keyword' AND y.term LIKE 'blue%' AND y.type='keyword'
-> LIMIT 0,100;
+----+-------------+-------+-------+-----------------------+---------+---------+--------------+----------+-------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-----------------------+---------+---------+--------------+----------+-------------------------------------------+
| 1 | SIMPLE | x | range | PRIMARY,saving,random | saving | 152 | NULL | 11449970 | Using where; Using index; Using temporary |
| 1 | SIMPLE | y | ref | PRIMARY,saving,random | PRIMARY | 4 | x.content_id | 20 | Using where; Using index; Distinct |
+----+-------------+-------+-------+-----------------------+---------+---------+--------------+----------+-------------------------------------------+
As I said, this solution is fine so far. My problem now: this table has grown so huge (~500 million rows) that the indexes no longer fit in memory. As a result, INSERT and UPDATE statements are terribly slow. The data takes 23GB and the indexes consume 32GB, so 55GB in total for this table. Testing is possible, but copying this table for tests takes a lot of time. Does anyone have an approach to reduce the index size?
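(Figures like these can be read from information_schema, for example:)

-- Per-table data and index footprint (estimates for compressed tables):
SELECT table_name,
       ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,
       ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'store_search';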
I'd like to convert the character set of the string columns to latin1, but can I also consolidate some of the indexes?
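A minimal sketch of that conversion (untested; it assumes every stored term actually fits into latin1):

-- Shrinks the bytes reserved per character in `term` from 3 (utf8) to 1,
-- cutting the key_len from 152 (50*3+2) to 52 (50+2) in every index
-- that contains `term`.
ALTER TABLE store_search
  MODIFY term varchar(50) CHARACTER SET latin1 NOT NULL DEFAULT '';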
term LIKE 'shirt%' is a range lookup. INDEX(term, ...) will not get past term for filtering in order to get to type or other columns.
This, and other basic indexing principles, are discussed in my Index Cookbook.
So... WHERE term LIKE 'shirt%' AND type='keyword' begs for INDEX(type, term): the equality column first, then the range. And adding any further columns won't help with filtering.
However... What you are depending on is covering. This is where all the desired columns are in a single index. In this case, the query can be performed in the index BTree without touching the Data BTree. That is, adding extra columns can be beneficial.
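For illustration (my sketch, with a made-up index name), a covering index for the random-order query could look like this; the (type, term) prefix does the filtering, and random and content_id are appended purely for covering:

-- Hypothetical covering index: equality column first, range column second;
-- random and content_id are included only so the query below never has to
-- touch the data BTree.
ALTER TABLE store_search
  ADD INDEX type_term_cover (type, term, random, content_id);

-- Served entirely from the index:
SELECT content_id
FROM store_search
WHERE type = 'keyword' AND term LIKE 'shirt%';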
There are multiple things going on in
SELECT content_id
FROM store_search USE INDEX (random)
WHERE term LIKE 'shirt%'
AND type='keyword'
LIMIT 2000,100;
UNIQUE KEY `random` (`term`,`random`,`content_id`,`type`)
Here are some:

* There is no ORDER BY, so the output is probably ordered first by term (assuming there might be more than one phrase beginning with 'shirt'), and only secondarily by random. This is not quite what you wanted, but may happen to work.
* The LIMIT requires it to scan 2000+100 rows of the index, then quit. It will stop short if there are not enough shirts. This probably seemed "fast". (A keyset alternative is sketched after this list.)
* The UNIQUE is probably irrelevant, and wasteful for inserts.
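As an aside (my sketch, not part of the original answer): the deeper the OFFSET, the more index entries get walked. Keyset pagination avoids that by remembering where the previous page ended; the literal values below are placeholders for the last row of the previous page:

-- Keyset pagination sketch: seek past the last row of the previous page
-- instead of counting 2000 entries forward. The tuple matches the leading
-- columns of index `random` (term, random, content_id), so the index can
-- still supply rows in order without a filesort.
SELECT content_id
FROM store_search USE INDEX (random)
WHERE term LIKE 'shirt%'
  AND type = 'keyword'
  AND (term, random, content_id) > ('shirt', 1234567, 890123)  -- placeholders
ORDER BY term, random, content_id
LIMIT 100;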
Next query

Let's dissect SELECT DISTINCT x.content_id ....

* The self join multiplies the rows, and the DISTINCT then deflates the result. There may not be a net gain; time it.
* Without an ORDER BY, you can't predict which 100 will be delivered.
* ORDER BY RAND() won't add much overhead. (See the sketch after this list.)
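For instance (my sketch), the two-term search with a properly randomized page could read:

-- Dedupe first, then randomize the (now small) result set.
SELECT DISTINCT x.content_id
FROM store_search x
INNER JOIN store_search y ON x.content_id = y.content_id
WHERE x.term LIKE 'shirt%' AND x.type = 'keyword'
  AND y.term LIKE 'blue%'  AND y.type = 'keyword'
ORDER BY RAND()
LIMIT 100;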
Bottom line

* Switch to PRIMARY KEY(type, term, content_id). By reaching in via the PK, you effectively get "covering".
* ORDER BY random or ORDER BY RAND() -- see which works better for you. (The latter is more random!)
* LIKE 'shirt%'
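A sketch of that restructure (my rendering of the bullets above; rebuilding a ~500 million row table is slow, so try it on a copy first):

-- Assumption: the secondary UNIQUE keys can be dropped, because ordering is
-- now handled by ORDER BY random / ORDER BY RAND() over the filtered rows.
ALTER TABLE store_search
  DROP PRIMARY KEY,
  DROP INDEX saving,
  DROP INDEX `random`,
  ADD PRIMARY KEY (type, term, content_id);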
The bottom-bottom-line is that EAV schema design sucks. I discuss this further.