I have a database table that contains over 90,000 rows.
Each row contains a word that is either an adjective, adverb, noun, or verb (as indicated by a Type column).
I need to run a MySQL query that returns a small number (e.g. 3 - 10) of randomly chosen nouns where the length of the word is between two supplied numbers (e.g. 4 - 8).
The performance of the queries I've tried is not that great.
Here's the table structure:
CREATE TABLE `words` (
`id` int(11) NOT NULL,
`type` char(1) COLLATE utf8_unicode_ci NOT NULL,
`word` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`variations` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`free_writing_prompt` smallint(1) NOT NULL DEFAULT '0',
`word_length` smallint(5) NOT NULL DEFAULT '0'
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
ALTER TABLE `words`
ADD PRIMARY KEY (`id`),
ADD UNIQUE KEY `word_type` (`type`,`word`),
ADD KEY `type` (`type`),
ADD KEY `Word Length` (`word_length`);
ALTER TABLE `words`
MODIFY `id` int(11) NOT NULL AUTO_INCREMENT;
The free_writing_prompt column is set to 0 or 1, 1 meaning it's a valid option for the query, and 0 meaning ignore it. At the moment, all of the nouns have this column set to 1, but the plan is to change many words to 0 because they are not nouns that should be returned.
Here's the row count, by Type:
Adjectives: 21,499
Adverbs: 4,475
Nouns: 58,670
Verbs: 8,978
This is the first query I tried:
SELECT word
FROM words
WHERE type='n'
AND free_writing_prompt=1
AND CHAR_LENGTH(word)>=4
AND CHAR_LENGTH(word)<=8
ORDER BY RAND()
LIMIT 3;
After this, I decided to add the word_length column, which was not originally part of the table definition, because using the CHAR_LENGTH function on all those rows seemed inefficient when I could pre-calculate and store that value instead.
I also thought it would be worthwhile building a Key index on the word_length column, but I'm not convinced it's really helped.
I then tried the following three queries, but they all still take about 16 seconds to run:
SELECT word
FROM words
WHERE type='n'
AND free_writing_prompt=1
AND word_length>=4
AND word_length<=8
ORDER BY RAND()
LIMIT 3;
SELECT word
FROM words
WHERE type='n'
AND free_writing_prompt=1
AND word_length IN (
SELECT word_length
FROM words
WHERE word_length>=4
AND word_length<=8
)
ORDER BY RAND()
LIMIT 3;
SELECT word
FROM words
WHERE word_length IN (
SELECT word_length
FROM words
WHERE type='n'
AND free_writing_prompt=1
AND word_length>=4
AND word_length<=8
)
ORDER BY RAND()
LIMIT 3;
I then tried this Union query, which was fractionally faster (but maybe not enough to be statistically significant):
SELECT * FROM (
SELECT word FROM words WHERE type='n' AND free_writing_prompt=1 AND word_length=4
UNION
SELECT word FROM words WHERE type='n' AND free_writing_prompt=1 AND word_length=5
UNION
SELECT word FROM words WHERE type='n' AND free_writing_prompt=1 AND word_length=6
UNION
SELECT word FROM words WHERE type='n' AND free_writing_prompt=1 AND word_length=7
UNION
SELECT word FROM words WHERE type='n' AND free_writing_prompt=1 AND word_length=8
) a
ORDER BY RAND() LIMIT 3;
Does anybody have any ideas on how to optimize this query to improve the run time?
Composite index:
WHERE type='n'
AND free_writing_prompt=1
AND word_length>=4
AND word_length<=8
begs for
INDEX(type, free_writing_prompt, -- in either order
word_length) -- last
Even better, is to stretch it into a covering index:
INDEX(type, free_writing_prompt, -- in either order
word_length, -- last (for indexing purposes)
word) -- to make "covering"
Your first SELECT
will work well with this covering index. Alas it won't be perfect since it will build a temp table of several thousand, sort it, then peel off 3 rows.