I'm constructing a dynamic query to select dropped domain names from my database. At the moment there are a dozen rows but I'm going to get data soon which will have records of up to 500,000 rows.
The schema is just one table containing 4 columns:
CREATE TABLE `DroppedDomains` (
`domainID` int(11) NOT NULL AUTO_INCREMENT,
`DomainName` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
`DropDate` date DEFAULT NULL,
`TLD` varchar(5) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`domainID`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
I did not create the schema, this is the live database schema. Here's sample data:
I've constructed probably the most complex type of query below. The criteria is as follows:
SELECT any number of domains which
- Start with the word 'starts'
- End with the word 'ends'
- Contain the word 'containsThis' anywhere in the domain name
- Contain the word 'ContainsThisToo' anywhere in the domain name
- Include at least one digit
- The domain name must be at least 49 characters. Multibytes need to count as one character( I used CHAR_LENGTH ).
- The domain name must be at least under 65 characters.
- The TLD must be 'org'
- The DropDate needs to be later than
2009-11-01
Here's my query so far:
SELECT
*
FROM
DroppedDomains
WHERE
1=1
AND DomainName LIKE 'starts%ends'
AND DomainName LIKE '%containsThis%'
AND DomainName LIKE '%containsThisToo%'
AND DomainName LIKE '%-%'
AND DomainName REGEXP '[0-9]'
AND CHAR_LENGTH(DomainName) > 49
AND CHAR_LENGTH(DomainName) < 65
AND TLD = 'org'
AND DropDate > '2009-11-01'
Here are my questions
Would it extremely benefit the performance considering I'll have half a million rows, if I made the TLD
column its own table and just make the TLD
column a foreign key to that? There will only be 5 TLDs ( com, net, org, info, biz ). I realize there are more TLDs in the real world, but this application will only have 5. The user cannot specify their own TLD.
I know that REGEXP
and 500,000 rows is probably a recipe for disaster. Is there anyway I can avoid the REGEXP
?
Are there any other optimizations to the query I can do? Like merge LIKE
s or use other functions such as maybe INSTR
? And should I implement any specific sort of caching mechanism?
When you have a LIKE pattern that starts with a constant prefix and you have an index on that field, then the index can be used to find the rows starting with the prefix very quickly. Luckily you have exactly this situation here:
AND DomainName LIKE 'starts%ends'
If only a few of the values start with starts
then these rows will be found very quickly and the other expressions will only be tested for these rows. You can check that the index is used by running EXPLAIN SELECT ...
.