I have a database dump from the GeoNames website (http://download.geonames.org/export/dump/) for Great Britain. It consists of approximately 60,000 records.
The table structure is as follows:
CREATE TABLE `geoname` (
`geonameid` INT(11) NOT NULL,
`name` VARCHAR(200) NULL DEFAULT NULL,
`asciiname` VARCHAR(200) NULL DEFAULT NULL COLLATE 'utf8_unicode_ci',
`preferredname` VARCHAR(200) NULL DEFAULT NULL,
`alternatenames` VARCHAR(10000) NULL DEFAULT NULL COLLATE 'utf8_unicode_ci',
`latitude` DECIMAL(10,7) NULL DEFAULT NULL,
`longitude` DECIMAL(10,7) NULL DEFAULT NULL,
`feature_class` CHAR(1) NULL DEFAULT NULL,
`feature_code` VARCHAR(10) NULL DEFAULT NULL,
`country_code` VARCHAR(2) NULL DEFAULT NULL COLLATE 'utf8_unicode_ci',
`cc2` VARCHAR(60) NULL DEFAULT NULL,
`admin1` VARCHAR(20) NULL DEFAULT NULL COLLATE 'utf8_unicode_ci',
`admin2` VARCHAR(80) NULL DEFAULT NULL COLLATE 'utf8_unicode_ci',
`admin3` VARCHAR(20) NULL DEFAULT NULL,
`admin4` VARCHAR(20) NULL DEFAULT NULL,
`population` INT(11) NULL DEFAULT NULL,
`elevation` INT(11) NULL DEFAULT NULL,
`gtopo30` INT(11) NULL DEFAULT NULL,
`timezone` VARCHAR(40) NULL DEFAULT NULL,
`moddate` DATETIME NULL DEFAULT NULL,
PRIMARY KEY (`geonameid`),
INDEX `geoname_name_idx` (`name`),
INDEX `geoname_preferredname_idx` (`preferredname`),
INDEX `geoname_admin1_idx` (`admin1`),
INDEX `geoname_admin2_idx` (`admin2`),
INDEX `geoname_admin3_idx` (`admin3`),
INDEX `geoname_admin4_idx` (`admin4`),
INDEX `geoname_feature_code_idx` (`feature_code`),
INDEX `geoname_feature_class_idx` (`feature_class`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
I've added indexes to the columns I'm going to use in my query. The query is for an autocomplete feature, but it takes far too long to execute: the query below took 26.72 sec, which is very poor for an autocomplete feature:
mysql> SELECT t0.preferredname,
-> t4.preferredname AS town,
-> t3.preferredname AS county,
-> t2.preferredname AS district,
-> t1.preferredname AS admin1,
-> MIN(t0.geonameid)
-> FROM geoname t0
-> LEFT JOIN geoname t1 ON t1.admin1 = t0.admin1 AND t1.feature_code = 'ADM1'
-> LEFT JOIN geoname t2 ON t2.admin2 = t0.admin2 AND t2.feature_code = 'ADM2'
-> LEFT JOIN geoname t3 ON t3.admin3 = t0.admin3 AND t3.feature_code = 'ADM3'
-> LEFT JOIN geoname t4 ON t4.admin4 = t0.admin4 AND t4.feature_code = 'ADM4'
-> WHERE t0.feature_class IN ('P', 'A')
-> AND t0.preferredname LIKE 'preston%'
-> GROUP BY t0.preferredname,
-> t4.preferredname,
-> t3.preferredname,
-> t2.preferredname,
-> t1.preferredname;
+------------------------------+--------------------+--------------------------------+---------------------+----------+-------------------+
| preferredname | town | county | district | admin1 | MIN(t0.geonameid) |
+------------------------------+--------------------+--------------------------------+---------------------+----------+-------------------+
| Preston | NULL | Ellingham | Northumberland | England | 2639911 |
| Preston | NULL | Preston | District of Rutland | England | 2639914 |
| Preston | NULL | Preston | East Yorkshire | England | 2639913 |
| Preston | NULL | Preston District | Lancashire | England | 2639912 |
| Preston | NULL | Weymouth and Portland District | Dorset | England | 2639922 |
| Preston | Dymock | Forest of Dean District | Gloucestershire | England | 2639916 |
| Preston | Preston | Cotswold District | Gloucestershire | England | 2639918 |
| Preston | Preston | Dover District | Kent | England | 2639920 |
| Preston | Preston | North Hertfordshire District | Hertfordshire | England | 2639917 |
| Preston Bagot | Preston Bagot | Stratford-on-Avon District | Warwickshire | England | 2639910 |
| Preston Bisset | Preston Bissett | Aylesbury Vale | Buckinghamshire | England | 2639909 |
| Preston Bissett | Preston Bissett | Aylesbury Vale | Buckinghamshire | England | 7299788 |
| Preston Brook | NULL | Preston Brook | Borough of Halton | England | 7296534 |
| Preston Candover | Preston Candover | Basingstoke and Deane District | Hampshire | England | 2639908 |
| Preston Capes | Preston Capes | Daventry District | Northamptonshire | England | 2639907 |
| Preston District | NULL | Preston District | Lancashire | England | 7290581 |
| Preston Gubbals | NULL | Pimhill | Shropshire | England | 2639906 |
| Preston on Stour | Preston on Stour | Stratford-on-Avon District | Warwickshire | England | 7299630 |
| Preston on the Hill | NULL | Preston Brook | Borough of Halton | England | 2639904 |
| Preston on Wye | NULL | Preston on Wye | Herefordshire | England | 2639903 |
| Preston Park | NULL | NULL | Brighton and Hove | England | 2639921 |
| Preston Patrick | Preston Patrick | South Lakeland District | Cumbria | England | 7298113 |
| Preston Richard | Preston Richard | South Lakeland District | Cumbria | England | 7300167 |
| Preston Road | NULL | Brent | Greater London | England | 2639919 |
| Preston St Mary | Preston St. Mary | Babergh District | Suffolk | England | 2639915 |
| Preston St. Mary | Preston St. Mary | Babergh District | Suffolk | England | 7301329 |
| Preston upon the Weald Moors | NULL | Preston upon the Weald Moors | Telford and Wrekin | England | 2639900 |
| Preston Wynne | NULL | Preston Wynne | Herefordshire | England | 2639899 |
| Preston-on-Tees | NULL | Preston-on-Tees | Stockton-on-Tees | England | 7299560 |
| Preston-under-Scar | Preston-under-Scar | Richmondshire District | North Yorkshire | England | 7291664 |
| Prestonpans | NULL | NULL | East Lothian | Scotland | 2639902 |
+------------------------------+--------------------+--------------------------------+---------------------+----------+-------------------+
31 rows in set (26.72 sec)
mysql>
When I profile the above query, I get the following:
mysql> select substring_index(event_name,'/',-1) as Status, truncate((timer_end-timer_start)/1000000000000,6) as Duration from performance_schema.events_stages_history_long where event_id>=8215932 and event_id<=9810811;
+----------------------+-----------+
| Status | Duration |
+----------------------+-----------+
| starting | 0.000198 |
| checking permissions | 0.000004 |
| checking permissions | 0.000001 |
| checking permissions | 0.000001 |
| checking permissions | 0.000001 |
| checking permissions | 0.000005 |
| Opening tables | 0.000044 |
| init | 0.000088 |
| System lock | 0.000013 |
| optimizing | 0.000022 |
| statistics | 0.075318 |
| preparing | 0.000059 |
| Creating tmp table | 0.000082 |
| Sorting result | 0.000014 |
| executing | 0.000003 |
| Sending data | 24.472337 |
| Creating sort index | 0.000292 |
| end | 0.000007 |
| query end | 0.000022 |
| removing tmp table | 0.000118 |
| closing tables | 0.000024 |
| freeing items | 0.000278 |
| cleaning up | 0.000001 |
+----------------------+-----------+
23 rows in set (0.00 sec)
And when I run the query with EXPLAIN, I get the following:
mysql> EXPLAIN SELECT t0.preferredname,
-> t4.preferredname AS town,
-> t3.preferredname AS county,
-> t2.preferredname AS district,
-> t1.preferredname AS admin1,
-> MIN(t0.geonameid)
-> FROM geoname t0
-> LEFT JOIN geoname t1 ON t1.admin1 = t0.admin1 AND t1.feature_code = 'ADM1'
-> LEFT JOIN geoname t2 ON t2.admin2 = t0.admin2 AND t2.feature_code = 'ADM2'
-> LEFT JOIN geoname t3 ON t3.admin3 = t0.admin3 AND t3.feature_code = 'ADM3'
-> LEFT JOIN geoname t4 ON t4.admin4 = t0.admin4 AND t4.feature_code = 'ADM4'
-> WHERE t0.feature_class IN ('P', 'A')
-> AND t0.preferredname LIKE 'preston%'
-> GROUP BY t0.preferredname,
-> t4.preferredname,
-> t3.preferredname,
-> t2.preferredname,
-> t1.preferredname;
+----+-------------+-------+------------+-------+-----------------------------------------------------+---------------------------+---------+----------------------+------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+-----------------------------------------------------+---------------------------+---------+----------------------+------+----------+---------------------------------------------------------------------+
| 1 | SIMPLE | t0 | NULL | range | geoname_preferredname_idx,geoname_feature_class_idx | geoname_preferredname_idx | 603 | NULL | 55 | 70.01 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | t1 | NULL | ref | geoname_admin1_idx,geoname_feature_code_idx | geoname_feature_code_idx | 33 | const | 4 | 100.00 | Using where |
| 1 | SIMPLE | t2 | NULL | ref | geoname_admin2_idx,geoname_feature_code_idx | geoname_feature_code_idx | 33 | const | 185 | 100.00 | Using where |
| 1 | SIMPLE | t3 | NULL | ref | geoname_admin3_idx,geoname_feature_code_idx | geoname_admin3_idx | 63 | test.t0.admin3 | 14 | 100.00 | Using where |
| 1 | SIMPLE | t4 | NULL | ref | geoname_admin4_idx,geoname_feature_code_idx | geoname_admin4_idx | 63 | test.t0.admin4 | 7 | 100.00 | Using where |
+----+-------------+-------+------------+-------+-----------------------------------------------------+---------------------------+---------+----------------------+------+----------+---------------------------------------------------------------------+
5 rows in set, 1 warning (0.06 sec)
Note that I'm using a GROUP BY clause because the data has duplicate names for child levels.
How can I optimize this query? Any advice, hints, and tips would be much appreciated.
I guess you hope to retrieve localities matching a user-furnished incomplete search string, then join the administrative jurisdictions to provide a more informative autocomplete feature.
The key is retrieving the candidate localities quickly. A subquery like this does that:
SELECT geonameid, preferredname, admin1, admin2, admin3, admin4
  FROM geoname
 WHERE feature_class IN ('P', 'A')
   AND preferredname LIKE 'preston%'
This is the heart of your lookup operation. It can be accelerated by a compound covering index:
CREATE INDEX lookup1
    ON geoname (feature_class, preferredname, admin1, admin2, admin3, admin4);
With that index in place, try the subquery above and see whether it's fast enough for you (subsecond). If it isn't, try this variant:
SELECT geonameid, preferredname, admin1, admin2, admin3, admin4
  FROM geoname
 WHERE feature_class = 'P'
   AND preferredname LIKE 'preston%'
 UNION ALL
SELECT geonameid, preferredname, admin1, admin2, admin3, admin4
  FROM geoname
 WHERE feature_class = 'A'
   AND preferredname LIKE 'preston%'
MySQL's query planner can seek in the index directly to the first eligible row, then retrieve everything it needs by scanning the index sequentially.
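To check whether the lookup is really served by the compound index, you can run it under EXPLAIN. This is a minimal sketch, assuming the index is named lookup1 and was created on the geoname table from the question:

```sql
-- Sketch: confirm the candidate lookup is served by the compound index.
-- Assumes the lookup1 index exists on the geoname table.
EXPLAIN
SELECT geonameid, preferredname, admin1, admin2, admin3, admin4
  FROM geoname
 WHERE feature_class IN ('P', 'A')
   AND preferredname LIKE 'preston%';
```

If the plan works as intended, the key column should show lookup1 and the Extra column should include "Using index", meaning MySQL answers the query from the index alone without reading table rows. geonameid does not need to be listed in the index, because InnoDB stores the primary key in every secondary index entry.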
Then, you use that subquery's result set in your JOIN operation. Now, you only have to process a modest number of relevant rows in the join, rather than the whole mess.
SELECT t0.preferredname,
       t4.preferredname AS town,
       t3.preferredname AS county,
       t2.preferredname AS district,
       t1.preferredname AS admin1,
       MIN(t0.geonameid) AS geonameid
  FROM (
         SELECT geonameid, preferredname, admin1, admin2, admin3, admin4
           FROM geoname
          WHERE feature_class = 'P'
            AND preferredname LIKE 'preston%'
          UNION ALL
         SELECT geonameid, preferredname, admin1, admin2, admin3, admin4
           FROM geoname
          WHERE feature_class = 'A'
            AND preferredname LIKE 'preston%'
       ) t0
  LEFT JOIN geoname t1 ON t1.admin1 = t0.admin1 AND t1.feature_code = 'ADM1'
  LEFT JOIN geoname t2 ON t2.admin2 = t0.admin2 AND t2.feature_code = 'ADM2'
  LEFT JOIN geoname t3 ON t3.admin3 = t0.admin3 AND t3.feature_code = 'ADM3'
  LEFT JOIN geoname t4 ON t4.admin4 = t0.admin4 AND t4.feature_code = 'ADM4'
 GROUP BY t0.preferredname,
          t4.preferredname,
          t3.preferredname,
          t2.preferredname,
          t1.preferredname
Pro tip: Lots of single-column indexes rarely accelerate queries with multiple filtering conditions, especially with range filters such as LIKE 'something%'. Appropriate multi-column indexes are much more helpful.
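The same tip might also apply to the joined tables. The EXPLAIN output in the question shows t1 and t2 being read through the single-column feature_code index, with the adminN comparison applied afterwards as a filter. The following is only a sketch, untested against this data set, and the index names are made up for illustration: compound indexes that start with feature_code and continue with the join column may let each join resolve with one index lookup.

```sql
-- Hypothetical index names; one compound index per administrative level.
-- Column order: the equality filter (feature_code) first, then the join
-- column (adminN), then preferredname so the joined value can be read
-- from the index without visiting the table row.
CREATE INDEX adm1_lookup ON geoname (feature_code, admin1, preferredname);
CREATE INDEX adm2_lookup ON geoname (feature_code, admin2, preferredname);
CREATE INDEX adm3_lookup ON geoname (feature_code, admin3, preferredname);
CREATE INDEX adm4_lookup ON geoname (feature_code, admin4, preferredname);
```

After creating them, re-run EXPLAIN on the full query to see whether the optimizer actually chooses these indexes for t1 through t4. If it does not, they only add write overhead and can be dropped.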