Search code examples
mysqlsqlrelational-databasequery-optimizationquery-performance

Comparing Query performance: Join Vs Select Distinct From Table


I have two tables person and city. person table and city table are connected using city_id in person. person table contains around a million rows and city table have around 10000 rows.

indexes on person: index1: id, index2: city_id
indexes on city:   index1: id

I need to select all those cities which do not have a person row associated with it. city and person table are as follows (demo data).

CITY                PERSON

id  city            id  name   city_id
-------------       ------------------
1    city-1         1   name-1   1
2    city-2         2   name-2   2
3    city-3         3   name-3   2
4    city-4         4   name-4   3
5    city-5         5   name-5   1
6    city-6         6   name-6   3
7    city-7         7   name-7   4
8    city-8         8   name-8   8

I have written two queries to get the result:

query1:

     select c.id, c.city 
     from city c 
     left join person p on c.id = p.city_id  
     where p.id is null

query2:

     select * 
     from city 
     where id not in ( select distinct city_id from person)

execution plan of both the queries looks similar:

For query 1: mysq for query 2: enter image description here

Then I used profiling and ran both queries a couple of times to see how much time are they taking:

query1: 0.000729 0.000737 0.000763
query2: 0.000857 0.000840 0.000852

Clearly from above data query1 outperforms query2.

I am confused as what I understand query2 should outperform query1. Because query2's nested query is using city_id which is indexed and mysql can take advantage of city_id index to get all id's but query1 is using join which will take cartesian product of both tables. Is it because I used less data f. person(1000) and city(200) records.

What am I missing due to which query1 is performing better than query2.

Edit

From mysql documentation:

covering index: An index that includes all the columns retrieved by a query. Instead of using 
the index values as pointers to find the full table rows, the query returns values 
from the index structure, saving disk I/O

This was the assumption which I made when I came up with query2.


Solution

  • Your performance differences are very small. You really have to run queries multiple times to see if the differences are relevant. The number of rows is quite small as well. In all likelihood, all the data is on just one or two data pages. So, you can't generalize from your example (even if the results are correct).

    I would recommend writing this as:

    select c.* 
    from city c
    where not exists (select 1 from person p where p.city_id = c.id);
    

    And for performance, you want an index on person(city_id).

    This probably has the same execute plan as the left join. I just find it a clearer statement of intent -- and it generally has very good performance on any database.

    The not in is not exactly equivalent. Here are some reasons:

    1. The select distinct could throw off the optimizer. It is not needed, but some databases might actually run a distinct.
    2. NULLs are handled differently. If any row in the subquery returns a NULL value, then no rows at all will be returned from the outer query.