Search code examples
mysqlsqlself-joincorrelated-subquery

How does this self join work?


This is a problem from sqlzoo.net

Here is the table world:

+-------------+-----------+---------+
|    name     | continent |  area   |
+-------------+-----------+---------+
| Afghanistan | Asia      | 652230  |
| Albania     | Europe    | 2831741 |
| Algeria     | Africa    | 28748   |
| ...         | ...       | ...     |
+-------------+-----------+---------+

The Question:

Find the largest country (by area) in each continent, show the continent, the name and the area:

The answer I am trying to understand is:

SELECT continent, name, area 
FROM world x
WHERE area >= ALL (SELECT area 
                   FROM world y
                   WHERE y.continent=x.continent
                   AND area>0)

This code gives:

+-------------+-----------+--------+
| continent   |   name    |  area  |
+-------------+-----------+--------+
|Africa       | Algeria   | 2381741|
|Oceania      | Australia | 7692024|
|South America| Brazil    | 8515767|
|North America| Canada    | 9984670|
|Asia         | China     | 9596961|
|Caribbean    | Cuba      |  109884|
|Europe       | France    |  640679|
|Eurasia      | Russia    |17125242|
+-------------+-----------+--------+

I do not understand how this works. I think that the inner select should yield a table that contains all the areas and the outer select only chooses the biggest (>=). But how does it filter down to a list that seems to have been group by continent? How does y.continent=x.continent AND area>0 work?


Solution

  • To explain, I'll assume a world table with only 5 countries as follows:

    Algeria      2381741
    Australia    7692024
    South Africa 1221037
    New Zealand   268021
    /*And to make it a little interesting:*/
    Algeria Twin 2381741
    

    The sub-query is matched to each row of the base query, 1-at-a-time. This is what's known as a correlated sub-query. Although correlated sub-queries work well, they are generally considered dangerous because of their tendency to yield poor performance characteristics if the optimiser cannot figure out a more efficient, equivalent structure.

    The following table illustrates a logical view on how the data will be evaluated. Note that the database's query engine might be able to internally transform the plan into something mathematically equivalent, but far more efficient.

    +-------------+--------------+--------+
    | continent   |   name       |  area  |
    +-------------+--------------+--------+
    |Africa       | Algeria      | 2381741| >= ALL( /*y.continent='Africa'*/
                                                   2381741, /*Alegria*/
                                                   1221037, /*South Africa*/
                                                   2381741) /*Alegria Twin*/
    |Oceania      | Australia    | 7692024| >= ALL( /*y.continent='Oceania'*/
                                                   7692024, /*Australia*/
                                                   268021)  /*New Zealand*/
    |Africa       | South Africa | 1221037| >= ALL( /*y.continent='Africa'*/
                                                   2381741, /*Alegria*/
                                                   1221037, /*South Africa*/
                                                   2381741) /*Alegria Twin*/
    |Oceania      | New Zealand  |  268021| >= ALL( /*y.continent='Oceania'*/
                                                   7692024, /*Australia*/
                                                   268021)  /*New Zealand*/
    |Africa       | Algeria Twin | 2381741| >= ALL( /*y.continent='Africa'*/
                                                   2381741, /*Alegria*/
                                                   1221037, /*South Africa*/
                                                   2381741) /*Alegria Twin*/
    +-------------+--------------+--------+
    

    From the above, rows 1, 2 and 5 are >= all the sub-query areas. So these are kept, while the other rows are discarded.

    Note that there are a few ways to write the sub-query that will produce the exact same results.

    Being >= all areas on a continent is the same as being = the MAX area on the continent.

    WHERE area = ( SELECT MAX(y.area)
                   FROM   world y
                   WHERE  y.continent=x.continent)
    

    Another way to get the maximum is to get the first row when ordering by area DESC.

    WHERE area = ( SELECT y.area
                   FROM   world y
                   WHERE  y.continent=x.continent
                   ORDER BY y.area DESC LIMIT 1)
    

    However, be careful of the following which seems equivalent, but isn't.

    /* The problem here is that only 1 Algeria will happen to be 
       first in the sub-query. Meaning 1 row will be missing from 
       the final result set. */
    WHERE name = ( SELECT y.name
                   FROM   world y
                   WHERE  y.continent=x.continent
                   ORDER BY y.area DESC LIMIT 1)
    

    Finally, I mentioned that correlated sub-queries may have performance concerns. So it's generally advisable to consider rewriting a correlated sub-query into one that joins directly to the sub-query in the FROM clause if you can. E.g.

    SELECT  x.contient, x.name, x.area
    FROM    world x
            INNER JOIN (
                SELECT MAX(y.area) as max_area, y.continent
                FROM   world y
                GROUP BY y.continent
            ) z ON
                x.continent = z.continent
            AND x.area = z.max_area