How does this self join work?

This is a problem from sqlzoo.net

Here is the table world:

+-------------+-----------+---------+
|    name     | continent |  area   |
+-------------+-----------+---------+
| Afghanistan | Asia      | 652230  |
| Albania     | Europe    | 2831741 |
| Algeria     | Africa    | 28748   |
| ...         | ...       | ...     |
+-------------+-----------+---------+

The Question:

Find the largest country (by area) in each continent, show the continent, the name and the area:

The answer I am trying to understand is:

SELECT continent, name, area 
FROM world x
WHERE area >= ALL (SELECT area 
                   FROM world y
                   WHERE y.continent=x.continent
                   AND area>0)

This code gives:

+-------------+-----------+--------+
| continent   |   name    |  area  |
+-------------+-----------+--------+
|Africa       | Algeria   | 2381741|
|Oceania      | Australia | 7692024|
|South America| Brazil    | 8515767|
|North America| Canada    | 9984670|
|Asia         | China     | 9596961|
|Caribbean    | Cuba      |  109884|
|Europe       | France    |  640679|
|Eurasia      | Russia    |17125242|
+-------------+-----------+--------+

I do not understand how this works. I think that the inner select should yield a table that contains all the areas and the outer select only chooses the biggest (>=). But how does it filter down to a list that seems to have been group by continent? How does y.continent=x.continent AND area>0 work?

Solution

To explain, I'll assume a world table with only 5 countries as follows:

Algeria      2381741
Australia    7692024
South Africa 1221037
New Zealand   268021
/*And to make it a little interesting:*/
Algeria Twin 2381741

The sub-query is matched to each row of the base query, 1-at-a-time. This is what's known as a correlated sub-query. Although correlated sub-queries work well, they are generally considered dangerous because of their tendency to yield poor performance characteristics if the optimiser cannot figure out a more efficient, equivalent structure.

The following table illustrates a logical view on how the data will be evaluated. Note that the database's query engine might be able to internally transform the plan into something mathematically equivalent, but far more efficient.

+-------------+--------------+--------+
| continent   |   name       |  area  |
+-------------+--------------+--------+
|Africa       | Algeria      | 2381741| >= ALL( /*y.continent='Africa'*/
                                               2381741, /*Alegria*/
                                               1221037, /*South Africa*/
                                               2381741) /*Alegria Twin*/
|Oceania      | Australia    | 7692024| >= ALL( /*y.continent='Oceania'*/
                                               7692024, /*Australia*/
                                               268021)  /*New Zealand*/
|Africa       | South Africa | 1221037| >= ALL( /*y.continent='Africa'*/
                                               2381741, /*Alegria*/
                                               1221037, /*South Africa*/
                                               2381741) /*Alegria Twin*/
|Oceania      | New Zealand  |  268021| >= ALL( /*y.continent='Oceania'*/
                                               7692024, /*Australia*/
                                               268021)  /*New Zealand*/
|Africa       | Algeria Twin | 2381741| >= ALL( /*y.continent='Africa'*/
                                               2381741, /*Alegria*/
                                               1221037, /*South Africa*/
                                               2381741) /*Alegria Twin*/
+-------------+--------------+--------+

From the above, rows 1, 2 and 5 are >= all the sub-query areas. So these are kept, while the other rows are discarded.

Note that there are a few ways to write the sub-query that will produce the exact same results.

Being >= all areas on a continent is the same as being = the MAX area on the continent.

WHERE area = ( SELECT MAX(y.area)
               FROM   world y
               WHERE  y.continent=x.continent)

Another way to get the maximum is to get the first row when ordering by area DESC.

WHERE area = ( SELECT y.area
               FROM   world y
               WHERE  y.continent=x.continent
               ORDER BY y.area DESC LIMIT 1)

However, be careful of the following which seems equivalent, but isn't.

/* The problem here is that only 1 Algeria will happen to be 
   first in the sub-query. Meaning 1 row will be missing from 
   the final result set. */
WHERE name = ( SELECT y.name
               FROM   world y
               WHERE  y.continent=x.continent
               ORDER BY y.area DESC LIMIT 1)

Finally, I mentioned that correlated sub-queries may have performance concerns. So it's generally advisable to consider rewriting a correlated sub-query into one that joins directly to the sub-query in the FROM clause if you can. E.g.

SELECT  x.contient, x.name, x.area
FROM    world x
        INNER JOIN (
            SELECT MAX(y.area) as max_area, y.continent
            FROM   world y
            GROUP BY y.continent
        ) z ON
            x.continent = z.continent
        AND x.area = z.max_area