Search code examples
sqlgoogle-bigqueryanalytics

How to order table based on total for elements in one column, but the elements are not unique


@Thorsten Kettner mentioned I didn't say what the problem is, so I wanted to add that. Right now the result is ordered by countries alphabetically, but I need it to be ordered by value of "ALL" for each country.

I've been playing with BigQuery public data sets, i.e. bigquery-public-data.geo_international_ports.world_port_index.

I'm selecting 'country' and 'harbor_size' columns and I'm calculating total count of country. I'm adding this as new row with value 'ALL' in column 'harbor_size'. I want to sort on this total count but in a way, where all country elements stay together. Original 'harbor_size' elements: ['S', 'M', 'L', 'V'].

Here's what I have so far: enter image description here

Query:

WITH
  total_n AS 
  (   
    SELECT     
      country,     
      "ALL" AS harbor_size,     
      COUNT(country) as hb_count   
    FROM bigquery-public-data.geo_international_ports.world_port_index
    GROUP BY country 
    ORDER BY country ASC, hb_count DESC
  )
SELECT * FROM total_n 

UNION ALL

SELECT 
  country, 
  harbor_size, 
  COUNT(country) as hb_count 
FROM bigquery-public-data.geo_international_ports.world_port_index
GROUP BY country, harbor_size 
ORDER BY country ASC, hb_count DESC;

So in the end, I'd like to have something like this:

country h_size count remark
US ALL 10,000 <- sort on this
US L 7,000
US M 2,000
US S 1,000
CA ALL 9,000 <- sort on this
CA L 6,000
CA M 2,000
CA S 1,000
... ... ... ...

I was considering having temp table with only 'ALL' in 'harbor_size' then adding index after sorting, so that each country would have an index, then unioning this with the other table (country, harbor_size w/ all elements, count) and sorting on the index.


Solution

  • Your query sorts the countries alphabetically and even has the problem that countries having only one harbour size get their 'ALL' row and size row in arbitrary order, as they have the same count.

    Instead you want to order the countries by the total harbour count descending and then by country and 'ALL' followed by the country's harbor sizes.

    The ORDER BY clause in your CTE is superfluous by the way, because subquery results are unordered data sets. The DBMS is free to ignore this clause altogether.

    You can get the total with GROUPING SETS, as you want to group by country and horbour size and also by country. This makes the query simpler. To get the countries ordered by their total count, use a window function (SUM(COUNT(*)) OVER (PARTITION BY country)).

    SELECT     
      country,
      COALESCE(harbor_size, 'ALL') AS hb_size,     
      COUNT(*) as hb_count   
    FROM bigquery-public-data.geo_international_ports.world_port_index
    GROUP BY GROUPING SETS ( (country, harbor_size), (country) )
    ORDER BY
      SUM(COUNT(*)) OVER (PARTITION BY country) DESC, -- countries with many harbours first
      country ASC, -- get countries with the same harbour count in alphabetical order
      hb_count DESC, -- within each country get the harbour sizes ordered by their count
      hb_size ASC; -- and get 'ALL' before the harbour size row, for countries with just one size