Search code examples
sqlgroup-bygoogle-bigquery

BigQuery groupby, add row with count of 0 for groupby variable if no rows found


I am new to BigQuery and have a problem that I think is broadly applicable to a lot of different group by / data manipulation problems. I run the following query which groups by team, game and by zone, and returns the count for each:

SELECT
  rebounds.team_id,
  rebounds.game_id,
  CASE
    WHEN distance < 4 THEN 'zone1'
    WHEN NOT some_bool THEN 'zone2'
    WHEN distance >= 4 and distance2 < 12 THEN 'zone3'
    WHEN some_bool AND distance >= 4 THEN 'zone4'
  END AS this_zone,
  COUNT(*) AS my_count
FROM t1
GROUP BY 1,2,3

And I get the following table (showing for a single game_id / team_id combination for simplicity to emphasize the question):

team_id game_id this_zone my_count
     12      15     zone1        5
     12      15     zone2        3

In the example above, there were no observations that met the criteria for zone3, and as a result the table only returned two rows for this game_id + team_id combination. However, I would like to have a 3rd row, so that the table looks like this:

team_id game_id this_zone my_count
     12      15     zone1        5
     12      15     zone2        3
     12      15     zone3        2
     12      15     zone4        0

Is this possible, to specify one of the groupby variables (in this case this_zone), and have BigQuery create a row with count 0 if the count is zero?

Edit

I've updated the CASE WHEN to more accurately reflect the complexity of how the zones are created. In short, they rely on two different numerical values (distance and distance2 in this example), as well as a boolean value (some_bool).


Solution

  • Below for BigQuery Standard SQL

    #standardSQL
    SELECT team_id, game_id, this_zone, IFNULL(my_count, 0) AS my_count
    FROM (
      SELECT DISTINCT
        rebounds.team_id,
        rebounds.game_id, 
        this_zone
      FROM t1, UNNEST(['zone1', 'zone2', 'zone3', 'zone4']) this_zone
    ) A 
    LEFT JOIN (
      SELECT
        rebounds.team_id,
        rebounds.game_id,
        CASE
          WHEN distance < 4 THEN 'zone1'
          WHEN NOT some_bool THEN 'zone2'
          WHEN distance >= 4 AND distance2 < 12 THEN 'zone3'
          WHEN some_bool AND distance >= 4 THEN 'zone4'
        END AS this_zone,
        COUNT(*) AS my_count
      FROM t1
      GROUP BY 1,2,3
    ) B
    USING (team_id, game_id, this_zone)   
    

    Above is generic enough and has no dependency on how complex or not your logic - you just generate all expected rows (sub-query A) and left join it on your original query - that's all!