Search code examples
sqlgroup-bygoogle-bigquerymaxgreatest-n-per-group

SQL - BigQuery - Using Group & MAX in several columns - Similar to a pivot table


How would you approach this via SQL? Let's take this example

| id | type | score_a | score_b | score_c | label_a | label_b | label_c |
|----|------|---------|---------|---------|---------|---------|---------|
| 1  | A    | 0.9     |         |         | L1      |         |         |
| 1  | B    |         | 0.7     |         |         | L2      |         |
| 1  | B    |         | 0.2     |         |         | L3      |         |
| 1  | C    |         |         | 0.2     |         |         | L4      |
| 1  | C    |         |         | 0.18    |         |         | L5      |
| 1  | C    |         |         | 0.12    |         |         | L6      |
| 2  | A    | 0.6     |         |         | L1      |         |         |
| 2  | A    | 0.3     |         |         | L2      |         |         |

I want to return the max score per type in conjunction with the label_X, Almost, like a pivot table but with these custom column names. So the outcome of the above will be like:

| id | type | score_a | label_a | score_b | label_b | score_c | label_c |
|----|------|---------|---------|---------|---------|---------|---------|
| 1  | A    | 0.9     | L1      | 0.7     | L2      | 0.2     | L4      |
| 2  | A    | 0.6     | L1      | NULL    | NULL    | NULL    | NULL    |

Something like this is wrong as it yields both results per type per label

SELECT id,
  MAX(score_a) as score_a,
  label_a,
  MAX(score_b) as score_b,
  label_b as label_b,
  MAX(score_c) as score_c,
  label_c
FROM sample_table 
GROUP BY id, label_a, label_b, label_c

Is there an easy way to do this via SQL, I'm doing it right now from BigQuery and tried also pivot table as described here but still no luck on how to flatten these into one big row with several columns

Any other ideas?

UPDATE

Expanding on what BGM mentioned about design; the source of this data is a table with the following form:

| id | type | label | score |
|----|------|-------|-------|
| 1  | A    | L1    | 0.9   |
| 1  | B    | L2    | 0.7   |
| 1  | B    | L3    | 0.2   |
| 1  | C    | L4    | 0.6   |
| 1  | C    | L5    | 0.2   |

That gets converted to a flattened state as depicted at the top of this question using a query like

 SELECT id,
      type,
      MAX(CASE WHEN type = 'A' THEN score ELSE 0 END) as score_a,
      MAX(CASE WHEN type = 'B' THEN score ELSE 0 END) as score_b,
      MAX(CASE WHEN type = 'C' THEN score ELSE 0 END) as score_c,
      MAX(CASE WHEN model_type = 'theme' THEN label_score ELSE 0 END) as 
      -- labels
      (CASE WHEN type = 'A' THEN label ELSE '' END) as label_a,
      (CASE WHEN type = 'B' THEN label ELSE '' END) as label_b,
      (CASE WHEN type = 'C' THEN label ELSE '' END) as label_c,
    FROM table
    GROUP id, label_a, label_b, label_c

Do you think the intermediate step is unnecessary to get to the final solution?


Solution

  • You can do conditional aggregation. In Big Query, arrays come handy for this:

    select
        id,
        max(score_a) score_a,
        array_agg(label_a order by score_a desc limit 1)[offset(0)] label_a,
        max(score_b) score_b,
        array_agg(label_b order by score_b desc limit 1)[offset(0)] label_b,
        max(score_c) score_c,
        array_agg(label_c order by score_c desc limit 1)[offset(0)] label_c
    from mytable
    group by id
    

    Note: in terms of design, you should not have multiple columns to store the scores and labels per types; you already have a column that represents the types, so you should have just two columns for the store and type.