Search code examples
structgoogle-bigquerygoogle-cloud-vertex-ai

How to get top 3 columns and their values across multiple columns (dynamically) in BigQuery


I have a table that looks like this

select 'Alice' AS ID, 1 as col1, 3 as col2, -2 as col3, 9 as col4
union all
select 'Bob' AS ID, -9 as col1, 2 as col2, 5 as col3, -6 as col4

I would like to get the top 3 absolute values for each record across the four columns and then format the output as a dictionary or STRUCT like below

select 
'Alice' AS ID, [STRUCT('col4' AS column, 9 AS value), STRUCT('col2',3), STRUCT('col3',-2)] output
union all
select
'Bob' AS ID, [STRUCT('col1' AS column, -9 AS value), STRUCT('col4',-6), STRUCT('col3',5)]
 output

output

I would like it to be dynamic, so avoid writing out columns individually. It could go up to 100 columns that change

For more context, I am trying to get the top three features from the batch local explanations output in Vertex AI https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-batch-predictions

I have looked up some examples, would like something similar to the second answer here How to get max value of column values in a record ? (BigQuery)

EDIT: the data is actually structured like this. If this can be worked with more easily, this would be a better option to work from

select 'Alice' AS ID,  STRUCT(1 as col1, 3 as col2, -2 as col3, 9 as col4) AS featureAttributions
union all
SELECT  'Bob' AS ID, STRUCT(-9 as col1, 2 as col2, 5 as col3, -6 as col4) AS featureAttributions

Solution

  • Consider below query.

    SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
      FROM (
        SELECT * FROM sample_table UNPIVOT (value FOR column IN (col1, col2, col3, col4))
      ) 
     GROUP BY ID;
    

    Query results

    enter image description here

    Dynamic Query

    I would like it to be dynamic, so avoid writing out columns individually

    You need to consider a dynamic SQL for this. By refering to the answer from @Mikhail you linked in the post, you can write a dynamic query like below.

    EXECUTE IMMEDIATE FORMAT("""
      SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
        FROM (
          SELECT * FROM sample_table UNPIVOT (value FOR column IN (%s))
        ) 
       GROUP BY ID
    """, ARRAY_TO_STRING(
      REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT AS STRUCT * EXCEPT (ID) FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
    );
    

    For updated sample table

    SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
      FROM (
        SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table) 
       UNPIVOT (value FOR column IN (col1, col2, col3, col4))
      ) 
     GROUP BY ID;
    
    EXECUTE IMMEDIATE FORMAT("""
      SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
        FROM (
          SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table)
         UNPIVOT (value FOR column IN (%s))
        ) 
       GROUP BY ID
    """, ARRAY_TO_STRING(
      REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT featureAttributions FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
    );