Search code examples
google-bigqueryone-hot-encodingdummy-variable

one-hot-encoding (dummy variables) with BigQuery


I would like to use BigQuery instead of Pandas to create dummy variables (one-hot-encoding) for my categories. I will end up with about 200 columns, therefore I can't do it manually and hard code it

Test dataset (the actual one has many more variables than this one)

WITH table AS (
SELECT 1001 as ID, 'blue' As Color, 'big' AS size UNION ALL
SELECT 1002 as ID, 'yellow' As Color, 'medium' AS size UNION ALL
SELECT 1003 as ID, 'red' As Color, 'small' AS size UNION ALL
SELECT 1004 as ID, 'blue' As Color, 'small' AS size)

SELECT *
FROM table

enter image description here

Expected result:

enter image description here


Solution

  • Below is for BigQuery Standard SQL

    DECLARE Colors, Sizes ARRAY<STRING>;
    
    SET (Colors, Sizes) = (SELECT AS STRUCT ARRAY_AGG(DISTINCT Color), ARRAY_AGG(DISTINCT Size) FROM `project.dataset.table`);
    
    EXECUTE IMMEDIATE '''
    CREATE TEMP TABLE result AS  -- added line
    SELECT ID, ''' || (
      SELECT STRING_AGG("COUNTIF(Color = '" || Color || "') AS Color_" || Color ORDER BY Color)
      FROM UNNEST(Colors) AS Color
    ) || (
      SELECT ', ' || STRING_AGG("COUNTIF(Size = '" || Size || "') AS Size_" || Size ORDER BY Size)
      FROM UNNEST(Sizes) AS Size
    ) || '''
    FROM `project.dataset.table`
    GROUP BY ID
    ORDER BY ID
    ''';  -- added `;`
    
    SELECT * FROM result;  -- added line   
    

    If to applied to sample data in your question - the output is as below

    enter image description here