google-bigquery one-hot-encoding dummy-variable

one-hot-encoding (dummy variables) with BigQuery

I would like to use BigQuery instead of Pandas to create dummy variables (one-hot-encoding) for my categories. I will end up with about 200 columns, therefore I can't do it manually and hard code it

Test dataset (the actual one has many more variables than this one)

WITH table AS (
SELECT 1001 as ID, 'blue' As Color, 'big' AS size UNION ALL
SELECT 1002 as ID, 'yellow' As Color, 'medium' AS size UNION ALL
SELECT 1003 as ID, 'red' As Color, 'small' AS size UNION ALL
SELECT 1004 as ID, 'blue' As Color, 'small' AS size)

SELECT *
FROM table

Expected result:

Solution

Below is for BigQuery Standard SQL

DECLARE Colors, Sizes ARRAY<STRING>;

SET (Colors, Sizes) = (SELECT AS STRUCT ARRAY_AGG(DISTINCT Color), ARRAY_AGG(DISTINCT Size) FROM `project.dataset.table`);

EXECUTE IMMEDIATE '''
CREATE TEMP TABLE result AS  -- added line
SELECT ID, ''' || (
  SELECT STRING_AGG("COUNTIF(Color = '" || Color || "') AS Color_" || Color ORDER BY Color)
  FROM UNNEST(Colors) AS Color
) || (
  SELECT ', ' || STRING_AGG("COUNTIF(Size = '" || Size || "') AS Size_" || Size ORDER BY Size)
  FROM UNNEST(Sizes) AS Size
) || '''
FROM `project.dataset.table`
GROUP BY ID
ORDER BY ID
''';  -- added `;`

SELECT * FROM result;  -- added line

If to applied to sample data in your question - the output is as below