I would like to use BigQuery instead of Pandas to create dummy variables (one-hot-encoding) for my categories. I will end up with about 200 columns, therefore I can't do it manually and hard code it
Test dataset (the actual one has many more variables than this one)
WITH table AS (
SELECT 1001 as ID, 'blue' As Color, 'big' AS size UNION ALL
SELECT 1002 as ID, 'yellow' As Color, 'medium' AS size UNION ALL
SELECT 1003 as ID, 'red' As Color, 'small' AS size UNION ALL
SELECT 1004 as ID, 'blue' As Color, 'small' AS size)
SELECT *
FROM table
Expected result:
Below is for BigQuery Standard SQL
DECLARE Colors, Sizes ARRAY<STRING>;
SET (Colors, Sizes) = (SELECT AS STRUCT ARRAY_AGG(DISTINCT Color), ARRAY_AGG(DISTINCT Size) FROM `project.dataset.table`);
EXECUTE IMMEDIATE '''
CREATE TEMP TABLE result AS -- added line
SELECT ID, ''' || (
SELECT STRING_AGG("COUNTIF(Color = '" || Color || "') AS Color_" || Color ORDER BY Color)
FROM UNNEST(Colors) AS Color
) || (
SELECT ', ' || STRING_AGG("COUNTIF(Size = '" || Size || "') AS Size_" || Size ORDER BY Size)
FROM UNNEST(Sizes) AS Size
) || '''
FROM `project.dataset.table`
GROUP BY ID
ORDER BY ID
'''; -- added `;`
SELECT * FROM result; -- added line
If to applied to sample data in your question - the output is as below