Search code examples
sqlsnowflake-cloud-data-platformone-hot-encoding

Represent categorical column as One-Hot Encoding using SQL


I want to represent a string column as a binary 1 or 0 by pivoting the string column and making its values as header using SQL (Snowflake). It would python equivalent of pd.get_dummies where the function transform categorical column in to one-hot encoded columns. How do I do that?

Sample Data:

id value
A 'G802'
A 'R620'
A ''
B 'J209'
B 'B009'
C ''
C 'R509'

Expected Output:

id G802 R620 J209 B009 R509
A 1 1 0 0 0
B 0 0 1 1 0
C 0 0 0 0 1

I need the query to be dynamic as I have more and random values in the value column. The order of the column can be anything. The '' value in value can or cannot be part of the pivot table. I have tried a few options with the pivot function but no luck. I very much appreciate your help. Thank you!


Solution

  • It is possible to achieve it with SQL and dynamic PIVOT:

    SELECT *
    FROM (SELECT id, VALUE, IFF(value='',0, 1) AS VALUE2 FROM t)
    PIVOT (MAX(VALUE2) FOR VALUE IN (SELECT VALUE FROM t WHERE VALUE != '') 
           DEFAULT ON NULL (0));
    

    For test data:

    CREATE OR REPLACE TABLE t(id TEXT, VALUE TEXT) AS
    SELECT 'A', 'G802' UNION ALL 
    SELECT 'A', 'R620' UNION ALL 
    SELECT 'A', '' UNION ALL 
    SELECT 'B', 'J209'UNION ALL  
    SELECT 'B', 'B009'UNION ALL  
    SELECT 'C', '' UNION ALL 
    SELECT 'C', 'R509';
    

    Output:

    enter image description here

    How does it work:

    • IFF(value='',0, 1) AS VALUE2 helper column to categorize data 0 or 1
    • PIVOT MAX(VALUE2) - result of pivot based on helper column
    • VALUE IN (SELECT VALUE FROM t WHERE VALUE != '') - dynamic pivot, skipping ''
    • if we don't bother empty column '' then simply VALUE IN (ANY)

    Using VALUE IN (ANY) and EXCLUDE:

    SELECT * EXCLUDE "''"
    FROM (SELECT id, VALUE, IFF(value='',0, 1) AS VALUE2 FROM t)
    PIVOT (MAX(VALUE2) FOR VALUE IN (ANY) DEFAULT ON NULL (0));