Search code examples
sqlsnowflake-cloud-data-platformsnowflake-connector

Snowpark Python: how to loop over columns / execute code specific to column


I have one big table in a snowflake db which I want to split into smaller tables according to a column while flattening one column into many columns.

The big table shows animals of three categories (lion, tiger, zebra). I want to split it up into a separate lion, tiger and zebra table. On top I want to flatten a json blob (column "Details") into different columns.

How can I do that?

One way to do it is to write a user defined function with snowpark (Python), convert the table to a pandas DataFrame and then use normal Python code. I think there is a simpler way without the costly transformation to a pandas DataFrame. Maybe there even is a solution in pure SQL.

Original table

Animal Name Details (json blob)
Lion Georg lion key1: value1, lion key2: value2
Tiger John tiger key1: value1, tiger key2: value2, tiger key3: value3
Lion Patrick lion key1: value1, lion key2: value2
Tiger Sam tiger key1: value1, tiger key2: value2, tiger key3: value3
Lion Paul lion key1: value1, lion key2: value2
Zebra Sarah zebra key1: value1

New table: Lion table

Name lion key1 lion key2
Georg value1 value2
Patrick value1 value2
Paul value1 value2

New table: tiger table

Name tiger key1 tiger key2 tiger key3
John value1 value2 value3
Sam value1 value2 value3

New table: zebra table

Name zebra key1
Sarah value1

Solution

  • First, let's setup your data so we can play with it:

    create temp table all_animals as
    with data as (
    select split(value, '\t') x, x[0]::string animal, x[1]::string name
        , parse_json('{' || regexp_replace(x[2], '([a-z]+) key([0-9]): (value[0-9])', '"\\1_key\\2": "\\3"')  || '}') details
    from table(split_to_table(
    $$Lion  Georg   lion key1: value1, lion key2: value2
    Tiger   John    tiger key1: value1, tiger key2: value2, tiger key3: value3
    Lion    Patrick lion key1: value1, lion key2: value2
    Tiger   Sam tiger key1: value1, tiger key2: value2, tiger key3: value3
    Lion    Paul    lion key1: value1, lion key2: value2
    Zebra   Sarah   zebra key1: value1$$
    , '\n'))  
    )
    select * 
    from data
    

    enter image description here

    Now let's create the tables where we will insert the data:

    create temp table lions (name string, v1 string, v2 string);
    create temp table tigers (name string, v1 string, v2 string, v3 string);
    

    And now comes the answer to the question: Snowflake SQL supports conditional inserts, so we can insert each row into a different table with a different schema:

    insert first
    when animal='Lion' 
    then into lions (name, v1, v2) values (name, details:lion_key1, details:lion_key2)
    when animal='Tiger' 
    then into tigers (name, v1, v2, v3) values (name, details:tiger_key1, details:tiger_key2, details:tiger_key3)
        
    select *
    from all_animals
    ;
    

    As seen above, use INSERT WHEN to look at each row and decide into which table you'll insert them into, each with possibly a different schema.

    For this solution you need to know the schema of each resulting table. If you don't know that, then we can explore in a different question how to create tables after exploring the keys to be flattened out of objects.