Search code examples
parquetsnowflake-cloud-data-platform

How do you read ARRAY column from Parquet file in stage?


I have Parquet files with ARRAY data type in an external stage. I can query the columns with $1:::datatype syntax but I'm having trouble figuring out how to query the column that is ARRAY data type.

Querying it without casting shows the value as this: { "list": [ { "element": "1" }, { "element": "4" } ] }

But what I'm expecting is the array: ["1", "4"]

I've tried casting to array or variant but it's just wrapping the value in [] instead of what I want. I can probably get it to how I want using flatten, array_construct, and such but is there a simpler way to get an ARRAY from Parquet?


Solution

  • The Parquet file support in Snowflake appears to use this form of structured output for list/array datatypes encountered in the file schema.

    While there appears to be no way to influence it via options, the real data carried within the produced structure is intact, so you can perform a transformation via UDF to get your stripped array of primitives:

    create or replace function PARQ_LIST_TO_ARRAY(PARQ_LIST VARIANT)
        returns ARRAY
        language javascript
        as
        '
            values = []
            PARQ_LIST.list.forEach( (e) => {
                values.push(e.element)
            })
            return values
        ';
    

    You can also run this over a Parquet file column directly (such as during load):

    > select $1:arr_col_name as lst from @stage/file.parquet;
    lst
    { "list": [ { "element": "1" }, { "element": "4" } ] }
    
    > select PARQ_LIST_TO_ARRAY($1:arr_col_name) as arr from @stage/file.parquet;
    arr
    ["1", "4"]