Search code examples
snowflake-cloud-data-platformsnowflake-schema

External staging data of parquet returns NULL in Snowflake


I am having an external stage on S3 where the parquet files are stored. Now, I have created a table as T1_D1:

CREATE TABLE T1_D1(ID NUMBER(38,0),
TRANSACTION_DATE TIMESTAMP_NTZ(9),
PRODUCT VARCHAR(16777216))

Now, I am running the below script, to test whether the data I am loading is correct or not.

select 
$1:ID::number,
$1:PRODUCT::VARCHAR(16777216)
from @my_s3/T1/day_2_sales.parquet
(file_format => my_parq)    

It successfully returns the ID numbers, whereas the product is returned as NULL, in all the rows. I have even tried to run with the transaction date column but the same problem persists. (i.e) ID matches and returns the data correctly but timestamp values throw me NULL.

$1:ID::NUMBER, $1:TRANSACTION_DATE::TIMESTAMP_NTZ(9)
1,
2,
3,
6,
7,
8,
9,
6,
7,
8,
9,
10,
11,

As an additional step i have used AWS crawler, on the source parquet files which gives me a schema of the following table as follows.

1 id : bigint
2 transaction_date : string 
3 product : string

Solution

  • Things i have Experimented out are as follows.

    • First i have tried to test with individual columns, i.e tried to print ID, and PRODUCT column by using the following query.

    select $1:ID::number, $1:PRODUCT::VARCHAR(16777216) from @my_s3/T1/day_2_sales.parquet (file_format => my_parq)

     It returned ID's correctly but the product values are null.
    
    • So i have used an AWS crawler, in order to find the data types of the source schema, and have written the snowflake query accordingly. by changing the data types in the snowflake. But still it returns NULL.

    • Then i found out that, the Naming Convention of the columns in the source parquet file are not appropriate. i.e the destination column "ID" is named as "ID" itself in source system, but where as "PRODUCT" is named as "Product" in the source parquet file. and TRANSACTION_DATE is named as Transaction_Date.

    • The below Query worked successfully.

    select $1:ID::number, $1:Transaction_date::timestamp, $1:Product::varchar from @my_s3/T1/day_1_sales.parquet (file_format => my_parq)