Tags: python, pandas, snowflake-cloud-data-platform, dask, parquet

Snowflake unload to S3 as parquet has no column names nor correct data types


The following produces a parquet file in S3:

USE DATABASE SANDBOX;
USE SCHEMA SANDBOX;

CREATE OR REPLACE FILE FORMAT my_parquet_format 
  TYPE = parquet;

COPY INTO @bla/x_
FROM (
    SELECT 
        TOP 10
        xxx AS "id"
    FROM table
)
FILE_FORMAT = (FORMAT_NAME = my_parquet_format)
OVERWRITE=TRUE;

Alas, the column "id" arrives as _COL_0 and the data type is object when I use:

import pandas as pd

s3_path = 's3://ddd/dddd__0_0_0.snappy.parquet'
df = pd.read_parquet(s3_path, engine='pyarrow')
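
Reading with Dask gives the same result; a minimal sketch of that read (assuming dask.dataframe and the same S3 path) would be:

import dask.dataframe as dd

# Lazily read the unloaded parquet file; the column still shows up as _COL_0
ddf = dd.read_parquet(s3_path, engine='pyarrow')
print(ddf.columns)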

I then tried:

USE DATABASE SANDBOX;
USE SCHEMA SANDBOX;

CREATE OR REPLACE FILE FORMAT my_parquet_format 
  TYPE = parquet;

COPY INTO @bla/x_
FROM (
    SELECT 
        TOP 10
        xxx AS "id"
    FROM table
)
FILE_FORMAT = (FORMAT_NAME = my_parquet_format)
OVERWRITE=TRUE HEADER=TRUE;

as some suggested, but it produces a corrupt parquet file. Any ideas? Thanks!


Solution

  • I changed the above as follows. This enforces some schema and pulls through the column names. I guess one cannot enforce pandas' category format?

    CREATE OR REPLACE FILE FORMAT my_parquet_format 
      TYPE = parquet;
    
    COPY INTO @bla/x_
    FROM (
        SELECT 
            TOP 10
            xxx::SMALLINT AS "id"
        FROM table
    )
    file_format = (type = 'parquet')
    header = true overwrite = true;
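
    As for the category question: the parquet file Snowflake writes does not carry pandas' category dtype, so one option is to cast on the client side after reading. A minimal sketch (assuming the same S3 path as above):

    import pandas as pd

    # "id" now arrives with its name and a numeric (rather than object) dtype;
    # convert it to a pandas category after loading.
    df = pd.read_parquet('s3://ddd/dddd__0_0_0.snappy.parquet', engine='pyarrow')
    df['id'] = df['id'].astype('category')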