Tags: snowflake-cloud-data-platform, parquet, data-warehouse

Why copy a Parquet file into Snowflake when the Parquet schema is lost?


I am confused about the following scenario:

  1. I have a Parquet file in S3.

  2. I copy the Parquet file into Snowflake (specifying Snappy compression and Parquet format).

  3. The file is copied into my table, and I can see the raw column, where the Parquet values appear as JSON. And this is my confusion: the value is JSON, so the Parquet schema is lost.
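For reference, this is roughly the load I run (the stage, table, and file format names here are made up for illustration):

    -- Landing table with a single VARIANT column for the semi-structured data
    CREATE OR REPLACE TABLE raw_events (raw VARIANT);

    -- Named file format for Snappy-compressed Parquet
    CREATE OR REPLACE FILE FORMAT my_parquet_format
      TYPE = PARQUET
      COMPRESSION = SNAPPY;

    -- Load every row of the staged Parquet files into the VARIANT column
    COPY INTO raw_events
      FROM @my_s3_stage/events/
      FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format');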

So I'm wondering: I know the benefits of using Parquet, but I don't see why copying Parquet files into Snowflake is worthwhile when you lose an important piece of information like the schema.

My problem arises when I load raw data from different files with different structures. Then I have to extract and cast the fields of each data structure, one by one, from multiple tables, as in the sketch below. I was wondering if there is a better, more generic way of doing this that relies on the Parquet schemas.
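For example, for each structure I end up hand-writing extraction queries like this (the field names are hypothetical):

    -- Every field has to be pulled out of the VARIANT and cast by hand
    SELECT
        raw:id::NUMBER                AS id,
        raw:name::STRING              AS name,
        raw:created_at::TIMESTAMP_NTZ AS created_at
    FROM raw_events;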


Solution

  • That's my favorite way of working with semi-structured data in Snowflake: copy it into a column of VARIANT type.

    But there is good news if you want to work directly with the schema: Snowflake is improving its capabilities for schema auto-detection:

    INFER_SCHEMA retrieves and returns the schema from a set of staged files.

    GENERATE_COLUMN_DESCRIPTION returns the list of columns necessary to create a table, external table, or view. Here you can modify the columns or data types before you complete the creation of the object.

    CREATE TABLE … USING TEMPLATE expands upon Snowflake’s CREATE TABLE functionality to automatically create the structured table using the detected schema from the staged files with no additional input.
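    As a sketch, this is how the three features chain together (the stage, file format, and table names are assumptions carried over from the question, not Snowflake built-ins):

        -- 1. Inspect the schema Snowflake detects from the staged Parquet files
        SELECT *
        FROM TABLE(
            INFER_SCHEMA(
                LOCATION => '@my_s3_stage/events/',
                FILE_FORMAT => 'my_parquet_format'
            )
        );

        -- 2. Optionally, render that schema as a column list you can review
        --    or edit before creating the object
        SELECT GENERATE_COLUMN_DESCRIPTION(ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'table')
        FROM TABLE(
            INFER_SCHEMA(
                LOCATION => '@my_s3_stage/events/',
                FILE_FORMAT => 'my_parquet_format'
            )
        );

        -- 3. Create a structured table directly from the detected schema
        CREATE OR REPLACE TABLE events
          USING TEMPLATE (
            SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
            FROM TABLE(
                INFER_SCHEMA(
                    LOCATION => '@my_s3_stage/events/',
                    FILE_FORMAT => 'my_parquet_format'
                )
            )
          );

        -- 4. Load the Parquet columns into the typed table by name,
        --    so no per-field extraction or casting is needed
        COPY INTO events
          FROM @my_s3_stage/events/
          FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format')
          MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

    With MATCH_BY_COLUMN_NAME, COPY INTO maps each Parquet column to the table column of the same name, which gives you the generic, schema-driven load the question asks for.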