I'm confused about the following scenario:
I have a Parquet file in S3.
I copy the Parquet file into Snowflake (specifying Snappy compression and Parquet format).
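Roughly like this (a minimal sketch; the stage, file format, and table names are placeholders):

```sql
-- Named file format for Snappy-compressed Parquet
CREATE OR REPLACE FILE FORMAT my_parquet_format
  TYPE = PARQUET
  COMPRESSION = SNAPPY;

-- Single-column landing table; each Parquet row arrives as one VARIANT value
CREATE OR REPLACE TABLE raw_events (raw VARIANT);

-- Copy the staged Parquet file(s) into the VARIANT column
COPY INTO raw_events
  FROM @my_s3_stage/events/
  FILE_FORMAT = (FORMAT_NAME = 'my_parquet_format');
```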
The file is copied into my table, and I can see the raw column, where the Parquet values appear as JSON. And this is my confusion: the value is JSON, so the Parquet schema is lost.
So I'm wondering: I know the benefits of using Parquet, but I don't see why copying Parquet files into Snowflake is worthwhile when you lose an important piece of information like the schema.
My problem arises when I load raw data from different files with different structures. Then I have to extract and cast the fields of each structure, one by one, across multiple tables. I was wondering if there is a better, more generic way of doing this that relies on the Parquet schemas.
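Today each structure needs its own hand-written projection along these lines (the field names here are only illustrative):

```sql
-- One bespoke extraction per file structure: every field pulled out and cast by hand
SELECT
  raw:"id"::NUMBER                 AS id,
  raw:"event_time"::TIMESTAMP_NTZ  AS event_time,
  raw:"payload":"amount"::FLOAT    AS amount
FROM raw_events;
```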
That's my favorite way of working with semi-structured data in Snowflake: Copy it into a VARIANT type.
But there is good news if you want to work directly with the schema: Snowflake is improving its capabilities for schema auto-detection:
INFER_SCHEMA retrieves and returns the schema from a set of staged files.
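A sketch of how you might call it, reusing the placeholder stage and file format names from above:

```sql
-- Returns one row per detected column: name, type, nullability, etc.
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@my_s3_stage/events/',
    FILE_FORMAT => 'my_parquet_format'
  )
);
```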
GENERATE_COLUMN_DESCRIPTION returns the list of columns necessary to create a table, external table, or view. Here you can modify the columns or data types before you complete the creation of the object.
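For instance, something like this returns a column list you can review and edit before creating the object (same placeholder names as above):

```sql
-- Produces the column definitions ('table' can also be 'external_table' or 'view')
SELECT GENERATE_COLUMN_DESCRIPTION(
  ARRAY_AGG(OBJECT_CONSTRUCT(*)),
  'table'
)
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@my_s3_stage/events/',
    FILE_FORMAT => 'my_parquet_format'
  )
);
```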
CREATE TABLE … USING TEMPLATE expands upon Snowflake’s CREATE TABLE functionality to automatically create the structured table using the detected schema from the staged files with no additional input.
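Putting it together, a sketch of creating a structured table straight from the detected schema (names are still placeholders):

```sql
-- Creates the table with columns derived from the Parquet schema in the staged files
CREATE TABLE events_structured
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(
      INFER_SCHEMA(
        LOCATION => '@my_s3_stage/events/',
        FILE_FORMAT => 'my_parquet_format'
      )
    )
  );
```

After that, you can typically COPY the same staged Parquet files into the new table with `MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE`, so the fields land in properly typed columns instead of a single VARIANT.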