Search code examples
amazon-s3snowflake-cloud-data-platformregexp-replacebulk-load

How do you bulk load parquet files into Snowflake from AWS S3?


I'm trying to bulk load 28 parquet files into Snowflake from an S3 bucket using the COPY command and regex pattern matching. But each time I run the command in my worksheet, I'm getting the following bad response:

Copy executed with 0 files processed.

Inside a folder in my S3 bucket, the files I need to load into Snowflake are named as follows:

  • S3://bucket/foldername/filename0000_part_00.parquet
  • S3://bucket/foldername/filename0001_part_00.parquet
  • S3://bucket/foldername/filename0002_part_00.parquet

...

  • S3://bucket/foldername/filename0026_part_00.parquet
  • S3://bucket/foldername/filename0027_part_00.parquet

Using the Snowflake worksheet, I'm trying to load data into a pre-existing table, using the following commands:

CREATE or REPLACE file format myparquetformat type = 'parquet';

COPY INTO [Database].[Schema].[Table] FROM (
      SELECT $1:field1::VARCHAR(512), $1:field2::INTEGER, $1:field3::VARCHAR(512),
      $1:field4::DOUBLE, $1:field5::VARCHAR(512), $1:field6::DOUBLE
      FROM @AWS_Snowflake_Stage/foldername/
      (FILE_FORMAT => 'myparquetformat', PATTERN => 
'filename00[0-9]+_part_00.parquet')
     )
on_error = 'continue';

I'm not sure why these commands fail to run.

In every example I've seen in the Snowflake documentation, "PATTERN" is only used within the COPY command outside of a SELECT query. I'm not sure if it's possible to use PATTERN inside a SELECT query.

In this case, I think it's necessary to use the SELECT query within the COPY command, since I'm loading in parquet data that would first need to be cast from a single column ($1) into multiple columns with appropriate data types for the table (varchar, integer, double). The SELECT query is what enables the importing of the parquet file into the existing table -- is it possible to find a way around this using a separate staging table?

It's a huge pain to load the parquet files one at a time. Is there any way to bulk load these 28 parquet files using the Snowflake worksheet? Or is it better to try to do this using a Python script and the Snowflake API?


Solution

  • For me the below worked, I agree my pattern is quite simple to select all parquet file in the location, but you can probably verify if the regex pattern is valid.

    COPY INTO <TABLE_NAME> FROM (
        SELECT
            $1:col_name_1,
            $1:col_name_2
        FROM @STAGE_NAME/<PATH_TO_FILES>
    )
    PATTERN = '.*.parquet'
    FORCE = TRUE
    FILE_FORMAT = (
        TYPE = 'parquet'
    );
    

    Side note, Keep in mind that Snowflake has a safety check to skip files if it has already been Staged and loaded once successfully.