Search code examples
amazon-s3snowflake-cloud-data-platformexternal-tables

Snowflake S3 Stage/External table and S3 versioning duplicates


I want to use AWS S3 as external storage and use Snowflake external tables to query the data. To not lose/overwrite any data, I have enabled versioning on the S3 bucket. When multiple versions of a file exists, they will show as duplicates in Snowflake and I could find an option to hide them from external stages/external tables.

LIST @my_stage; -- shows duplicate files

SELECT $1, $2, $3 FROM @my_stage; -- returns duplicate records

CREATE OR REPLACE EXTERNAL TABLE my_external_table (
    Name STRING AS (value:c1::STRING),
)
with location = @my_stage;
SELECT * FROM my_external_table; -- shows duplicate records

SELECT DISTINCT row1, row2, row3 from my_external_table; -- manually hide duplicates

Is there any way to only select rows from the latest version of a file without the need to use DISTINCT? Thanks in advance


Solution

  • I turns out that there were actual duplicate files on S3, they have exactly the same name except that one had a whitespace attached at the end of the filename. I could not easily see this in the AWS Console nor via the AWS S3 CLI, but this command shows the file names with " around them:

    aws s3api list-objects --bucket my_bucket