I want to use AWS S3 as external storage and use Snowflake external tables to query the data.
To not lose/overwrite any data, I have enabled versioning
on the S3 bucket. When multiple versions of a file exists, they will show as duplicates in Snowflake and I could find an option to hide them from external stages/external tables.
LIST @my_stage; -- shows duplicate files
SELECT $1, $2, $3 FROM @my_stage; -- returns duplicate records
CREATE OR REPLACE EXTERNAL TABLE my_external_table (
Name STRING AS (value:c1::STRING),
)
with location = @my_stage;
SELECT * FROM my_external_table; -- shows duplicate records
SELECT DISTINCT row1, row2, row3 from my_external_table; -- manually hide duplicates
Is there any way to only select rows from the latest version of a file without the need to use DISTINCT
?
Thanks in advance
I turns out that there were actual duplicate files on S3, they have exactly the same name except that one had a whitespace attached at the end of the filename. I could not easily see this in the AWS Console nor via the AWS S3 CLI, but this command shows the file names with "
around them:
aws s3api list-objects --bucket my_bucket