Search code examples
amazon-web-servicesamazon-s3amazon-redshiftetl

AWS S3: How to plug in a dynamic file name in the S3 directory in COPY command


I have job in Redshift that is responsible for pulling 6 files every month from S3. File names follow a standard naming convention as "file_label_MonthNameYYYY_Batch01.CSV". I'd like to modify the below COPY command to change the file naming in the S3 directory dynamically so I won't have to hard code the Month Name and YYYY and batch number. Batch number ranges 1-6.

Currently, here is what I have which is not efficient:

COPY tbl_name ( column_name1, column_name2, column_name3 )
FROM 'S3://bucket_name/folder_name/Static_File_Label_July2021_Batch01.CSV'
CREDENTIALS 'aws_access_key_id = xxx;aws_secret_access_key = xxxxx'
removequotes
EMPTYASNULL 
BLANKSASNULL 
DATEFORMAT 'MM/DD/YYYY'
delimiter ','
IGNOREHEADER 1;

COPY tbl_name ( column_name1, column_name2, column_name3 )
FROM 'S3://bucket_name/folder_name/Static_File_Label_July2021_Batch02.CSV'
CREDENTIALS 'aws_access_key_id = xxx;aws_secret_access_key = xxxxx'
removequotes
EMPTYASNULL 
BLANKSASNULL 
DATEFORMAT 'MM/DD/YYYY'
delimiter ','
IGNOREHEADER 1;

The dynamic file name shall change to August2021_Batch01 & August2021_Batch02 next month and so forth. Is there a way to do this? Thank you in advance.


Solution

  • There are lots of approaches to this. Which one is best for your case will depend on your circumstances. You need a layer in your process that controls configuring SQL for each month. Here are some ways to consider:

    1. Use a manifest file - This file will have the S3 object names to load. Your processing / file prep can update this file
    2. Use a fixed load folder where the files are located for COPY, then move these files to perm storage location after COPY.
    3. Use variables in you bench to set the Month value and replace this in when the SQL is issued to Redshift.
    4. Write some code (Lambda?) to issue the SQL you are looking for
    5. Last I checked you could leave the object name incomplete and all matching objects would be loaded. Leave off the batch number and suffix and load all the files with one text change.

    It is desirable to load multiple files with a COPY command (uses more nodes in parallel) and options 1, 2, and 5 do this.