I have been writing a CloudFormation stack in YAML and deploying it to AWS. (For legacy reasons, I unfortunately cannot switch to CDK ;))
The following YAML is part of the CloudFormation stack. It creates a Glue job that loads its ETL script (transform_json_to_parquet.py) from an S3 bucket (see the ScriptLocation line below).
A major limitation of this approach is that it expects transform_json_to_parquet.py to already be present in S3-bucket-name-1, so I have to upload the file to that bucket manually. Is there any way to have transform_json_to_parquet.py uploaded automatically when I deploy the CloudFormation stack to AWS?
TransformJsonDataJob:
  Type: "AWS::Glue::Job"
  Properties:
    Role: !Ref AWSGlueETLJobRole
    Name: "TransformJsonToParquet"
    Description: "Transform JSON to Parquet"
    Timeout: 5
    WorkerType: G.1X
    NumberOfWorkers: 2
    MaxRetries: 0
    Command:
      "Name": "glueetl"
      "ScriptLocation": !Sub s3://<S3-bucket-name-1>/transform_json_to_parquet.py
    DefaultArguments:
      "--s3_json_path": !Sub s3://<S3-bucket-name-2>/
      "--s3_parquet_path": !Sub s3://<S3-bucket-name-3>/
There are two ways to achieve this:
1. Use the "aws cloudformation package" command from the AWS CLI. In your original CloudFormation YAML file, you can point ScriptLocation at the local Glue script; the command uploads the file to an S3 bucket and writes out a copy of the template with the local path replaced by the S3 URL. Doc: https://docs.aws.amazon.com/cli/latest/reference/cloudformation/package.html
2. Use a CloudFormation custom resource. This involves creating a Lambda function that backs the resource, and you can put the Glue script inline in the Lambda function code. Doc: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-custom-resources-lambda.html
I'd recommend trying the first option (the package command) first, since a custom resource adds more complexity.
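As a rough sketch of the package approach: the fragment below assumes the script is stored next to the template as ./transform_json_to_parquet.py. The file names template.yaml and packaged.yaml, the bucket my-artifact-bucket, and the stack name my-glue-stack are placeholders I made up for illustration, not values from the question.

```yaml
# template.yaml (fragment; all file, bucket, and stack names here are placeholders)
#
# Step 1: upload local artifacts and write a template that points at S3:
#   aws cloudformation package --template-file template.yaml \
#     --s3-bucket my-artifact-bucket --output-template-file packaged.yaml
#
# Step 2: deploy the rewritten template (add --capabilities only if the
# stack creates IAM resources, which is an assumption about this stack):
#   aws cloudformation deploy --template-file packaged.yaml \
#     --stack-name my-glue-stack --capabilities CAPABILITY_IAM
TransformJsonDataJob:
  Type: "AWS::Glue::Job"
  Properties:
    Role: !Ref AWSGlueETLJobRole
    Name: "TransformJsonToParquet"
    Command:
      Name: "glueetl"
      # Local path, resolved relative to this template file.
      # The package command replaces it with the S3 URL of the uploaded script.
      ScriptLocation: ./transform_json_to_parquet.py
```

With this setup the script lands in the artifact bucket passed to --s3-bucket rather than in a bucket named in the template, so the manual upload step goes away as long as the package command is run before every deploy.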