Suppose I have this simple AWS Glue 4.0 PySpark job:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
tables = ['table1','table2','table3','table4','table5']
for table in tables:
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="sqlserver",
        connection_options={
            "useConnectionProperties": "true",
            "dbtable": table,
            "connectionName": "my-glue-jdbc-connection",
        },
    )
    glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        format="parquet",
        connection_options={"path": f"s3://my-s3-bucket/{table}", "partitionKeys": []},
        format_options={"compression": "gzip"},
    )
This job completes with the expected output in S3; however, everything inside the for
loop runs sequentially, which makes it pretty slow.
Is there a way to rearrange the code such that each write_dynamic_frame
operation runs in parallel?
I would rearrange the code so that the table name is passed in as a job parameter, then
start a fresh run of the Glue job for each table, so that each table is processed independently
of the others. You can use the boto3 client to start the Glue job runs, or create a simple Lambda function that does the same.
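For example, here is a minimal sketch of starting one job run per table with boto3 (the job name "my-export-job" and the "--TABLE" parameter name are assumptions; also make sure the job's maximum concurrency is at least the number of tables):

import boto3

glue = boto3.client("glue")

tables = ["table1", "table2", "table3", "table4", "table5"]

for table in tables:
    # Each start_job_run call launches an independent run of the same Glue job,
    # so the tables are exported concurrently.
    response = glue.start_job_run(
        JobName="my-export-job",        # hypothetical job name
        Arguments={"--TABLE": table},   # job parameters are passed with a leading "--"
    )
    print(table, response["JobRunId"])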
To read the parameter inside the job, use getResolvedOptions:
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TABLE"])
args['TABLE']
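Putting it together, the per-table job script is essentially your original loop body with the loop removed (a sketch, reusing the connection and bucket names from your script):

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TABLE"])
table = args["TABLE"]

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read only the table this run is responsible for
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "useConnectionProperties": "true",
        "dbtable": table,
        "connectionName": "my-glue-jdbc-connection",
    },
)

# Write it to its own S3 prefix
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    format="parquet",
    connection_options={"path": f"s3://my-s3-bucket/{table}", "partitionKeys": []},
    format_options={"compression": "gzip"},
)

job.commit()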