python, amazon-web-services, apache-spark, pyspark, aws-glue

AWS Glue Spark Job Extract Database Table Data in Parallel


Suppose I have this simple AWS Glue 4.0 PySpark job:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

tables = ['table1','table2','table3','table4','table5']

for table in tables:
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="sqlserver",
        connection_options={
            "useConnectionProperties": "true",
            "dbtable": table,
            "connectionName": "my-glue-jdbc-connection",
        }
    )
    glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        format="parquet",
        connection_options={"path": f"s3://my-s3-bucket/{table}", "partitionKeys": []},
        format_options={"compression": "gzip"}
    )

This job completes with the expected output in S3; however, everything inside the for loop runs sequentially, so the job is pretty slow.

Is there a way to rearrange the code such that each write_dynamic_frame operation runs in parallel?


Solution

  • I would rearrange the code so that the table name is passed as an argument to the Glue job, and then invoke a fresh instance of the job for each table, so that each table is processed independently of the others. You can use the boto3 client to start a Glue job run per table, or create a simple Lambda function to do the same (see the sketch below the snippet).

    Also, to access the parameter inside the job, you can use getResolvedOptions:

    args = getResolvedOptions(sys.argv, ["JOB_NAME", "TABLE"])
    table = args["TABLE"]  # name of the table this job run should extract
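
    On the invocation side, here is a minimal sketch of what the driver could look like with boto3, assuming the reworked single-table job is named "my-table-export-job" and accepts a TABLE parameter (both names are placeholders):

    import boto3

    glue = boto3.client("glue")

    tables = ['table1', 'table2', 'table3', 'table4', 'table5']

    for table in tables:
        # each start_job_run launches an independent run of the same job,
        # so all five tables are extracted in parallel
        response = glue.start_job_run(
            JobName="my-table-export-job",   # placeholder job name
            Arguments={"--TABLE": table},    # read in the job via getResolvedOptions
        )
        print(f"Started run {response['JobRunId']} for {table}")

    Note that the job's maximum concurrency setting must be raised above the default of 1; otherwise the second start_job_run call will fail with a ConcurrentRunsExceededException.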