I'm new to AWS Glue ETL processing and am trying to implement a job to extract data from an RDS MySQL DB for a specific customer, perform some transformations, and write the results to S3.
What is the best approach to filter the data selected from the source table? Can this be done as part of the source extract, or does it need to be a separate Filter transformation based on a specific key?
If implementing this as a Filter transformation, is there a way to make it dynamic based on job input parameters? Ideally this job will be triggered by an event as part of a user-initiated workflow.
Any help would be much appreciated. TIA
What is the best approach to filter the data selected from the source table? Can this be done as part of the source extract, or does it need to be a separate Filter transformation based on a specific key?
Glue is essentially managed Spark, and Spark has an optimisation called predicate pushdown that applies to filter operations. When reading from a JDBC source such as MySQL, it is very likely that Spark will push the filter directly into the read operation by adding it to the SQL statement sent to the database.
You can check whether that is happening in your case by converting the Glue DynamicFrame into a native Spark DataFrame with the .toDF() method and then calling the explain operation on that DataFrame.
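A minimal sketch of that check, assuming a hypothetical catalog database `my_database`, table `my_customer_table`, and column `customer_id`:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source table via the Glue Data Catalog
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",          # hypothetical catalog database
    table_name="my_customer_table",  # hypothetical catalog table
)

# Convert to a native Spark DataFrame, apply the filter, and inspect the plan
df = dyf.toDF().filter("customer_id = 42")
df.explain()
```

If the predicate was pushed down, the JDBC scan node in the physical plan will list it under `PushedFilters: [...]` rather than showing a separate Filter step above the scan.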
If implementing this as a Filter transformation, is there a way to make it dynamic based on job input parameters? Ideally this job will be triggered by an event as part of a user-initiated workflow.
Yes, you can, but not through the visual UI of Glue Studio; you would need to modify the ETL script manually and read the parameter yourself.
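A minimal sketch of how that could look, reusing the hypothetical `customer_id` parameter name from above (`dyf` is the DynamicFrame read from the source):

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.transforms import Filter

# Resolve the parameter passed at job start, e.g. --customer_id 42
args = getResolvedOptions(sys.argv, ["customer_id"])
customer_id = int(args["customer_id"])

# Apply a Filter transform on the DynamicFrame using the runtime value
filtered_dyf = Filter.apply(
    frame=dyf,
    f=lambda row: row["customer_id"] == customer_id,
)
```

When the job is triggered by an event, the caller supplies the value in the job arguments, for example with boto3: `glue.start_job_run(JobName="my_job", Arguments={"--customer_id": "42"})`.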