I want to run an ETL job every 4 hours that unions (combines) data from an S3 bucket (parquet format) with data from Redshift, keeps only the unique rows, and then writes the result back to Redshift, replacing the old Redshift data. For writing DataFrames to Redshift, this
glueContext.write_dynamic_frame.from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx="")
Writes a DynamicFrame using the specified JDBC connection information.
frame – The DynamicFrame to write.
catalog_connection – A catalog connection to use.
connection_options – Connection options, such as path and database table (optional).
redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
transformation_ctx – A transformation context to use (optional).
seems to be the way. But what does catalog_connection
mean? Does it refer to the Glue catalog? If so, what in the Glue catalog?
catalog_connection refers to a Glue connection defined in the Glue Data Catalog. Say there is a connection named redshift_connection
under Glue connections; it would be used like this:
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=m_df,
    catalog_connection="redshift_connection",
    connection_options={"dbtable": df_name, "database": "testdb"},
    redshift_tmp_dir="s3://glue-sample-target/temp-dir/"
)
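To put this in the context of your 4-hourly job, a minimal sketch of the whole flow could look like the following. The S3 path, catalog database/table names, and connection name are placeholders, and it assumes the Redshift table has already been crawled into the Glue Data Catalog. Note that this write appends by default; see the note after the link below for replacing the old data.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read the parquet data from S3 (bucket path is a placeholder)
s3_df = spark.read.parquet("s3://my-bucket/parquet-data/")

# Read the current Redshift table through a crawled Glue Data Catalog table
# ("testdb" and "my_table" are placeholders)
redshift_df = glueContext.create_dynamic_frame.from_catalog(
    database="testdb",
    table_name="my_table",
    redshift_tmp_dir="s3://glue-sample-target/temp-dir/"
).toDF()

# Union both sources and keep only the unique rows
combined_df = s3_df.unionByName(redshift_df).dropDuplicates()

# Write the result back to Redshift through the Glue connection
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(combined_df, glueContext, "combined"),
    catalog_connection="redshift_connection",
    connection_options={"dbtable": "my_table", "database": "testdb"},
    redshift_tmp_dir="s3://glue-sample-target/temp-dir/"
)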
Here are some examples with more details:
https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
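For the "replace old Redshift data" part specifically, that article covers running SQL statements around the Glue load via the preactions/postactions connection options. If I read it correctly, truncating the target table in preactions is one way to replace the contents rather than append to them, e.g. ("my_table" is a placeholder):

# Truncate the target table before Glue copies the new data in,
# so the old contents are fully replaced
connection_options = {
    "dbtable": "my_table",
    "database": "testdb",
    "preactions": "TRUNCATE TABLE my_table;"
}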