Search code examples
amazon-web-servicesamazon-redshiftaws-glueaws-glue-data-catalog

What is catalog_connection param in aws glue?


I wish to regularly run a etl job at every 4 hours which will union (combine) data from s3 bucket (parquet format) and data from redshift. Find out the unique and then write it again to redshift, replacing old redshift data. For writing dataframes to redshift, this

glueContext.write_dynamic_frame.from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx="")

Writes a DynamicFrame using the specified JDBC connection information.
frame – The DynamicFrame to write.
catalog_connection – A catalog connection to use.
connection_options – Connection options, such as path and database table (optional).
redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional).
transformation_ctx – A transformation context to use (optional).

seems to the way. But what does the catalog_connection mean? Does it refer to glue catalog? If yes, then what in glue catalog?


Solution

  • The catalog_connection refers to the glue connection defined inside glue catalog.

    Let's say if there is a connection named redshift_connection in glue connection, it will be used like:

    glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df, 
                   catalog_connection = "redshift_connection",
                   connection_options = {"dbtable": df_name, "database": "testdb"},
                   redshift_tmp_dir = "s3://glue-sample-target/temp-dir/")
    

    Here are some examples with the details:
    https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/