Tags: amazon-web-services, amazon-s3, amazon-redshift, aws-glue

S3 to Redshift Glue Job Duplicating Records


Using the visual editor in AWS Glue, I am trying to write Parquet files, added daily to S3, to a Redshift table. The Redshift node in the visual editor is set to APPEND, and the Redshift access type is Glue Data Catalog tables.

In the Glue workflow, a Glue crawler finds the new folder daily and another crawls the Redshift table; the new folder's data should then simply be appended to the table. This is how it previously worked correctly for me.

However, when creating new jobs now, the code generated for the Redshift node has changed, and the job also re-adds all the data from previously crawled folders. I would like to get this working without manually editing the code, since once you edit the script you can no longer use the visual editor.

This is what the code the visual editor previously generated for the Redshift node looked like:

# Script generated for node Redshift Cluster
RedshiftCluster_node3 = glueContext.write_dynamic_frame.from_catalog(
    frame=SelectFields_node1654607509850,
    database="databasename",
    table_name="table_name",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="RedshiftCluster_node3",
)

This is what the code generated by the visual editor looks like now when you create a new job:

# Script generated for node Amazon Redshift
AmazonRedshift_node3 = glueContext.write_dynamic_frame.from_options(
    frame=SelectFields_node16899376351254,
    connection_type="redshift",
    connection_options={
        "redshiftTmpDir": "s3://aws-glue-assets-560267457632-us-east-1/temporary/",
        "useConnectionProperties": "true",
        "dbtable": "public.tablename",
        "connectionName": "RedshiftJDBCConnectionName",
        "preactions": "CREATE TABLE IF NOT EXISTS public.tablename (col1 VARCHAR, col2 VARCHAR);",
    },
    transformation_ctx="AmazonRedshift_node3",
)

Solution

  • To stop the duplication, set the crawler's recrawl policy to crawl new folders only, so each run picks up only the folders added since the last crawl.
  • Enable job bookmarks on the Glue job, and give every node a unique transformation_ctx. The bookmark keeps its state keyed on the transformation_ctx, so the job remembers the last checkpoint it reached and resumes from there instead of reprocessing data from previous runs.

Minimal sketches of both changes follow.
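If you want to set the recrawl policy from code instead of the console, here is a minimal boto3 sketch; the crawler name is hypothetical:

import boto3

glue = boto3.client("glue")

# Only list folders added since the last crawl instead of re-crawling
# everything; CRAWL_NEW_FOLDERS_ONLY is the API value behind the
# "Crawl new sub-folders only" console option.
glue.update_crawler(
    Name="daily-parquet-crawler",  # hypothetical crawler name
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
)

Job bookmarks are enabled in the job's properties (or with the --job-bookmark-option job-bookmark-enable job parameter), and they only take effect if the script calls job.init()/job.commit() and the source read carries a transformation_ctx. Below is a minimal sketch of how the script should be wired up; the catalog table name for the S3 data is an assumption:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # restores bookmark state at the start of the run

# The unique transformation_ctx is the key under which the bookmark
# records which S3 files this node has already processed.
S3Source_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="databasename",
    table_name="s3_table_name",  # assumed name of the crawled S3 table
    transformation_ctx="S3Source_node1",
)

# ... SelectFields and the Redshift write node from above go here ...

job.commit()  # saves the new bookmark state so the next run starts after it

With bookmarks enabled, even if the crawler still lists old folders, the job skips files it has already processed, so only the new day's Parquet files are appended to Redshift.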