Search code examples
pythonpostgresqlpysparkaws-glue

AWS Glue - Truncate destination postgres table prior to insert


I am trying to truncate a postgres destination table prior to insert, and in general, trying to fire external functions utilizing the connections already created in GLUE.

Has anyone been able to do so?


Solution

  • I've tried the DROP/ TRUNCATE scenario, but have not been able to do it with connections already created in Glue, but with a pure Python PostgreSQL driver, pg8000.

    1. Download the tar of pg8000 from pypi
    2. Create an empty __init__.py in the root folder
    3. Zip up the contents & upload to S3
    4. Reference the zip file in the Python lib path of the job
    5. Set the DB connection details as job params (make sure to prepend all key names with --). Tick the "Server-side encryption" box.

    Then you can simply create a connection and execute SQL.

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    
    import pg8000
    
    args = getResolvedOptions(sys.argv, [
        'JOB_NAME',
        'PW',
        'HOST',
        'USER',
        'DB'
    ])
    # ...
    # Create Spark & Glue context
    
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    # ...
    config_port = 5432
    conn = pg8000.connect(
        database=args['DB'], 
        user=args['USER'], 
        password=args['PW'],
        host=args['HOST'],
        port=config_port
    )
    query = "TRUNCATE TABLE {0};".format(".".join([schema, table]))
    cur = conn.cursor()
    cur.execute(query)
    conn.commit()
    cur.close()
    conn.close()