Search code examples
mysqlamazon-web-servicespysparkaws-glue

Overwrite MySQL tables with AWS Glue


I have a lambda process which occasionally polls an API for recent data. This data has unique keys, and I'd like to use Glue to update the table in MySQL. Is there an option to overwrite data using this key? (Similar to Spark's mode=overwrite). If not - might I be able to truncate the table in Glue before inserting all new data?

Thanks


Solution

  • The workaround I've come up with, which is a little simpler than the alternative posted, is the following:

    • Create a staging table in mysql, and load your new data into this table.
    • Run the command: REPLACE INTO myTable SELECT * FROM myStagingTable;
    • Truncate the staging table

    This can be done with:

    import sys from awsglue.transforms
    import * from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    import pymysql
    pymysql.install_as_MySQLdb()
    import MySQLdb
    db = MySQLdb.connect("URL", "USERNAME", "PASSWORD", "DATABASE")
    cursor = db.cursor()
    cursor.execute("REPLACE INTO myTable SELECT * FROM myStagingTable")
    cursor.fetchall()
    
    db.close()
    job.commit()