mysql amazon-web-services pyspark aws-glue

Overwrite MySQL tables with AWS Glue

I have a lambda process which occasionally polls an API for recent data. This data has unique keys, and I'd like to use Glue to update the table in MySQL. Is there an option to overwrite data using this key? (Similar to Spark's mode=overwrite). If not - might I be able to truncate the table in Glue before inserting all new data?

Thanks

Solution

The workaround I've come up with, which is a little simpler than the alternative posted, is the following:

Create a staging table in mysql, and load your new data into this table.
Run the command: REPLACE INTO myTable SELECT * FROM myStagingTable;
Truncate the staging table

This can be done with:

import sys from awsglue.transforms
import * from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

import pymysql
pymysql.install_as_MySQLdb()
import MySQLdb
db = MySQLdb.connect("URL", "USERNAME", "PASSWORD", "DATABASE")
cursor = db.cursor()
cursor.execute("REPLACE INTO myTable SELECT * FROM myStagingTable")
cursor.fetchall()

db.close()
job.commit()