I have an AWS Glue job written in Python. It builds a large Pandas dataframe whose data needs to be written to DynamoDB.
I am currently using Glue's "write_dynamic_frame" functionality for this because it copes with issues such as the "500 SlowDown" errors that can occur when writing a large amount of data in a short period of time.
It works, but the actual writing of data to the database is rather slow (over two minutes to write 1,000 records).
My process currently looks like this:
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import DataFrame

# spark (SparkSession) and glue_context (GlueContext) are set up earlier in the job
my_df = {populate Pandas dataframe...}
table_name = "my_dynamodb_table_name"

# Pandas -> Spark -> DynamicFrame
spark_df: DataFrame = spark.createDataFrame(my_df)
result_df: DynamicFrame = DynamicFrame.fromDF(spark_df, glue_context, "result_df")
num_partitions: int = result_df.toDF().rdd.getNumPartitions()

glue_context.write_dynamic_frame.from_options(
    frame=result_df,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": table_name,
        "dynamodb.throughput.write.percent": "1.5",
        "dynamodb.output.retry": "30"
    }
)
Is there any kind of mechanism for the batch writing of data to DynamoDB? I have over a million records that I need to write.
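For context, the only batch mechanism I'm aware of outside Glue is boto3's batch_writer, which buffers writes into 25-item BatchWriteItem requests and resends any unprocessed items, though I don't know whether it handles throttling as gracefully as the Glue connector. A rough sketch of what I mean, reusing my_df and table_name from above:

import json
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table(table_name)

# DynamoDB rejects Python floats ("Float types are not supported"), so
# round-trip the rows through JSON with parse_float=Decimal first.
records = json.loads(my_df.to_json(orient="records"), parse_float=Decimal)

# batch_writer buffers items into 25-item BatchWriteItem requests and
# automatically resends any unprocessed items.
with table.batch_writer() as batch:
    for record in records:
        batch.put_item(Item=record)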
Thanks for any assistance.
The issue, as hinted at by @Parsifal, was the write throughput of my DynamoDB table. Once this was raised to a more suitable value, data was ingested far more quickly.
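For anyone hitting the same problem: in my case the table was using provisioned capacity and its write capacity was far too low for a bulk load. As a rough sketch (illustrative values only), the adjustment amounts to something like:

import boto3

dynamodb = boto3.client("dynamodb")

# Glue's "dynamodb.throughput.write.percent" is relative to the table's
# write capacity, so the table itself needs enough provisioned WCUs.
# Illustrative value only; size it for your own load.
dynamodb.update_table(
    TableName="my_dynamodb_table_name",
    ProvisionedThroughput={
        "ReadCapacityUnits": 5,
        "WriteCapacityUnits": 1000,
    },
)

# Alternatively, switch the table to on-demand capacity:
# dynamodb.update_table(TableName="my_dynamodb_table_name",
#                       BillingMode="PAY_PER_REQUEST")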