Tags: python, amazon-web-services, amazon-dynamodb, aws-glue

AWS: writing from Pandas dataframe to DynamoDB


I have an AWS Glue job written in Python. It builds a large Pandas dataframe whose contents need to be written to DynamoDB.

I am currently using Glue's "write_dynamic_frame" functionality to achieve this because it copes with issues such as "500 SlowDown" errors, which can sometimes be raised when writing large amounts of data in a short period of time.

It is working, but the actual writing of data to the database is rather slow (over 2 minutes to write 1,000 records).

My process currently looks like this:

# glue_context and spark come from the usual Glue job setup (GlueContext / SparkSession)
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import DataFrame

my_df = {populate Pandas dataframe...}
table_name = "my_dynamodb_table_name"

# Pandas -> Spark -> DynamicFrame
spark_df: DataFrame = spark.createDataFrame(my_df)
result_df: DynamicFrame = DynamicFrame.fromDF(spark_df, glue_context, "result_df")

# Not used below; shows how many Spark partitions (and hence parallel write tasks) the frame has
num_partitions: int = result_df.toDF().rdd.getNumPartitions()

glue_context.write_dynamic_frame.from_options(
    frame=result_df,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": table_name,
        "dynamodb.throughput.write.percent": "1.5",
        "dynamodb.output.retry": "30"
    }
)

Is there any kind of mechanism for the batch writing of data to DynamoDB? I have over a million records that I need to write.

Thanks for any assistance.
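
(Editor's note: the kind of batch mechanism being asked about also exists outside Glue. boto3's DynamoDB Table resource provides a batch_writer context manager that buffers put_item calls into BatchWriteItem requests of up to 25 items and resends any unprocessed items. The sketch below is a minimal illustration, not the approach used in the accepted solution; it reuses the table name and dataframe from the question, and the JSON round-trip with Decimal is an assumption about the data, since boto3 rejects plain Python floats.)

import json
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my_dynamodb_table_name")

# Convert the dataframe rows to plain dicts; parsing floats as Decimal avoids
# boto3's "Float types are not supported" error.
items = json.loads(my_df.to_json(orient="records"), parse_float=Decimal)

# batch_writer buffers items into BatchWriteItem calls and automatically
# retries any unprocessed items.
with table.batch_writer() as batch:
    for item in items:
        batch.put_item(Item=item)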


Solution

  • The issue, as hinted by @Parsifal, was the write throughput of my DynamoDB table. Once this was raised to a more suitable value, data was ingested far more quickly (see the sketch below).
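
For anyone hitting the same ceiling: if the table uses provisioned capacity, the write throughput can be raised before a bulk load (or the table switched to on-demand) with a boto3 update_table call. This is a rough sketch; the capacity figures are purely illustrative, not a recommendation.

import boto3

client = boto3.client("dynamodb")

# Raise the provisioned write capacity before the bulk load. Glue's
# "dynamodb.throughput.write.percent" option then consumes a fraction of this value.
client.update_table(
    TableName="my_dynamodb_table_name",
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 1000},
)

# Alternatively, switch the table to on-demand billing so capacity scales with the load:
# client.update_table(
#     TableName="my_dynamodb_table_name",
#     BillingMode="PAY_PER_REQUEST",
# )

# Wait for the table to become ACTIVE again before starting the Glue write.
client.get_waiter("table_exists").wait(TableName="my_dynamodb_table_name")

With more write capacity available on the table, the existing write_dynamic_frame call needs no changes; it simply has more throughput to consume.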