I'm trying to load more than 20 million records into my DynamoDB table using the code below from a 5-node EMR cluster. But it is taking many hours to load completely. I have much more data to load, and I want to load it within a span of a few minutes. How can I achieve this?
Below is my code. I just changed the original column names, and I have 20 columns to insert. The problem here is the slow loading.
import boto3
import json
import decimal

dynamodb = boto3.resource('dynamodb', region_name='us-west')
table = dynamodb.Table('EMP')

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='mybucket', Key='emp-rec.json')
records = json.loads(obj['Body'].read().decode('utf-8'), parse_float=decimal.Decimal)

with table.batch_writer() as batch:
    for rec in records:
        batch.put_item(Item=rec)
First, you should use Amazon CloudWatch to check whether you are hitting the limit for your configured Write Capacity Units on the table. If so, you can increase the capacity, at least for the duration of the load.
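As a rough sketch (the region name and the capacity numbers below are assumptions, not values from your setup), you can read the consumed-write metric from CloudWatch and, if the table uses provisioned mode, raise the write capacity with update_table before the load:

import datetime

import boto3

REGION = 'us-west-2'   # assumed region; use your table's actual region
TABLE = 'EMP'

# How much write capacity has the table consumed over the last hour?
cloudwatch = boto3.client('cloudwatch', region_name=REGION)
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/DynamoDB',
    MetricName='ConsumedWriteCapacityUnits',
    Dimensions=[{'Name': 'TableName', 'Value': TABLE}],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.utcnow(),
    Period=300,
    Statistics=['Sum'],
)
print(stats['Datapoints'])

# Temporarily raise provisioned write capacity for the bulk load.
# This only applies to tables in provisioned mode, and the numbers
# below are placeholders; size them to your own load rate.
dynamodb = boto3.client('dynamodb', region_name=REGION)
dynamodb.update_table(
    TableName=TABLE,
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 10000,
    },
)

Remember to scale the capacity back down (or switch to on-demand mode) once the load is finished, since provisioned capacity is billed whether you use it or not.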
Second, make sure the code is not creating batches of one record, which wouldn't be very efficient. The batch_writer() can be used to process multiple records, as in this sample code from the batch_writer() documentation:
with table.batch_writer() as batch:
    for _ in xrange(1000000):
        batch.put_item(Item={'HashKey': '...',
                             'Otherstuff': '...'})
Notice how the for loop is inside the batch_writer()? That way, multiple records are stored within one batch. If the for loop sits outside the batch_writer() block, a new batch is opened for every record, which results in a batch size of one.
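For comparison, here is a minimal sketch that contrasts the two patterns, reusing the table and records objects created in your question's code. batch_writer() buffers items and sends them in BatchWriteItem calls of up to 25 items each, but only if it stays open across the whole loop:

# 'table' and 'records' are the objects created in the question's code above.

# Slow pattern: re-entering batch_writer() for every record means each
# context flushes immediately, so every item becomes its own request.
for rec in records:
    with table.batch_writer() as batch:
        batch.put_item(Item=rec)

# Fast pattern: one batch_writer() context around the whole loop, so
# items are buffered and written in batches of up to 25 per request.
with table.batch_writer() as batch:
    for rec in records:
        batch.put_item(Item=rec)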