I am trying to load a few GB of data, stored locally in 6 txt files, into tables in a dockerized local DynamoDB instance using Python 3 and the boto3 library.
The problem is speed: the estimated time for loading a single file's data (~10M lines) is about 19 hours. I ran a profiler to find the bottleneck, and most of the time is spent in the boto3 call that writes the items to the database.
def add_batch(self, items, table):
    request = {
        table: []
    }
    if type(items) is not list:
        print(f'\nError while loading item\'s batch, expecting a <list> but got {type(items)}')
        return None
    for item in items:
        request[table].append(
            {
                'PutRequest': {
                    'Item': item
                }
            }
        )
    while request:
        response = self._client.batch_write_item(RequestItems=request)  # by far the slowest call
        if response['UnprocessedItems']:
            request = response['UnprocessedItems']
            print('unprocessed items: ', request)
        else:
            request = None
    return 0
The batch size is 25 items and the provisioned throughput for the table is 100 (I tried many values with little effect).
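For context, this is roughly how I split a file into batches before calling add_batch (simplified, names are illustrative):

# Illustrative caller: stream the txt file and write it in 25-item batches.
# parse_line() stands in for whatever turns a raw line into a DynamoDB item dict.
def load_file(self, path, table, batch_size=25):
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(parse_line(line))  # hypothetical parser, not shown here
            if len(batch) == batch_size:
                self.add_batch(batch, table)
                batch = []
    if batch:  # flush the final partial batch
        self.add_batch(batch, table)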
Understandably, I was getting better results when the container was run with the inMemory option enabled, but I had to turn it off because I can't wait hours for the data to reload every time I restart the container.
At the moment I start the container with this simple command:
docker run -p 8000:8000 amazon/dynamodb-local -jar DynamoDBLocal.jar
I tried some parallelization, but the boto3 library doesn't seem to like it and keeps raising exceptions.
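Roughly the shape of what I tried (heavily simplified, names illustrative):

# Illustrative parallel attempt: fan the 25-item batches out over a thread pool,
# all workers sharing the same loader instance (and therefore the same boto3 client).
from concurrent.futures import ThreadPoolExecutor

def load_file_parallel(self, batches, table, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(self.add_batch, batch, table) for batch in batches]
        for future in futures:
            future.result()  # re-raises any exception hit in a worker thread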
DynamoDB Local is not designed for performance at all. It is merely meant for offline functional development and testing before deploying to the actual DynamoDB service in production.
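That said, if you just want to trim client-side overhead, boto3's higher-level resource API has a batch_writer context manager that handles the 25-item batching and the retries of unprocessed items for you. A minimal sketch, assuming a table called 'my_table' and the local endpoint on port 8000:

import boto3

# Point the resource at the local container; region and credentials are dummies for DynamoDB Local.
dynamodb = boto3.resource(
    'dynamodb',
    endpoint_url='http://localhost:8000',
    region_name='us-west-2',
    aws_access_key_id='dummy',
    aws_secret_access_key='dummy',
)
table = dynamodb.Table('my_table')  # hypothetical table name

# batch_writer buffers items, sends them in batches of 25 and retries unprocessed items.
with table.batch_writer() as writer:
    for item in items:  # items: an iterable of plain Python dicts; the resource API
        writer.put_item(Item=item)  # serializes attribute types, unlike the low-level client format

This won't change DynamoDB Local's fundamental throughput, but it removes the manual retry loop and keeps the client-side write path as lean as possible.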