I have a very large dynamodb table, and I want to use lambda function triggered by a stream. I would like to work in big batches, of at least 1000 items. But when I connect the lambda, I see it is invoked with tiny batches of 1 or 2 items. I increased the window to 15 seconds, and it doesn't help.
I assume it's because the table has a lot of shards, and every batch gathers items from only one shard. Is this correct?
What can be done in order to increase the batch size?
I wrote a deep-dive blog post about the integration of DynamoDB Streams an Lambda (disclaimer, written by me on the company blog - very relevant to the question) - the images are taken from there.
DynamoDB Streams consist of shards that store a record of changes sequentially. Each storage partition in the table maps to at least one shard of a DynamoDB stream. The shards get split if a shard is full or the throughput is too high.
Conceptually, this is how the Lambda Service polls the stream shards:
Crucially, polling the shards happens in parallel, but batching is always per shard in order to maintain the order of changes and have consistent scale-out behavior.
This diagram shows how the configuration options in the event source mapping influence how processing happens.
Let's focus on your situation. If you have a large number of items, and relatively high throughput, chances are that DynamoDB allocates many storage partitions to handle that throughput. That automatically leads to a large number of stream shards (#shards >= #storage_partitions
).
If your changes are well distributed over the table (which is what you want to distribute the load evenly), this means there aren't many changes written to any single shard at any point in time. So for a batch window of a few seconds (15 in your case), the actual batch size may be low. If the changes are focused on some partitions, you should see a relatively high variance in the batch size (unfortunately, there's no metric for it afaik).
The only thing you can control directly here (without larger architectural changes) is the batch window. If you increase that, you should see larger batch sizes at the expense of higher processing latency.
You could consider having a lambda function write these changes to a kinesis firehose delivery stream, configure it to write records in batches to S3, and have another Lambda respond to objects written to S3. This would increase your latency again, but allows for much larger batch sizes.
(I also considered writing to SQS, but the max batch size you can request from there is 10.)