Tags: amazon-web-services, aws-lambda, parallel-processing, amazon-kinesis

Parallelization factor: AWS Kinesis data streams to Lambda


I'm very confused about the concept of ParallelizationFactor.


My understanding

https://stackoverflow.com/a/57534322/13000229
In the past, one KDS shard could send data to only one Lambda instance/invocation at a time. Multiple Lambda instances could not process data from the same KDS shard concurrently.

https://aws.amazon.com/blogs/compute/new-aws-lambda-scaling-controls-for-kinesis-and-dynamodb-event-sources/
In Nov 2019, a new parameter ParallelizationFactor (Concurrent batches per shard) came out.

The default factor of one exhibits normal behavior. A factor of two allows up to 200 concurrent invocations on 100 Kinesis data shards.
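The arithmetic in that quote, and where the setting lives, can be sketched as below. This is a minimal sketch, not a definitive setup: the stream ARN and function name are hypothetical placeholders, and the helper names max_concurrent_invocations and mapping_params are made up for illustration. The parameter names are those accepted by the Lambda CreateEventSourceMapping API, and the 1 to 10 range is the documented limit for ParallelizationFactor.

```python
def max_concurrent_invocations(shard_count: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent invocations for one Kinesis event source."""
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("ParallelizationFactor must be between 1 and 10")
    return shard_count * parallelization_factor


def mapping_params(stream_arn: str, function_name: str,
                   batch_size: int, parallelization_factor: int) -> dict:
    """Build keyword arguments for create_event_source_mapping (illustrative helper)."""
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": batch_size,
        "ParallelizationFactor": parallelization_factor,
    }


# 100 shards with a factor of 2 allow up to 200 concurrent invocations.
print(max_concurrent_invocations(100, 2))

# Hypothetical ARN and function name, for illustration only.
params = mapping_params(
    "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    "example-function",
    batch_size=2,
    parallelization_factor=2,
)
# With AWS credentials configured, the mapping could then be created with:
#   import boto3
#   boto3.client("lambda").create_event_source_mapping(**params)
```

The product is only an upper bound: how many invocations actually run at once also depends on how many distinct partition keys each shard holds.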


Questions

  1. By using ParallelizationFactor, can more than one Lambda instance get different data from the same KDS shard concurrently?
    For example, suppose the shard holds d1, d2, d3, d4, d5 and d6, with BatchSize = 2 and ParallelizationFactor = 2. Lambda instance A consumes d1 and d2 while Lambda instance B consumes d3 and d4 at the same time. Once Lambda instance A finishes its first batch, it starts processing d5 and d6, and so on.

    [Image: Expected process flow]

  2. If the answer to Question 1 is yes, what might be sacrificed? (e.g. ordering within the same shard, or a record being processed more than once)

  3. If the answer to Question 1 is no, how can Lambda process data in a KDS shard concurrently?


Solution

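  • Yes. With ParallelizationFactor greater than 1, more than one Lambda invocation can process records from the same shard concurrently. Order is maintained per partition key, because records with the same partition key are never processed concurrently. Ordering across different partition keys within the shard is not guaranteed.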

    For example, let's say that your stream has two shards: Shard1 and Shard2.

    Scenario 1: all of your records share only two partition keys, PartitionKey1 and PartitionKey2, and the two keys hash to different shards. In this case all records with PartitionKey1 end up in Shard1 and all records with PartitionKey2 end up in Shard2. Setting ParallelizationFactor will not result in any records within a shard being processed concurrently, because records with the same partition key are processed in order.

    Scenario 2: your records have 20 different partition keys: PartitionKey1…PartitionKey20. Ideally Shard1 will contain around half of your records and Shard2 the other half (if the keys are evenly distributed across the two shards). Setting ParallelizationFactor in this case will result in records being processed concurrently: records within a shard that have different partition keys can be processed concurrently.
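The two scenarios can be illustrated with a toy model. This is only a sketch under a stated assumption: how Lambda actually assigns records to concurrent batches is internal to the service, so hashing each partition key to one of ParallelizationFactor "lanes" is an illustrative stand-in, and the names lane_for and split_into_lanes are hypothetical. The model does reproduce the documented guarantee that all records with the same partition key are processed in order.

```python
import hashlib
from collections import defaultdict


def lane_for(partition_key: str, parallelization_factor: int) -> int:
    """Map a partition key to a lane (assumed scheme, for illustration only)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % parallelization_factor


def split_into_lanes(records, parallelization_factor):
    """Group one shard's records into lanes, keeping arrival order inside each lane."""
    lanes = defaultdict(list)
    for partition_key, payload in records:
        lane = lane_for(partition_key, parallelization_factor)
        lanes[lane].append((partition_key, payload))
    return dict(lanes)


# Scenario 1: the shard holds a single partition key, so only one lane
# is ever used and nothing in this shard runs concurrently.
single_key_shard = [("PartitionKey1", "d1"), ("PartitionKey1", "d2"), ("PartitionKey1", "d3")]
scenario_1 = split_into_lanes(single_key_shard, 2)

# Scenario 2: the shard holds several partition keys, so records may be
# spread over up to two lanes, while each key stays in a single lane.
multi_key_shard = [
    ("k1", "d1"), ("k2", "d2"), ("k1", "d3"),
    ("k3", "d4"), ("k2", "d5"), ("k1", "d6"),
]
scenario_2 = split_into_lanes(multi_key_shard, 2)

for lane, batch in sorted(scenario_2.items()):
    print(lane, batch)
```

In this model the lanes can run in parallel, but a given key never appears in two lanes, which is why per-key order survives while shard-wide order does not.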