amazon-web-services .net-core aws-lambda amazon-sqs

AWS Lambda Performance Drops Heavily as the Number of Concurrent Instances Increases

Our .NET Core application contains a complex, CPU-bound algorithm that normally takes 2-3 minutes per execution. Right now we run it sequentially in a background service, so we manage only about 25 successful executions per hour, which is not enough when demand is high. Running it on multiple threads did not help because the job is so CPU-intensive; in fact, the results were even worse.

So I thought of using AWS Lambda. I created a Lambda function that executes the logic and is triggered by an Amazon SQS queue. Whenever I need to run the logic, a message is pushed to the queue, and Lambda picks it up and executes it.
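As a rough illustration of the flow described above, here is a minimal sketch of an SQS-triggered handler. It is written in Python for brevity rather than the .NET Core used in the question, and `run_algorithm` and the `job_id` field are hypothetical stand-ins for the real CPU-bound logic and message format. The `Records`/`body` layout is the shape of the event Lambda receives from an SQS trigger.

```python
import json


def run_algorithm(job):
    """Hypothetical placeholder for the CPU-bound algorithm."""
    return {"job_id": job["job_id"], "status": "done"}


def handler(event, context):
    """Process every SQS message delivered in this invocation."""
    results = []
    for record in event.get("Records", []):
        job = json.loads(record["body"])  # message body pushed by the producer
        results.append(run_algorithm(job))
    return results
```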

When there is only one request, Lambda also takes 2-3 minutes per execution, which is fine. I have set the Lambda timeout to 15 minutes just in case.

However, the problem starts when there is a large number of requests (e.g. 1000 within 5 minutes). As expected, Lambda increases the number of instances, but that eventually degrades the performance of all of them. In fact, almost none of them can complete the job within the 15-minute timeout.

So I presume all the parallel Lambda instances are spun up on one or a few machines where they share the same few CPUs, which eventually reproduces the condition I originally had with multiple threads. This is contrary to my original assumption that each instance gets its configured memory (512 MB allocated; it normally needs less than 180 MB) and adequate CPU of its own.

The package size is 15 MB. Since cold start time is not a big issue for me, I think provisioned concurrency wouldn't help either (not sure). Besides, it needs to be configured against a particular version, which would add a lot of hassle during subsequent deployments.

I hope the problem is clear. Has anyone come across something like this, or does anyone know how to get around it?

Thanks.

Solution

Based on the description of the problem, the bottleneck is probably not in Lambda or SQS. The root cause could be the data access layer. Adding more parallel workers (threads or Lambda instances) that access the data only puts more load on the data access layer, which in turn reduces performance.
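To make this reasoning concrete, here is a toy model, not a measurement. It assumes each job needs a fixed amount of CPU time plus a fixed number of database queries, and that the database serves a fixed number of queries per second, split evenly among all concurrent jobs. Every number in it is hypothetical.

```python
def job_duration_seconds(concurrent_jobs, cpu_seconds=120.0,
                         queries_per_job=6000, db_queries_per_second=100.0):
    """Estimate the duration of one job when `concurrent_jobs` run at once.

    Assumption: CPU work and database waits do not overlap, and the
    database capacity is split evenly among the concurrent jobs.
    """
    per_job_rate = db_queries_per_second / concurrent_jobs
    db_wait_seconds = queries_per_job / per_job_rate
    return cpu_seconds + db_wait_seconds


for n in (1, 10, 100):
    minutes = job_duration_seconds(n) / 60
    print(f"{n:>3} concurrent jobs -> {minutes:.1f} min per job")
```

With these made-up numbers a single job takes 3 minutes, 10 concurrent jobs take 12 minutes each, and at 100 concurrent jobs each one would need 102 minutes, far past the 15-minute Lambda timeout.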

Here are some possible ways to improve the performance of the data access layer:

Add a cache in front of the database to handle read requests.
Increase the memory or upgrade the machine type of the database server.
Move database storage to a high-performance SSD volume.
Add read replicas for the database and direct all read requests to them.
Switch to Amazon Aurora, which AWS advertises as offering up to 5x the throughput of standard MySQL.
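
The first suggestion, a cache in front of the database, can be sketched as a read-through cache. This is a minimal in-process illustration in Python; `load_from_database` is a hypothetical stand-in for a real query, and the 60-second TTL is an arbitrary choice. Because each Lambda instance has its own memory, a real setup would more likely need a cache shared across instances.

```python
import time


class ReadThroughCache:
    """Serve reads from memory and fall back to a loader on a miss."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no database call
        value = self._loader(key)  # cache miss or expired: query the backing store
        self._entries[key] = (now + self._ttl, value)
        return value


db_calls = []


def load_from_database(key):
    """Hypothetical stand-in for a real database read."""
    db_calls.append(key)
    return f"row-for-{key}"


cache = ReadThroughCache(load_from_database)
cache.get("customer-42")
cache.get("customer-42")
print(len(db_calls))  # the second read is served from the cache
```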