Background: I'm running a .NET 6 API on AWS Lambda behind an API Gateway. It connects to an Aurora MySQL RDS cluster in the same VPC using EF Core.
Problem: Yesterday I deployed a new version to the Lambda. Before deployment, all endpoints responded within 300-400ms. After deployment, roughly 1 in 3 requests became slow, typically taking 15s to return a response. The CloudWatch logs show that the 15s delay occurs when opening the connection to RDS. To rule out an issue with the code/deployment, I rolled back to the previous version, yet the same issue persisted. This tells me the cause is an external factor related to AWS.
Has anyone else faced this issue? Any pointers on what else I can check?
Dotnet connection code connecting to RDS
//Lambda Entry point: Amazon.Lambda.AspNetCoreServer.APIGatewayProxyFunction
//Startup.cs->ConfigureServices:
ConnectionStrings connectionStrings = paramStore.GetConnections("Conn");
// Connection string to RDS: "Server=abcd-aurora-cluster.cluster-abcdef.us-east-1.rds.amazonaws.com,3306;Database=xyz;uid=abc;pwd=def;Convert Zero Datetime=True;old Guids=true"
_ = services.AddDbContext<ApplicationDbContext>(options =>
{
    _ = options.UseMySQL(connectionStrings.WriterConnectionName)
        .AddXRayInterceptor(collectSqlQueries: true);
}, ServiceLifetime.Scoped);
// Using MySql.EntityFrameworkCore(7.0.0)
What I tried: I verified that the connection string/params have not changed. The DB version is the same before and after. I have not pushed any infra-level/network changes either. I'm facing this issue on multiple AWS accounts (prod/dev, where we deployed yesterday), yet an account that was last deployed 2 days ago (with the same code build) does not have the issue.
I already faced this issue 2 months ago. At that point, I determined that the AWS .NET Lambda runtime had auto-upgraded itself on deployment, and the problem appeared when moving from dotnet:6.v25 to dotnet:6.v26. So I set the runtime update mode to Manual to avoid this disruption in future.
Yet the same issue is now occurring post deployment while the runtime is still set to Manual and on dotnet:6.v25. I still tried both downgrading the runtime (dotnet:6.v19) and upgrading it (dotnet:6.v32). Both runtimes face the same issue.
I have opened an AWS support ticket, but last time they could not identify the issue and I figured it out a week later on my own.
Appreciate any help/tips here. Thanks all!
We were able to resolve this with some trial and error. AWS Support did not help in time.
Solution 1: 2 months ago, the issue occurred when the AWS .NET Lambda runtime auto-upgraded from dotnet:6.v25 to dotnet:6.v26. Based on the docs, this upgrade is rolled out in 2 phases: first on function create/update, then a week later for Lambdas that have not yet been upgraded.
So, I set Runtime management > Update runtime version to Manual to avoid this disruption in future. AWS does not maintain a list of these runtimes, so we had to go back through our logs and retrieve the ARN for the right runtime. This kept us stable from Feb to May 2024.
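For anyone who needs to dig the runtime ARN out of their logs the same way, here is a rough sketch of pulling it from an INIT_START log line. It assumes the line carries "Runtime Version:" and "Runtime Version ARN:" fields in that order, so check the pattern against your own CloudWatch output first. The class name, the sample line and the ARN suffix are placeholders I made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RuntimeVersionLog
{
    // Assumed layout: "INIT_START Runtime Version: <version> Runtime Version ARN: <arn>".
    private static readonly Regex InitStart = new Regex(
        @"INIT_START\s+Runtime Version:\s*(?<version>\S+)\s+Runtime Version ARN:\s*(?<arn>\S+)",
        RegexOptions.Compiled);

    // Returns the version and ARN, or null when the line is not an INIT_START line.
    public static (string Version, string Arn)? Parse(string logLine)
    {
        Match m = InitStart.Match(logLine);
        if (!m.Success) return null;
        return (m.Groups["version"].Value, m.Groups["arn"].Value);
    }

    public static void Main()
    {
        // Hypothetical log line; the ARN suffix is a placeholder, not a real runtime.
        string line = "INIT_START Runtime Version: dotnet:6.v25\t" +
                      "Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:EXAMPLE";
        var parsed = Parse(line);
        Console.WriteLine(parsed.HasValue
            ? $"{parsed.Value.Version} -> {parsed.Value.Arn}"
            : "no match");
    }
}
```

The ARN recovered this way is the value that goes into the Manual runtime setting.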
Solution 2: Despite pinning the runtime, we still faced the issue on 9-May-2024: the same lag in opening a connection from Lambda to Aurora RDS in the same VPC private subnet. After some experimentation, we found 2 things which helped.
The first was adding ;ConnectionTimeout=2; to the connection string. ConnectionTimeout is the length of time (in seconds) to wait for a connection to the server before terminating the attempt and generating an error. After this change, most of the slow API responses started trending around 2s. Since we also have a connection failure retry mechanism, a few responses took 7-8s, but this was much lower than before, when every slow API response trended around 15s.
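If you would rather not hand-edit the connection string stored in your parameter store, one option is to apply the timeout in code before passing the string to UseMySQL. This is a minimal sketch using the generic DbConnectionStringBuilder from System.Data.Common. The helper name WithConnectionTimeout and the sample values are hypothetical, and the builder may normalise key casing or quoting, which should not matter since the keywords are treated case-insensitively as far as I know.

```csharp
using System;
using System.Data.Common;

public static class ConnectionStringHelper
{
    // Returns a copy of the connection string with ConnectionTimeout set (in seconds).
    public static string WithConnectionTimeout(string connectionString, int seconds)
    {
        var builder = new DbConnectionStringBuilder { ConnectionString = connectionString };
        builder["ConnectionTimeout"] = seconds; // adds the key or overwrites an existing one
        return builder.ConnectionString;
    }

    public static void Main()
    {
        // Placeholder values, not a real endpoint or credentials.
        string original = "Server=example-host;Database=xyz;uid=abc;pwd=def";
        Console.WriteLine(WithConnectionTimeout(original, 2));
    }
}
```

In Startup.cs this would wrap the value passed to UseMySQL, e.g. options.UseMySQL(ConnectionStringHelper.WithConnectionTimeout(connectionStrings.WriterConnectionName, 2)). Keep in mind that a short timeout only makes the failure fast. It relies on a retry to actually succeed.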