
Serverless (framework) deployment "randomly" deploys broken functions, module not found


Background

We run a large application on top of the Serverless framework and keep growing our modules and user base. All our backend code lives in a monorepo, is built from 29 Serverless services (microservices), and exposes close to 300 REST and WebSocket endpoints.

We deploy from within GitHub Actions and had no issues until recently (the last few weeks to months).

Unfortunately our code never gets invoked in the failure case, which also makes the error invisible to our own telemetry system.

Actual Problem

When sls deploy is executed from GitHub Actions, "randomly" one or two functions seem to get deployed successfully according to the log files, but when we call them a Cannot find module error is logged.

The error is not related to our imports: as we can see from the stack trace below, it is the handler module of the Serverless function itself that cannot be found in the function bundle.

2023-10-25T09:52:03.202Z    undefined   ERROR   Uncaught Exception  
{
    "errorType": "Runtime.ImportModuleError",
    "errorMessage": "Error: Cannot find module 's_putAnsweredQuestionnaire'\nRequire stack:\n- /var/runtime/UserFunction.js\n- /var/runtime/Runtime.js\n- /var/runtime/index.js",
    "stack": [
        "Runtime.ImportModuleError: Error: Cannot find module 's_putAnsweredQuestionnaire'",
        "Require stack:",
        "- /var/runtime/UserFunction.js",
        "- /var/runtime/Runtime.js",
        "- /var/runtime/index.js",
        "    at _loadUserApp (/var/runtime/UserFunction.js:225:13)",
        "    at Object.module.exports.load (/var/runtime/UserFunction.js:300:17)",
        "    at Object.<anonymous> (/var/runtime/index.js:43:34)",
        "    at Module._compile (internal/modules/cjs/loader.js:1114:14)",
        "    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1143:10)",
        "    at Module.load (internal/modules/cjs/loader.js:979:32)",
        "    at Function.Module._load (internal/modules/cjs/loader.js:819:12)",
        "    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:75:12)",
        "    at internal/main/run_main_module.js:17:47"
    ]
}
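The error is easier to interpret once you know how the Node.js Lambda runtime resolves a handler: a handler string such as s_putAnsweredQuestionnaire.handler makes the runtime require() the file s_putAnsweredQuestionnaire.js from the bundle root, so Runtime.ImportModuleError means that file is simply absent from the deployed zip. As a minimal sketch (Python for portability; the artifact path and handler name below are assumptions based on our error), one could verify a packaged artifact before deploying it:

```python
import zipfile

def handler_file(handler: str) -> str:
    """Map a Lambda handler string like 'myfile.export' to the .js file
    the Node.js runtime will require() from the bundle root."""
    module, _, _export = handler.rpartition(".")
    return module + ".js"

def handler_in_bundle(zip_path: str, handler: str) -> bool:
    """Return True if the handler's module file is present in the zip."""
    with zipfile.ZipFile(zip_path) as bundle:
        return handler_file(handler) in bundle.namelist()

# Hypothetical usage against the artifact produced by 'sls package':
# handler_in_bundle(".serverless/my-service.zip",
#                   "s_putAnsweredQuestionnaire.handler")
```

A check like this could run as a CI step between packaging and deployment to catch a broken bundle before it ever reaches AWS.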

Details and Versions

sls --version
Running "serverless" from node_modules
Framework Core: 3.22.0 (local) 3.21.0 (global)
Plugin: 6.2.2
SDK: 4.3.2

Target: node14 (still)

Plugins:

  • serverless-esbuild
  • serverless-domain-manager
  • serverless-plugin-log-retention
  • serverless-prune-plugin

The GitHub Actions workflow is also running on Node 14.

What we checked

The logs in GitHub Actions do not show any related errors. In fact, GitHub Actions would fail on any deploy problem, and CloudFormation should (and in the past did) roll back a failed stack (service).

The CloudFormation logs for a broken deployment do not show any errors whatsoever. It looks like the stacks deployed just as usual.

Deployment from developer machines: if we redeploy a broken service from a developer machine (check out the tag and run sls deploy with the correct stage parameters etc.), it works 100% of the time. This is basically the recovery routine we use right now.
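For reference, the manual recovery routine can be sketched as a small helper that builds the commands we run by hand; the tag, service path, and stage values below are hypothetical placeholders for our actual setup:

```python
def recovery_commands(tag: str, service_dir: str, stage: str) -> list:
    """Build the shell commands of our manual recovery routine:
    check out the release tag and redeploy the one broken service."""
    return [
        "git checkout " + tag,
        "cd " + service_dir,
        "npx sls deploy --stage " + stage,
    ]

# Example with placeholder values:
for cmd in recovery_commands("v1.2.3", "services/questionnaire", "prod"):
    print(cmd)
```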

The Serverless Dashboard also does not show any failed deployments around the affected services.

Questions

Has someone experienced the same issue?

Given the size of our application (30 services, 300 functions) could we have hit some limitations within AWS that are just not apparent to us?

Where can we continue looking?


Solution

  • This problem was resolved by moving to the now-current 3.38.0 version of the Serverless framework and, in parallel, to Node 16.20.0. This of course required moving to the nodejs16.x runtime target in Serverless.

    Since these upgrades, deployments are stable again.

    Unfortunately we were not able to move to the nodejs18.x target because of our aws-sdk dependencies: the nodejs18.x runtime bundles AWS SDK v3 instead of v2, so using Node 18 requires migrating to AWS SDK v3.
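To keep CI and developer machines on the same toolchain after such an upgrade, it can help to pin the versions in package.json; this is a sketch with the versions from our fix (the exact ranges are a choice, not prescribed by Serverless):

```json
{
  "engines": {
    "node": ">=16.20.0 <17"
  },
  "devDependencies": {
    "serverless": "3.38.0"
  }
}
```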