I have an Amazon Linux EC2 instance on which I have built a Docker image. The Dockerfile contents are:
FROM public.ecr.aws/lambda/python:3.8
RUN yum install -y gcc-c++ python3-devel
RUN pip install torch torchvision
# Copy function code
COPY app.py ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]
and app.py contents are:
def handler(event, context):
    print("INSIDE THE HANDLER")
    import torch
    print("PYTORCH WORKING")
    import torchvision
    print("VISION ALSO WORKING")
    return 'Execution completed'
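(As an aside, the in-handler import pattern above can be generalized into a small probe helper so that each heavy import either succeeds or reports its error in the function logs, instead of failing at container init. This is just an illustrative sketch; `probe_import` is a hypothetical helper name, not anything from the Lambda runtime.)

```python
import importlib
import time

def probe_import(module_name):
    """Try to import a module, returning (ok, seconds, error_message)."""
    start = time.time()
    try:
        importlib.import_module(module_name)
        return True, time.time() - start, None
    except Exception as exc:  # e.g. the OSError raised when a CUDA .so fails to map
        return False, time.time() - start, str(exc)

def handler(event, context):
    # Probe each heavy dependency so a failure shows up clearly in the logs
    for mod in ("torch", "torchvision"):
        ok, secs, err = probe_import(mod)
        print(f"{mod}: {'OK' if ok else 'FAILED'} in {secs:.2f}s{(' - ' + err) if err else ''}")
    return 'Execution completed'
```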
I ran the following commands:
docker build -t shukhapriya .
docker images
docker run -p 9000:8080 shukhapriya
When I invoked the function through the AWS Lambda Runtime Interface Emulator (RIE) with
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
the output was the following, which shows my Lambda function runs correctly inside the Docker container:
START RequestId: 855590bc-a58f-4c26-a203-9946e0cf5b51 Version: $LATEST
INSIDE THE HANDLER
PYTORCH WORKING
VISION ALSO WORKING
END RequestId: 855590bc-a58f-4c26-a203-9946e0cf5b51
REPORT RequestId: 855590bc-a58f-4c26-a203-9946e0cf5b51 Duration: 1.19 ms Billed Duration: 2 ms Memory Size: 3008 MB Max Memory Used: 3008 MB
I push the above Docker image to ECR and everything works smoothly:
docker tag shukhapriya 71xxxxx41665.dkr.ecr.us-east-1.amazonaws.com/ml-libraries:latest
docker push 71xxxxx1665.dkr.ecr.us-east-1.amazonaws.com/ml-libraries:latest
However, when I create the Lambda function, there's an error during import. To create the Lambda function I go into the console, choose the image URI, and don't set any overrides such as ENTRYPOINT override, CMD override, or WORKDIR override (as I have already set these in the Dockerfile). My architecture is the same as that of the Amazon Linux machine, x86_64, so Docker compatibility is not an issue (in fact I tried it from macOS too but realised the architectures were different).
When I test the Lambda function I get the following error; it looks like a CUDA library failed to load.
START RequestId: 2389ffa0-f4b0-4da6-901c-7c300e71c760 Version: $LATEST
INSIDE THE HANDLER
[ERROR] OSError: /var/lang/lib/python3.8/site-packages/nvidia/cufft/lib/libcufft.so.10: failed to map segment from shared object
Traceback (most recent call last):
  File "/var/task/app.py", line 3, in handler
    import torch
  File "/var/lang/lib/python3.8/site-packages/torch/__init__.py", line 228, in <module>
    _load_global_deps()
  File "/var/lang/lib/python3.8/site-packages/torch/__init__.py", line 189, in _load_global_deps
    _preload_cuda_deps(lib_folder, lib_name)
  File "/var/lang/lib/python3.8/site-packages/torch/__init__.py", line 155, in _preload_cuda_deps
    ctypes.CDLL(lib_path)
  File "/var/lang/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
END RequestId: 2389ffa0-f4b0-4da6-901c-7c300e71c760
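("failed to map segment from shared object" generally means the dynamic loader could not mmap the large CUDA shared library within the memory available to the process; the CUDA-enabled torch wheels are far larger than a small Lambda memory setting can hold. A minimal diagnostic sketch, assuming a Linux environment like the Lambda runtime; the helper name is illustrative:)

```python
def available_memory_kb():
    """Return MemAvailable from /proc/meminfo in kB, or None if not found."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                # Line looks like: "MemAvailable:  123456 kB"
                return int(line.split()[1])
    return None
```

Printing this at the top of the handler shows how little headroom the function actually has before the import is attempted.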
If I replace the pip install line in my Dockerfile with pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cpu
I get another error (after the function times out):
START RequestId: bfc8a02c-eaa9-4729-8230-809c8008e503 Version: $LATEST
INSIDE THE HANDLER
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
2023-04-01T10:55:20.995Z bfc8a02c-eaa9-4729-8230-809c8008e503 Task timed out after 90.37 seconds
END RequestId: bfc8a02c-eaa9-4729-8230-809c8008e503
REPORT RequestId: bfc8a02c-eaa9-4729-8230-809c8008e503 Duration: 90365.13 ms Billed Duration: 90600 ms Memory Size: 128 MB Max Memory Used: 128 MB Init Duration: 234.39 ms
Edit: Trying the same procedure with another library, numpy, instead of pytorch, the code ran successfully. So this looks like an issue specific to pytorch.
I would be grateful for any help; I have been stuck on this for some days now.
Although I was not able to solve the issue when installing pytorch with CUDA, I was able to solve it with the CPU-only installation of pytorch.
I had to increase the Lambda function's memory to 1280 MB, and it worked.
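For reference, the working setup is roughly the following CPU-only Dockerfile (assembled from the pieces above; a sketch, not a separately verified build), combined with raising the function memory to 1280 MB in the console:

```dockerfile
FROM public.ecr.aws/lambda/python:3.8
RUN yum install -y gcc-c++ python3-devel
# CPU-only wheels avoid pulling in the CUDA shared libraries that fail to map
RUN pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cpu
# Copy function code
COPY app.py ${LAMBDA_TASK_ROOT}
CMD [ "app.handler" ]
```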