amazon-web-services amazon-s3 aws-lambda amazon-textract

Textract cannot read object from S3 when running on Lambda


I have a simple Lambda function that should invoke Textract when files are uploaded to S3. The call to Textract works properly when I run the Lambda function from my desktop, but fails when I run the exact same code in the Lambda environment.

This is the Lambda code:

import os
import boto3

TEXTRACT_CLIENT = boto3.client('textract', region_name=os.environ['REGION'])


def lambda_handler(event, context):
    event_source = event['Records'][0]['s3']

    textract_ticket = TEXTRACT_CLIENT.start_document_analysis(
        DocumentLocation={
            'S3Object': {
                'Bucket': os.environ['REQUESTS_BUCKET'],
                'Name': event_source['object']['key']
            }
        },
        FeatureTypes=["TABLES", "FORMS"],
        NotificationChannel={
            'RoleArn': os.environ['TEXTRACT_ROLE_ARN'],
            'SNSTopicArn': os.environ['SNS_TOPIC_ARN']
        },
        OutputConfig={
            'S3Bucket': os.environ['RESULTS_BUCKET']
        }
    )

    return {
        'statusCode': 200,
        'JobId': textract_ticket['JobId']
    }

There is nothing special about the code. I'm using exactly the same values for all the environment variables both in the Lambda environment and on my local machine. In both cases I'm using the same event, pointing to the same S3 object:

{
  "Records": [
    {
      "eventVersion": "2.0",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventTime": "1970-01-01T00:00:00.000Z",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {
        "principalId": "EXAMPLE"
      },
      "requestParameters": {
        "sourceIPAddress": "127.0.0.1"
      },
      "responseElements": {
        "x-amz-request-id": "EXAMPLE123456789",
        "x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH"
      },
      "s3": {
        "s3SchemaVersion": "1.0",
        "configurationId": "testConfigRule",
        "bucket": {
          "name": "my-bucket",
          "ownerIdentity": {
            "principalId": "EXAMPLE"
          },
          "arn": "arn:aws:s3:::example-bucket"
        },
        "object": {
          "key": "35264254-7aa6-4f24-815a-f73e1671f151.pdf",
          "size": 1024,
          "eTag": "0123456789abcdef0123456789abcdef",
          "sequencer": "0A1B2C3D4E5F678901"
        }
      }
    }
  ]
}

Oddly, all this produces a successful execution when invoked from my desktop, but when I run it from Lambda I get:

{
  "errorMessage": "An error occurred (InvalidS3ObjectException) when calling the StartDocumentAnalysis operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.",
  "errorType": "InvalidS3ObjectException",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 10, in lambda_handler\n    textract_ticket = TEXTRACT_CLIENT.start_document_analysis(\n",
    "  File \"/var/runtime/botocore/client.py\", line 386, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n",
    "  File \"/var/runtime/botocore/client.py\", line 705, in _make_api_call\n    raise error_class(parsed_response, operation_name)\n"
  ]
}

Am I missing something here? I can't figure out what can be wrong in the Lambda environment.


Solution

  • Both @Ronan Cunningham's and @stijndepestel's intuitions were correct.

    I had confused the roles. Two roles are involved in this example: the Lambda execution role and the role passed to Textract as RoleArn in NotificationChannel. I mistakenly thought the Textract role (which had full S3 access) was used for the whole job, but it is only used for sending SNS notifications. Textract actually reads the object under the same role that is assigned to the Lambda function, and that role had no S3 access permissions. After adding S3 access permissions to the Lambda role, everything worked as expected.

    Thank you guys!
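
For anyone hitting the same error, here is a minimal sketch of the kind of S3 permissions the Lambda role might need. The helper names (build_s3_policy, attach_policy), the policy name and the choice of an inline policy are my own assumptions for illustration, not what the original setup used. The s3:PutObject statement is also an assumption: it is included because the code sets OutputConfig to a results bucket, so check what your job actually requires.

```python
import json


def build_s3_policy(requests_bucket, results_bucket):
    """Build an IAM policy document (as a dict) for the Lambda execution role.

    Hypothetical helper: the bucket names are whatever REQUESTS_BUCKET and
    RESULTS_BUCKET are set to in the Lambda environment.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Textract reads the input document with the caller's
                # credentials, i.e. the Lambda execution role.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{requests_bucket}/*",
            },
            {
                # Assumption: write access for the OutputConfig bucket.
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{results_bucket}/*",
            },
        ],
    }


def attach_policy(role_name, policy_document, policy_name="textract-s3-access"):
    """Attach the document to the given role as an inline policy."""
    # Imported here so build_s3_policy can be used without boto3 installed.
    import boto3

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy_document),
    )
```

Calling attach_policy with the Lambda role name and the output of build_s3_policy would apply it; the same JSON can just as well be pasted into the IAM console or a deployment template. Scoping the statements to the two buckets is narrower than granting full S3 access.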