Search code examples
amazon-web-servicesamazon-s3boto3amazon-textract

Not able to retrieve processed file from S3 Bucket


I'm an AWS newbie trying to use Textract API, their OCR service. As far as I understood I need to upload files to a S3 bucket and then run textract on it.

I got the bucket on and the file inside it: enter image description here

I got the permissions: enter image description here

But when I run my code it bugs.

        import boto3
        import trp

        # Document
        s3BucketName = "textract-console-us-east-1-057eddde-3f44-45c5-9208-fec27f9f6420"
        documentName = "ok0001_prioridade01_x45f3.pdf"
]\[\[""
        # Amazon Textract client
        textract = boto3.client('textract',region_name="us-east-1",aws_access_key_id="xxxxxx",
                                aws_secret_access_key="xxxxxxxxx")

        # Call Amazon Textract
        response = textract.analyze_document(
            Document={
                'S3Object': {
                    'Bucket': s3BucketName,
                    'Name': documentName
                }
            },
            FeatureTypes=["TABLES"])

Here is the error I get:

botocore.errorfactory.InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the AnalyzeDocument operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.

What am I missing? How could I solve that?


Solution

  • You are missing S3 access policy, you should add AmazonS3ReadOnlyAccess policy if you want a quick solution according to your needs.

    A good practice is to apply the least privilege access principle and keep granting access when needed. So I'd advice you to create a specific policy to access your S3 bucket textract-console-us-east-1-057eddde-3f44-45c5-9208-fec27f9f6420 only and only in us-east-1 region.