I'm using boto3 (the AWS SDK for Python) to analyze a PDF document and extract its form key:value pairs.
import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()

    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(
        Document={'S3Object': {'Bucket': bucket, 'Name': document}},
        FeatureTypes=["FORMS"])

process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')
I followed the AWS documentation for AnalyzeDocument, but when I run my function I get this error:
botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format
Am I missing something?
AnalyzeDocument is a synchronous API that only supports PNG or JPG images.
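Since you want to work with PDF files, you'll need to use the Amazon Textract asynchronous API instead, e.g. StartDocumentAnalysis (for forms and tables) or StartDocumentTextDetection (for plain text). These operations return a job ID, and you fetch the results afterwards with the matching Get operation (GetDocumentAnalysis or GetDocumentTextDetection).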
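As a rough sketch of what the asynchronous flow could look like with boto3: start the job with start_document_analysis, poll get_document_analysis until the job leaves IN_PROGRESS, then follow NextToken to collect every page of results. The function name analyze_pdf_forms and the client, poll_seconds and max_polls parameters are my own illustrative choices, not part of the Textract API, and simple polling is only one option (Textract can also notify an SNS topic when the job completes).

```python
import time


def analyze_pdf_forms(bucket, document, client=None, poll_seconds=5, max_polls=120):
    """Run an asynchronous Textract FORMS analysis and return all result blocks."""
    if client is None:
        # Imported lazily so a stub client can be passed in for testing.
        import boto3
        client = boto3.client('textract')

    # Start the asynchronous job; Textract reads the PDF directly from S3.
    job = client.start_document_analysis(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': document}},
        FeatureTypes=['FORMS'],
    )
    job_id = job['JobId']

    # Poll until the job is no longer IN_PROGRESS (or we give up).
    for _ in range(max_polls):
        response = client.get_document_analysis(JobId=job_id)
        if response['JobStatus'] != 'IN_PROGRESS':
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f'Textract job {job_id} did not finish in time')

    # Anything other than SUCCEEDED (e.g. FAILED) is treated as an error here.
    if response['JobStatus'] != 'SUCCEEDED':
        raise RuntimeError(
            f"Textract job {job_id} ended with status {response['JobStatus']}"
        )

    # Results are paginated; keep requesting pages while a NextToken is returned.
    blocks = list(response['Blocks'])
    while response.get('NextToken'):
        response = client.get_document_analysis(
            JobId=job_id, NextToken=response['NextToken']
        )
        blocks.extend(response['Blocks'])
    return blocks
```

With your bucket and file this would be called as analyze_pdf_forms('francismorgan-01', '709 Privado M SURESTE.pdf'). The form fields are the returned blocks whose BlockType is KEY_VALUE_SET. Note that you no longer need the s3_object.get() call, since Textract fetches the document from S3 itself.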