python, amazon-web-services, aws-lambda, amazon-textract

AWS start document analysis using textract not working


I am doing a school project in which I need to run document analysis on a form using Textract and pass the output to A2I, where the algorithm will determine whether the form is approved, rejected, or needs review. The Textract Lambda function should be triggered when a document is uploaded to S3. However, I am getting syntax errors when I follow this documentation: https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentAnalysis.html

My code is:

import urllib.parse
import boto3

print('Loading function')

##Clients
s3 = boto3.client('s3')
textract = boto3.client('textract')

def analyzedata(bucketName,documentKey):
    print("Loading")
    AnalyzedData= textract.StartDocumentAnalysis("DocumentLocation": { 
      "S3Object": { 
         "Bucket": "bucketName",
         "Name": "documentKey",
      })
    detectedText = ''

    # Print detected text
    for item in AnalyzedData['Blocks']:
        if item['BlockType'] == 'LINE':
            detectedText += item['Text'] + '\n'
            
    return detectedText
      
def writeTextractToS3File(textractData, bucketName, createdS3Document):
    print('Loading writeTextractToS3File')
    generateFilePath = os.path.splitext(createdS3Document)[0] + '.csv'
    s3.put_object(Body=textractData, Bucket=bucketName, Key=generateFilePath)
    print('Generated ' + generateFilePath)





def lambda_handler(event, context):
    #print("Received event: " + json.dumps(event, indent=2))

    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        detectedText = analyzedata(bucket, key)
        writeTextractToS3File(detectedText, bucket, key)
        
        return 'Processing Done!'
        
        
        
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

The code is not yet complete, but I am already getting syntax errors:

{
  "errorMessage": "Syntax error in module 'lambda_function': invalid syntax (lambda_function.py, line 13)",
  "errorType": "Runtime.UserCodeSyntaxError",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\" Line 13\n        AnalyzedData= textract.Start_Document_Analysis(\"DocumentLocation\": { \n"
  ]
}

Solution

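  • boto3 client methods are snake_case and take keyword arguments, so according to the boto3 docs the call should look more like:

    AnalyzedData = textract.start_document_analysis(
        DocumentLocation={
            "S3Object": {
                "Bucket": bucketName,
                "Name": documentKey,
            }
        },
        FeatureTypes=["FORMS"],
    )

    Note that bucketName and documentKey are passed as variables rather than quoted string literals, and that every opening brace needs its matching close. Also note that the FeatureTypes parameter is listed as required (valid values include "TABLES" and "FORMS").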
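
    One further point worth checking: start_document_analysis is asynchronous. It returns a JobId rather than the analysis itself, so the loop over AnalyzedData['Blocks'] in the question would fail even after the syntax is fixed. The results have to be fetched with get_document_analysis. Below is a minimal sketch of how that could look, not a definitive implementation. The helper names (analyze_data, collect_lines), the poll_seconds parameter, and the choice of FeatureTypes=["FORMS"] are assumptions made for illustration, and the client is passed in as an argument only to keep the function easy to test.

```python
import time


def collect_lines(blocks):
    """Join the text of every LINE block, one line of output per block."""
    return "".join(b["Text"] + "\n" for b in blocks if b["BlockType"] == "LINE")


def analyze_data(textract, bucket_name, document_key, poll_seconds=5):
    """Start an asynchronous analysis job and return the detected text.

    `textract` is expected to be a boto3 Textract client, for example
    boto3.client("textract").
    """
    job = textract.start_document_analysis(
        DocumentLocation={
            "S3Object": {"Bucket": bucket_name, "Name": document_key}
        },
        FeatureTypes=["FORMS"],  # assumption: form key/value analysis is wanted
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    # Assumption: anything other than SUCCEEDED is treated as a failure here.
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError("Textract job ended with status " + result["JobStatus"])

    # Results can be paginated; follow NextToken until it is absent.
    blocks = list(result["Blocks"])
    while "NextToken" in result:
        result = textract.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result["Blocks"])

    return collect_lines(blocks)
```

    In the Lambda handler this would be called as analyze_data(textract, bucket, key), using the textract client already created at module level. Polling inside a Lambda counts against the function timeout, so for large documents the NotificationChannel parameter (SNS) may be a better fit. For single-page images, the synchronous analyze_document call returns Blocks directly. Separately, writeTextractToS3File uses os.path.splitext, so the module also needs import os.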