python boto3 amazon-textract

Python-Textract-Boto3 - Trying to pass result of a method call as an argument to the same method, and loop


I have a multi-page PDF on AWS S3 and am using Textract to extract all of its text. The response comes back in batches: the first response includes a 'NextToken' that I need to pass as an argument to the next get_document_analysis call.

How do I avoid running get_document_analysis by hand each time, manually pasting in the NextToken value received from the previous run?

Here's an attempt:

import boto3

client = boto3.client('textract')

# Get my JobId (stored under the name the loop below uses)
my_job_id_ref = client.start_document_text_detection(DocumentLocation = {'S3Object': {'Bucket':'myawsbucket', 'Name':'mymuli-page-pdf-file.pdf'}})['JobId']

def my_output():
    my_ls = []
    
    # I need to repeat the following call until the break condition further below
    while True: 
        
        # This returns a dictionary with a key named NextToken, whose value needs to be passed as an arg to the next call
        x=client.get_document_analysis(JobId = my_job_id_ref) 
        
        # Assigning the value of NextToken to a variable
        next_token = x['NextToken'] 
        
        #Running the function again, this time with the next_token passed as an argument.
        x=client.get_document_analysis(JobId = my_job_id_ref, NextToken = next_token)
        
        # Need to repeat the running of the function until there is no token. The token is normally a string, hence
        if len(next_token) <1:
            break
        
        my_ls.append(x)
        
    return my_ls


Solution

  • The trick is to use the while condition to check whether NextToken is empty.

    my_ls = []

    # Get the analysis once to see if there is a need to loop in the first place
    x = client.get_document_analysis(JobId=my_job_id_ref)
    next_token = x.get('NextToken')
    my_ls.append(x)

    # Now repeat until we have the last page, passing the previous token each time
    while next_token is not None:
        x = client.get_document_analysis(JobId=my_job_id_ref, NextToken=next_token)
        next_token = x.get('NextToken')
        my_ls.append(x)
    
    

    The value of next_token is overwritten on every iteration until it is None, at which point the loop ends.

    Note that I'm using x.get(..) to check whether the response dictionary contains NextToken. It may not be set at all (for example, when all results fit in a single response), in which case .get(..) returns None. (x["NextToken"] would raise a KeyError if NextToken is not set.)
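
If you'd rather not write the first call separately, the same idea can be wrapped in one helper that only adds NextToken to the call once a token exists. This is a minimal sketch, not the only way to do it: the function name collect_all_pages and the kwargs approach are my own choices for illustration, and it assumes the job has already finished.

```python
def collect_all_pages(client, job_id):
    """Collect every response page for a finished Textract analysis job.

    `client` is assumed to be a boto3 Textract client (or anything with a
    compatible get_document_analysis method); `job_id` is the JobId string.
    """
    pages = []
    next_token = None

    while True:
        # Only send NextToken once we actually have one from a previous page
        kwargs = {"JobId": job_id}
        if next_token is not None:
            kwargs["NextToken"] = next_token

        response = client.get_document_analysis(**kwargs)
        pages.append(response)

        # The last page has no NextToken key, so .get() returns None
        next_token = response.get("NextToken")
        if next_token is None:
            break

    return pages
```

You would call it as collect_all_pages(client, my_job_id_ref). Two caveats: the responses also carry a JobStatus field, and results are only complete once it reports SUCCEEDED, so you may need to poll before collecting. Also, a job started with start_document_text_detection is normally read back with get_document_text_detection, which returns NextToken in the same way, so the loop carries over with only the method name changed.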