I have a multi-page PDF on AWS S3 and am using Textract to extract all of its text. The response comes back in batches: the first response provides a 'NextToken' that I need to pass as an argument to the next call of the get_document_analysis method.
How do I avoid re-running get_document_analysis by hand each time, pasting in the NextToken value received from the previous run?
Here's an attempt:
import boto3
client = boto3.client('textract')
# Get my JobId
my_job_id_ref = client.start_document_text_detection(DocumentLocation = {'S3Object': {'Bucket':'myawsbucket', 'Name':'mymuli-page-pdf-file.pdf'}})['JobId']
def my_output():
    my_ls = []
    # I need to repeat the following call until the break condition further below
    while True:
        # This returns a dictionary with a key named NextToken, whose value needs to be passed as an arg to the next call
        x=client.get_document_analysis(JobId = my_job_id_ref)
        # Assigning the value of NextToken to a variable
        next_token = x['NextToken']
        # Running the call again, this time with next_token passed as an argument
        x=client.get_document_analysis(JobId = my_job_id_ref, NextToken = next_token)
        # Need to repeat the call until there is no token. The token is normally a string, hence
        if len(next_token) <1:
            break
        my_ls.append(x)
    return my_ls
The trick is to use the while-condition to check whether the NextToken is empty.
my_ls = []
# Get the analysis once to see if there is a need to loop in the first place
x = client.get_document_analysis(JobId = my_job_id_ref)
next_token = x.get('NextToken')
my_ls.append(x)
# Now repeat until we have the last page, passing the token from the previous response each time
while next_token is not None:
    x = client.get_document_analysis(JobId = my_job_id_ref, NextToken = next_token)
    next_token = x.get('NextToken')
    my_ls.append(x)
The value of next_token is overwritten on every iteration until it is None, at which point we break out of the loop.
Note that I'm using x.get(..) to check whether the response dictionary contains NextToken. It may not be set in the first place, in which case .get(..) returns None. (x["NextToken"] will throw a KeyError if NextToken is not set.)
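If you want this as a reusable function, here is a minimal sketch of the same loop. The names collect_pages and method_name are made up for illustration and are not part of boto3. It assumes the job has already finished (the sketch does not poll JobStatus). The default method is get_document_text_detection because the job in the question was started with start_document_text_detection; for a job started with start_document_analysis you would pass "get_document_analysis" instead.

```python
def collect_pages(client, job_id, method_name="get_document_text_detection"):
    """Return every response page for a finished Textract job.

    collect_pages and method_name are hypothetical names for this sketch.
    client is expected to be a boto3 Textract client (or anything with a
    method of that name that accepts JobId and NextToken keyword arguments).
    """
    fetch = getattr(client, method_name)
    pages = []
    kwargs = {"JobId": job_id}
    while True:
        response = fetch(**kwargs)
        pages.append(response)
        # .get() returns None when the key is absent, i.e. on the last page
        next_token = response.get("NextToken")
        if not next_token:
            break
        # Pass the token from this response to the next call
        kwargs["NextToken"] = next_token
    return pages


# Possible usage, assuming boto3 is installed and my_job_id_ref holds a JobId:
#   import boto3
#   client = boto3.client("textract")
#   my_ls = collect_pages(client, my_job_id_ref)
```

The first call is made without NextToken and every later call reuses the token from the previous response, so there is only one place where the API is called and no duplicated "first request" code.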