python amazon-web-services boto3 amazon-textract

Analyzing a Specific Page of a PDF with Amazon Textract

I am using Amazon Textract to extract text from PDF files. For some of these documents, I want to be able to specify the pages from which data is to be extracted, rather than having to go through the entire thing. Is this possible? If so, how do I do it? I cannot seem to find an answer in the docs.

Solution

I do not believe Textract offers this feature, but you can easily implement it programmatically. Since your tags mention Python, I'll suggest a way to do this in Python. You can use a library like PyPDF2, which lets you pick the pages you want and write them to a new PDF containing just those pages. You can then send that smaller file to Textract instead of the original.

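import os

# PdfReader/PdfWriter replace the PdfFileReader/PdfFileWriter names removed in PyPDF2 3.0
from PyPDF2 import PdfReader, PdfWriter

pdf_file_path = 'Unknown.pdf'
# splitext strips only the trailing extension, unlike str.replace
file_base_name = os.path.splitext(pdf_file_path)[0]

reader = PdfReader(pdf_file_path)

pages = [0, 2, 4]  # zero-based indices: pages 1, 3, 5
writer = PdfWriter()

for page_num in pages:
    writer.add_page(reader.pages[page_num])

# The with block closes the file on exit, so no explicit close() is needed
with open('{0}_subset.pdf'.format(file_base_name), 'wb') as f:
    writer.write(f)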
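
Since the page list is zero-based while people usually think in 1-based page numbers, a small helper can do the conversion. This is only a sketch: the function name parse_page_spec and the "1,3-5" spec format are my own invention for illustration, not part of PyPDF2 or Textract.

```python
def parse_page_spec(spec, page_count):
    """Turn a 1-based page spec such as "1,3-5" into sorted zero-based indices.

    Hypothetical helper: the spec format is an assumption made for this example.
    """
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject anything outside the document or a reversed range
        if start < 1 or end > page_count or start > end:
            raise ValueError(
                "page range {0!r} is outside 1..{1}".format(part, page_count)
            )
        # Page n lives at index n - 1; range() excludes its upper bound
        indices.update(range(start - 1, end))
    return sorted(indices)
```

For example, parse_page_spec("1,3,5", page_count=6) returns [0, 2, 4], which could be used as the pages list above, with len(reader.pages) supplying the page count.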

PyPDF2 can be packaged as a layer for AWS Lambda. Inside a Lambda function, you can save the subset file temporarily in the /tmp/ directory, which is the function's writable scratch space.
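
Once the pages are isolated, they still have to be sent to Textract. Here is a rough sketch of one way to do that, under the assumption that each wanted page has been written out as its own single-page PDF and that the synchronous detect_document_text call accepts it as raw bytes (multi-page PDFs would instead need the asynchronous, S3-based operations). The function name extract_lines and the shape of page_pdfs are made up for this example; the client is expected to be something like boto3.client("textract").

```python
def extract_lines(textract_client, page_pdfs):
    """Return {page_number: [line texts]} for a set of single-page PDFs.

    textract_client: assumed to be boto3.client("textract") or a stand-in
                     exposing detect_document_text.
    page_pdfs:       hypothetical mapping of page number -> bytes of a
                     single-page PDF (e.g. produced with PyPDF2).
    """
    results = {}
    for page_number, pdf_bytes in page_pdfs.items():
        response = textract_client.detect_document_text(
            Document={"Bytes": pdf_bytes}
        )
        # Keep only LINE blocks; PAGE and WORD blocks are skipped
        results[page_number] = [
            block["Text"]
            for block in response["Blocks"]
            if block["BlockType"] == "LINE"
        ]
    return results
```

Passing the client in as an argument keeps the function easy to try with a fake object before pointing it at a real AWS account.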

Source: https://learndataanalysis.org/how-to-extract-pdf-pages-and-save-as-a-separate-pdf-file-using-python/