python amazon-web-services pdf amazon-textract

I do not want to write and then read the same document in Python


I have PDF files from which I want to extract information from the first page only. My current solution is to:

  1. Read the PDF from S3 and use PyPDF2 to save only the first page to a file.
  2. Read back the one-page PDF I just saved, convert it to a byte array, and analyse it with AWS Textract.

It works, but I do not like this solution. Why should I have to save a file and then immediately read the exact same file back? Can I not use the data directly in memory at runtime?

Here is my current code, including the part I don't like:

from PyPDF2 import PdfReader, PdfWriter
from io import BytesIO
import boto3

def analyse_first_page(bucket_name, file_name):
    s3 = boto3.resource("s3")
    obj = s3.Object(bucket_name, file_name)
    fs = obj.get()['Body'].read()
    pdf = PdfReader(BytesIO(fs), strict=False)
    writer = PdfWriter()
    page = pdf.pages[0]
    writer.add_page(page)
    
    # Here is the part I do not like
    with open("first_page.pdf", "wb") as output:
        writer.write(output)

    with open("first_page.pdf", "rb") as pdf_file:
        encoded_string = bytearray(pdf_file.read())

    # Analyse text
    textract = boto3.client('textract')
    response = textract.detect_document_text(Document={"Bytes": encoded_string})

    return response

analyse_first_page(bucket, file_name)

Is there an AWS-native way to do this, or at least a better way that avoids the temporary file?


Solution

  • You can use BytesIO as an in-memory stream, so there is no need to write the page to a file and then read it back.

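    with BytesIO() as bytes_stream:
        writer.write(bytes_stream)
        # getvalue() returns the whole buffer regardless of the stream position,
        # so no seek(0) is needed. Pass the raw bytes, as the question's code does:
        # boto3 base64-encodes the Bytes field itself.
        encoded_string = bytes_stream.getvalue()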
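
For completeness, here is a minimal sketch of how the whole function from the question might look with the temporary file removed. The helper name `writer_to_bytes` is hypothetical (it is not part of PyPDF2 or boto3), and the sketch assumes the same bucket layout and single-page Textract call as the question.

```python
from io import BytesIO


def writer_to_bytes(writer):
    """Serialise any object with a .write(stream) method (for example a
    PyPDF2 PdfWriter) into bytes, entirely in memory.

    Hypothetical helper name, introduced only for this illustration.
    """
    with BytesIO() as stream:
        writer.write(stream)
        # getvalue() must be called before the stream is closed.
        return stream.getvalue()


def analyse_first_page(bucket_name, file_name):
    # Third-party imports are kept local so that writer_to_bytes can be
    # used (and tested) without boto3 or PyPDF2 installed.
    import boto3
    from PyPDF2 import PdfReader, PdfWriter

    # Download the whole PDF from S3 into memory.
    s3 = boto3.resource("s3")
    body = s3.Object(bucket_name, file_name).get()["Body"].read()

    # Copy only the first page into a new in-memory PDF.
    reader = PdfReader(BytesIO(body), strict=False)
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    first_page = writer_to_bytes(writer)

    # Raw bytes are passed here, as in the question's working code.
    textract = boto3.client("textract")
    return textract.detect_document_text(Document={"Bytes": first_page})
```

It would be called exactly as before, e.g. `analyse_first_page(bucket, file_name)`, and nothing is written to disk at any point.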