Tags: python, amazon-web-services, amazon-s3, boto3, pdfminer

How to use pdfminer to extract text from PDF files stored in S3 bucket without downloading it locally?


I have a PDF stored in an S3 bucket. I want to extract its text using pdfminer.

When the file is stored locally, I am able to extract the text with the code below:

from pdfminer3.layout import LAParams
from pdfminer3.pdfpage import PDFPage
from pdfminer3.pdfinterp import PDFResourceManager
from pdfminer3.pdfinterp import PDFPageInterpreter
from pdfminer3.converter import TextConverter
from pdfminer.high_level import extract_pages
import io

# Set up pdfminer: extracted text accumulates in file_handle
resource_manager = PDFResourceManager()
file_handle = io.StringIO()
converter = TextConverter(resource_manager, file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

pdf_file = 'file.pdf'

# Run every page through the interpreter
with open(pdf_file, 'rb') as fh:
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)

    text = file_handle.getvalue()

# close open handles
converter.close()
file_handle.close()

total_no_pages = len(list(extract_pages(pdf_file)))
print(total_no_pages)
print(text)

With this, I can extract the text cleanly.

However, I want to do the same for PDFs stored in S3.

I have made a connection to the S3 bucket and fetched the data like this:

import boto3

s3 = boto3.resource(
    service_name='s3',
    region_name=<region-name>,
    aws_access_key_id=<access-key>,
    aws_secret_access_key=<secret-key>
)

bucket_name = <bucket_name>
item_name = <folderName/file.pdf>

# Read the whole object into memory as bytes
obj = s3.Object(bucket_name, item_name)
fs = obj.get()['Body'].read()

When I print fs, I can see that it contains the file's raw bytes.

Please suggest a way to use pdfminer on PDFs stored in S3 without saving them locally.


Solution

  • Instead of

    get_pages(fh, caching=True, check_extractable=True):
    

    you could have:

    get_pages(io.BytesIO(fs), caching=True, check_extractable=True):
    

    By the way, you are still downloading the object from S3; you just aren't saving it to your local hard drive.
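
    Putting the two pieces together, here is a minimal sketch of the S3 variant, assuming the same placeholder bucket and key names and the same pdfminer3 imports as in the question. The page count is taken from the extraction loop itself, so the in-memory stream only has to be read once:

    import io
    import boto3
    from pdfminer3.layout import LAParams
    from pdfminer3.pdfpage import PDFPage
    from pdfminer3.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer3.converter import TextConverter

    # Fetch the PDF from S3 into memory
    # (bucket and key are placeholders; configure credentials as in the question)
    s3 = boto3.resource('s3')
    obj = s3.Object('<bucket_name>', '<folderName/file.pdf>')
    pdf_bytes = obj.get()['Body'].read()

    # Wrap the raw bytes in a file-like object that pdfminer can read
    pdf_stream = io.BytesIO(pdf_bytes)

    resource_manager = PDFResourceManager()
    file_handle = io.StringIO()
    converter = TextConverter(resource_manager, file_handle, laparams=LAParams())
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    # Process each page; the extracted text accumulates in file_handle
    total_no_pages = 0
    for page in PDFPage.get_pages(pdf_stream, caching=True, check_extractable=True):
        page_interpreter.process_page(page)
        total_no_pages += 1

    text = file_handle.getvalue()

    converter.close()
    file_handle.close()

    print(total_no_pages)
    print(text)

    If you would rather keep extract_pages for the page count, rewind the stream first with pdf_stream.seek(0) (or wrap a fresh io.BytesIO(pdf_bytes)), because the loop above consumes the stream.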