
Reading a PDF in fully asynchronous mode in Python


I'm really struggling to read my PDF files asynchronously. I tried using aiofiles, which is open source on GitHub. I want to extract the text from PDFs, and I want to do it with pdfminer because pypdf currently does not render math (Greek letters) or double letters (ligatures, e.g. ff) properly.

The routine that works is:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

with open(pdf_filename, 'rb') as file:

    resource_manager = PDFResourceManager(caching=False)

    # Create a string buffer object for text extraction
    text_io = StringIO()

    # Create a text converter object
    text_converter = TextConverter(resource_manager, text_io, laparams=LAParams())

    # Create a PDF page interpreter object
    page_interpreter = PDFPageInterpreter(resource_manager, text_converter)

    # Process each page in the PDF file
    async for page in extract_pages(file):
        page_interpreter.process_page(page)

    text = text_io.getvalue()

But if I replace "with open(pdf_filename, 'rb') as file" with "async with aiofiles.open(pdf_filename, 'rb') as file", the line "async for page in extract_pages(file)" fails with this error:

    async for page in extract_pages(file):
TypeError: 'async for' requires an object with __aiter__ method, got generator

So how do I get the file returned by aiofiles to behave like a normal file that supports __aiter__?

I use the following to replace the original extract_pages function, to try to make it work asynchronously:

async def extract_pages(file):
    with file:
        for page in PDFPage.get_pages(file, caching=False):
            yield page

Many thanks if you can help me read a PDF file asynchronously in Python with pdfminer, or with something equivalent that can read math.


Solution

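  • PDFPage.get_pages is an ordinary (synchronous) generator, which is why "async for" rejects it. To iterate over it with "async for", it must be wrapped in an asynchronous generator. I haven't found a ready-made solution for this, so here is my own, which runs each next() call in an executor thread: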

    import asyncio
    
    
    class WrappedStopIteration(Exception):
        """ "StopIteration" can't be transferred through a Future, so we need our own replacement"""
        pass
    
    
    def nextwrap(it):
        try:
            return next(it)
        except StopIteration as e:
            raise WrappedStopIteration(e)
    
    
    async def agen(it):
        loop = asyncio.get_running_loop()
        try:
            while True:
                v = await loop.run_in_executor(None, nextwrap, it)
                yield v
        except WrappedStopIteration:
            pass
    

    (Caveat: Fails if thread-local variables are used or the generator/iterator otherwise assumes that it is executed completely in the same thread.)

    In your case it can be used as follows:

    async def extract_pages(file):
        
        # "with file:" can be omitted because there is already the outer "with"
        # enclosing the whole execution
    
        async for page in agen(PDFPage.get_pages(file, caching=False)):
            yield page
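
To see the wrapper working without a PDF at hand, here is a minimal self-contained sketch. It repeats the helper from above so that it runs on its own. The names slow_pages and ticker are hypothetical stand-ins (slow_pages plays the role of PDFPage.get_pages), and the optional executor parameter on agen is an addition made for this sketch, not part of the original helper.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


class WrappedStopIteration(Exception):
    """Replacement for StopIteration, which cannot be set on a Future."""


def nextwrap(it):
    try:
        return next(it)
    except StopIteration as e:
        raise WrappedStopIteration(e)


async def agen(it, executor=None):
    # Same wrapper as above; the "executor" parameter is an assumption
    # added for this sketch (None means the loop's default executor).
    loop = asyncio.get_running_loop()
    try:
        while True:
            yield await loop.run_in_executor(executor, nextwrap, it)
    except WrappedStopIteration:
        pass


def slow_pages(n):
    # Hypothetical stand-in for PDFPage.get_pages: a blocking generator.
    for i in range(n):
        time.sleep(0.05)
        yield f"page {i}"


async def ticker(log):
    # Runs alongside the extraction to show the event loop stays responsive.
    for _ in range(3):
        log.append("tick")
        await asyncio.sleep(0.01)


async def main():
    log, pages = [], []
    # A single worker means every next() call runs on the same thread.
    with ThreadPoolExecutor(max_workers=1) as pool:
        tick_task = asyncio.create_task(ticker(log))
        async for page in agen(slow_pages(3), executor=pool):
            pages.append(page)
        await tick_task
    return pages, log


pages, log = asyncio.run(main())
print(pages, log)
```

Passing a ThreadPoolExecutor with max_workers=1 instead of None is one possible way to soften the caveat above, since all next() calls then run on the same worker thread. Whether that is enough for a particular generator is something you would need to verify for your case.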