
Extract text per page with Python pdfMiner?


I have experimented with both pypdf and pdfMiner to extract text from PDF files. Some of my PDFs are unfriendly, and only pdfMiner extracts them successfully. I am using the code here to extract text for the entire file. However, I would really like to extract text page by page, like pages[i].extract_text() in pypdf. Does anyone know how to extract text per page using pdfMiner?


Solution

  • Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but this project is largely dormant. For the active project, check out its fork pdfminer.six.

    Here is an example of how to extract the text of the first and forty-second pages using pdfminer.six's extract_text function.

    from pdfminer.high_level import extract_text
    text = extract_text('samples/simple1.pdf', page_numbers=[0, 41])
    

    page_numbers – List of zero-indexed page numbers to extract.

    If you only need one page, set page_numbers to a single-element list (e.g. page_numbers=[41]).
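
To get the text of every page separately, rather than of a fixed list of page numbers, one option is pdfminer.six's extract_pages, which yields one layout object per page. The following is a minimal sketch under stated assumptions, not a definitive implementation: layout_text and iter_page_texts are hypothetical helper names, and text nested inside figures is skipped because only top-level elements are inspected.

```python
from typing import Iterable, Iterator, Tuple


def layout_text(layout: Iterable[object]) -> str:
    """Join the text of every top-level element on a page that has get_text()."""
    # Text boxes expose get_text(); lines, rectangles and figures do not.
    return "".join(el.get_text() for el in layout if hasattr(el, "get_text"))


def iter_page_texts(pdf_path: str) -> Iterator[Tuple[int, str]]:
    """Yield (zero-based page index, page text) for each page of the PDF."""
    # Imported lazily so layout_text() stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages

    for index, layout in enumerate(extract_pages(pdf_path)):
        yield index, layout_text(layout)
```

With the sample path used above, dict(iter_page_texts('samples/simple1.pdf')) would map each zero-based page index to that page's text.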

    Original Answer

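    # Legacy PDFMiner API: doc is an already-parsed PDFDocument instance.
    # In pdfminer.six, iterate PDFPage.get_pages(fp) instead.
    for pageNumber, page in enumerate(doc.get_pages()):
        if pageNumber == 42:  # zero-based, so this is the forty-third page
            pass  # do something with the page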
    

    There is a pretty good article here.
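
For pdfminer.six, the page loop from the original answer might be written with PDFPage.get_pages and a TextConverter. This is only a sketch under stated assumptions: select_pages and extract_selected are hypothetical helper names, and default LAParams layout settings are assumed to be acceptable.

```python
from io import StringIO
from typing import Dict, Iterable, Iterator, Set, Tuple, TypeVar

T = TypeVar("T")


def select_pages(pages: Iterable[T], wanted: Set[int]) -> Iterator[Tuple[int, T]]:
    """Yield (zero-based index, page) only for the indexes in `wanted`."""
    for index, page in enumerate(pages):
        if index in wanted:
            yield index, page


def extract_selected(pdf_path: str, wanted: Set[int]) -> Dict[int, str]:
    """Return {page index: text} for the selected pages of a PDF."""
    # pdfminer.six imports are kept local so select_pages() works on its own.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    manager = PDFResourceManager()
    texts: Dict[int, str] = {}
    with open(pdf_path, "rb") as fp:
        for index, page in select_pages(PDFPage.get_pages(fp), wanted):
            buffer = StringIO()  # fresh buffer so each page's text stays separate
            device = TextConverter(manager, buffer, laparams=LAParams())
            PDFPageInterpreter(manager, device).process_page(page)
            device.close()
            texts[index] = buffer.getvalue()
    return texts
```

For example, extract_selected('samples/simple1.pdf', {0}) would return a dictionary holding only the first page's text. A new converter is created per page so that the output of one page never mixes with the next.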