Tags: python, python-imaging-library, data-extraction, pypdf

Extracting text from a scanned PDF (images) using Python PyPDF2


I have been trying to extract text from a scanned PDF (page images with non-selectable text).

However, the output I get is not human-readable.

I want to extract the DATE and INVOICE NO fields from this PDF (https://drive.google.com/file/d/1qQsqhlSKTZs-hlswrV8PIirR36896KXZ/view).

Please help me extract them and store them as plain text.

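import PyPDF2
from PIL import Image

# PdfFileReader takes a path or a binary stream; its second argument is `strict`, not a file mode
pdf_reader = PyPDF2.PdfFileReader(r'document.pdf')
page = pdf_reader.getPage(85)  # zero-based index, i.e. the 86th page
if '/XObject' in page['/Resources']:
    xobject = page['/Resources']['/XObject'].getObject()
    for obj in xobject:
        if xobject[obj]['/Subtype'] == '/Image':
            size = (xobject[obj]['/Width'], xobject[obj]['/Height'])
            data = xobject[obj]._data  # raw image stream bytes, not text
            print("*******", data)
            print(xobject[obj]['/Filter'])  # compression filter, e.g. /DCTDecode for JPEG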

Solution

  • [UPDATED]
    PyPDF2 only extracts text that is already embedded in a PDF; it cannot recognize text inside images.
    To turn images into text, I would suggest an OCR tool such as PyTesseract.
    Here's an example that uses pdf2image and PyTesseract to do what you're looking for (you first need working installations of Tesseract, PyTesseract, and pdf2image, and pdf2image in turn needs Poppler):

    import pdf2image
    import pytesseract
    from pytesseract import Output, TesseractError
    
    pdf_path = "document.pdf"
    
    images = pdf2image.convert_from_path(pdf_path)
    
    pil_im = images[0] # assuming that we're interested in the first page only
    
    ocr_dict = pytesseract.image_to_data(pil_im, lang='eng', output_type=Output.DICT)
    # ocr_dict now holds all the OCR info including text and location on the image
    
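    # image_to_data returns empty strings for non-word entries, so skip them when joining
    text = " ".join(word for word in ocr_dict['text'] if word.strip())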
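
The question also asks for the DATE and INVOICE NO values stored as plain text, on a page other than the first. Below is a minimal sketch of one way to do that. The helper names (`ocr_page`, `extract_invoice_fields`, `save_fields`) and both regular expressions are assumptions made for illustration. I could not check the label wording or the date format against the linked invoice, so the patterns will likely need adjusting to the real OCR output.

```python
import re
from pathlib import Path

# Hypothetical patterns: the real labels and formats depend on the invoice layout.
# DATE_RE expects something like "Date: 12/03/2021" (also accepts . or - separators).
DATE_RE = re.compile(
    r"\bDATE\b\s*[:\-]?\s*(\d{1,2}[/.\-]\d{1,2}[/.\-]\d{2,4})",
    re.IGNORECASE,
)
# INVOICE_RE expects something like "Invoice No: INV-2041" or "INVOICE # 2041".
INVOICE_RE = re.compile(
    r"\bINVOICE\s*(?:NO|NUMBER|#)\.?\s*[:\-]?\s*([A-Z0-9][A-Z0-9/\-]*)",
    re.IGNORECASE,
)


def ocr_page(pdf_path, page_number):
    """OCR a single page (1-based page number) and return its text."""
    # Imported lazily so the parsing helpers below work without the OCR stack.
    import pdf2image
    import pytesseract

    # Render only the requested page instead of the whole document.
    images = pdf2image.convert_from_path(
        pdf_path, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0], lang="eng")


def extract_invoice_fields(text):
    """Return the first date and invoice number found, or None for a missing field."""
    date_match = DATE_RE.search(text)
    invoice_match = INVOICE_RE.search(text)
    return {
        "date": date_match.group(1) if date_match else None,
        "invoice_no": invoice_match.group(1) if invoice_match else None,
    }


def save_fields(fields, out_path):
    """Write the fields to a plain-text file, one 'key: value' pair per line."""
    lines = [f"{key}: {value}" for key, value in fields.items()]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

A typical call chain would be `save_fields(extract_invoice_fields(ocr_page("document.pdf", 86)), "invoice.txt")`. Page 86 is used here because `getPage(85)` in the question is zero-based, while pdf2image's `first_page` and `last_page` are 1-based. OCR output is noisy, so it is worth printing the raw text once and tuning the patterns against it.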