python-3.x pdf ocr python-tesseract image-file

OCR a multipage PDF in Python


I am using pytesseract to run OCR on images. I have statement PDFs that are 3-4 pages long. I need a way to convert each of them into multiple .jpg/.png images and run OCR on those images one by one. At the moment I convert a single page to an image and then run

text=str(pytesseract.image_to_string(Image.open("imagename.jpg"),lang='eng'))

after which I use regex to extract information and create a dataframe. The regex logic is the same for all pages. If I could read the image files in a loop, the process could be automated for any PDF that arrives in the same format.


Solution

  • PyMuPDF is another option for looping over the pages as images. Here is how you can do it:

    import fitz  # PyMuPDF
    from PIL import Image
    import pytesseract
    
    input_file = 'path/to/your/pdf/file'
    pdf_file = input_file
    fullText = ""
    
    doc = fitz.open(pdf_file)  # open the PDF with PyMuPDF
    ### ---- If you need to scale a scanned image --- ###
    zoom = 1.2  # scale each page by 120%
    mat = fitz.Matrix(zoom, zoom)
    noOfPages = doc.page_count  # was doc.pageCount in older PyMuPDF releases
    
    for pageNo in range(noOfPages):
        page = doc.load_page(pageNo)  # load one page (formerly loadPage)
        pix = page.get_pixmap(matrix=mat)  # render the page, scaled by mat (formerly getPixmap)
        output = '/path/to/save/image/files' + str(pageNo) + '.png'
        pix.save(output)  # write the rendered page as a PNG (formerly writePNG)
    
        text = pytesseract.image_to_string(Image.open(output), lang='eng')
        fullText += text
    
    fullText = fullText.splitlines()  # or do something here to extract information using regex
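
One way to do the regex step, as a minimal sketch: keep the OCR text of each page separate and run the same pattern over every page. The pattern, the field names (`date`, `description`, `amount`) and the sample line format are assumptions made up for illustration, not taken from the question; substitute the regex that matches your statements.

```python
import re

# Hypothetical line format: "2021-03-01  Coffee shop  -4.50"
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<description>.+?)\s+(?P<amount>-?\d+\.\d{2})$"
)

def extract_rows(page_texts):
    """Apply the same regex to every line of every page and collect the matches."""
    rows = []
    for page_no, text in enumerate(page_texts):
        for line in text.splitlines():
            match = LINE_PATTERN.match(line.strip())
            if match:
                row = match.groupdict()
                row["page"] = page_no  # remember which page the row came from
                rows.append(row)
    return rows

# Stand-in for the per-page OCR output collected in the loop above
sample_pages = [
    "Statement page 1\n2021-03-01  Coffee shop  -4.50\n",
    "2021-03-02  Salary  1500.00\nTotal ...",
]
rows = extract_rows(sample_pages)
```

To feed this from the loop above, append each page's `text` to a list instead of concatenating it into `fullText`. `pandas.DataFrame(rows)` would then turn the list of dicts into a dataframe.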
    

    How handy this is depends on what you want to do with the PDF files. For more detailed information about PyMuPDF, these links might be helpful: the PyMuPDF tutorial and the PyMuPDF Git repository.
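
A note on the zoom factor: PDF page sizes are measured in points (72 per inch), and PyMuPDF renders at 72 dpi when no matrix is given, so `zoom = 1.2` gives roughly 86 dpi. The Tesseract documentation suggests that images of about 300 dpi or more tend to OCR better. A small helper, sketched here with the made-up name `zoom_for_dpi`, converts a target resolution into a zoom factor:

```python
PDF_POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch

def zoom_for_dpi(dpi):
    """Return the zoom factor that renders a PDF page at the given resolution."""
    return dpi / PDF_POINTS_PER_INCH

zoom = zoom_for_dpi(300)  # about 4.17
# fitz.Matrix(zoom, zoom) would then be passed to the pixmap call, as in the loop above
```

Higher zoom means larger images and slower OCR, so it may be worth trying a few values on a sample page.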

    Hope this helps.

    EDIT Another, more straightforward way with PyMuPDF is to read the extracted text directly, which works if your PDF files are cleanly formatted and contain extractable text. Once the page has been loaded in the loop above, the following is sufficient:

    blocks = page.get_text("blocks")  # getText("blocks") in older PyMuPDF releases
    blocks.sort(key=lambda block: block[3])  # sort by 'y1' (bottom edge), i.e. top to bottom
    
    for block in blocks:
        print(block[4])  # print the text of this block
    

    Disclaimer: The idea of using blocks came from the repo maintainer. More detail can be found here: issues discussion on Git
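
A possible variation on the sort, sketched on hand-made tuples rather than a real page: order the blocks by their top edge and break ties by horizontal position, so that blocks on the same line come out left to right. It assumes each block is a tuple laid out as `(x0, y0, x1, y1, text, block_no, block_type)`, which is how PyMuPDF documents the "blocks" output; the sample coordinates and the helper name `reading_order` are invented.

```python
def reading_order(blocks):
    """Sort text blocks top to bottom, then left to right, and return their text."""
    # b[1] is y0 (top edge), b[0] is x0 (left edge); rounding y0 groups
    # blocks whose top edges differ only by a fraction of a point
    ordered = sorted(blocks, key=lambda b: (round(b[1]), b[0]))
    return [b[4].strip() for b in ordered]

# Invented sample data standing in for the blocks list of a real page
sample_blocks = [
    (300.0, 50.2, 500.0, 70.0, "Account: 12345\n", 1, 0),
    (50.0, 120.0, 500.0, 140.0, "2021-03-01 Coffee shop -4.50\n", 2, 0),
    (50.0, 50.0, 250.0, 70.0, "Statement\n", 0, 0),
]
for text in reading_order(sample_blocks):
    print(text)
```

This is only a heuristic; multi-column layouts may need grouping by column first.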