Extracting text from scanned PDF without saving the scan as a new file image


I would like to extract text from scanned PDFs.
My "test" code is as follows:

from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image

converted_scan = convert_from_path('test.pdf', 500)

for i in converted_scan:
    i.save('scan_image.png', 'png')
    
text = image_to_string(Image.open('scan_image.png'))
with open('scan_text_output.txt', 'w') as outfile:
    outfile.write(text.replace('\n\n', '\n'))

I would like to know if there is a way to extract the content of the image directly from the object converted_scan, without saving the scan as a new "physical" image file on the disk?

Basically, I would like to skip this part:

for i in converted_scan:
    i.save('scan_image.png', 'png')

I have a few thousand scans to extract text from. Although the generated image files are not particularly large, the overhead is not negligible, and saving them feels like overkill.

EDIT

Here's a slightly different, more compact approach than Colonder's answer, based on this post. For .pdf files with many pages, it might be worth adding a progress bar to each loop using e.g. the tqdm module.

from wand.image import Image as w_img
from PIL import Image as p_img
import pyocr.builders
import regex, pyocr, io

infile = 'my_file.pdf'
tools = pyocr.get_available_tools()
tool = tools[0]
req_image = []
txt = ''

# to convert pdf to img and extract text
with w_img(filename = infile, resolution = 200) as scan:
    image_png = scan.convert('png')
    for i in image_png.sequence:
        img_page = w_img(image = i)
        req_image.append(img_page.make_blob('png'))
    for i in req_image:
        content = tool.image_to_string(
            p_img.open(io.BytesIO(i)),
            lang = tool.get_available_languages()[0],
            builder = pyocr.builders.TextBuilder()
        )
        txt += content

# to save the output as a .txt file
with open(infile[:-4] + '.txt', 'w') as outfile:
    full_txt = regex.sub(r'\n+', '\n', txt)
    outfile.write(full_txt)
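
As a rough sketch of the progress-bar suggestion: tqdm wraps any iterable and prints a bar as items are consumed. The helper name `with_progress` and its fallback to plain iteration when tqdm is not installed are assumptions made for illustration; they are not part of the code above.

```python
def with_progress(items, label):
    """Wrap an iterable in a tqdm progress bar when tqdm is installed.

    Hypothetical helper: falls back to the plain iterable if tqdm is
    missing, so the surrounding loop still runs without it.
    """
    try:
        from tqdm import tqdm
    except ImportError:
        return items
    return tqdm(items, desc=label)


# In the script above, the two loops would become something like:
#     for i in with_progress(image_png.sequence, 'converting'):
#     for i in with_progress(req_image, 'ocr'):
squares = [n * n for n in with_progress(range(5), 'demo')]
```

The bar is written to stderr, so it does not end up in the text output file.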

Solution

  • UPDATE MAY 2021
    I realized that although pdf2image simply calls a subprocess, you don't have to save the images in order to OCR them afterwards. You can pass each page image straight to the OCR call (pytesseract works as the OCR library as well):

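    import pyocr
    import pyocr.builders
    from pdf2image import convert_from_path
    
    # pick the first available OCR tool and language
    tool = pyocr.get_available_tools()[0]
    lang = tool.get_available_languages()[0]
    
    # convert_from_path returns in-memory PIL images, one per page
    for img in convert_from_path("some_pdf.pdf", 300):
        txt = tool.image_to_string(img,
                                   lang=lang,
                                   builder=pyocr.builders.TextBuilder())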
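
For the pytesseract variant mentioned above, a minimal sketch could look like the following. It relies on `convert_from_path` returning PIL images and on `image_to_string` accepting a PIL image directly, as in the question's own imports. The function name `ocr_pdf` and the `convert` and `ocr` parameters are hypothetical; they exist only so the function can be tried without either library installed.

```python
def ocr_pdf(pdf_path, dpi=300, convert=None, ocr=None):
    """Return the OCR text of every page, without writing image files.

    `convert` and `ocr` default to pdf2image.convert_from_path and
    pytesseract.image_to_string. They are parameters only for this
    sketch (an assumption, not part of either library's API).
    """
    if convert is None:
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr
    # The page images stay in memory and go straight to the OCR call,
    # so no intermediate PNG is saved or reopened.
    return [ocr(page) for page in convert(pdf_path, dpi)]
```

A call such as `'\n'.join(ocr_pdf('test.pdf', 500))` would then stand in for the save-and-reopen step in the question.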
    

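    EDIT: you can also try the pdftotext library. Note that it reads text already embedded in the PDF and does not perform OCR, so it only helps if the scans carry a text layer.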

    pdf2image is a simple wrapper around pdftoppm and pdftocairo. Internally it does nothing more than call them through subprocess. This script should do what you want, but you need the wand library as well as pyocr (I think this is a matter of preference, so feel free to use any library for text extraction you want).

    import sys
    from io import BytesIO
    from typing import Iterator
    
    from PIL import Image as Pimage
    from wand.image import Image as Wimage
    
    import pyocr
    import pyocr.builders
    
    def _convert_pdf2jpg(in_file_path: str, resolution: int=300) -> Iterator[Pimage.Image]:
        """
        Convert each page of a PDF file to a JPEG-encoded PIL image, in memory
    
        :param in_file_path: path of pdf file to convert
        :param resolution: resolution with which to read the PDF file
        :return: generator yielding one PIL Image per page
        """
        with Wimage(filename=in_file_path, resolution=resolution).convert("jpg") as all_pages:
            for page in all_pages.sequence:
                with Wimage(page) as single_page_image:
                    # transform wand image to bytes in order to transform it into PIL image
                    yield Pimage.open(BytesIO(bytearray(single_page_image.make_blob(format="jpeg"))))
    
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)
    # The tools are returned in the recommended order of usage
    tool = tools[0]
    print("Will use tool '%s'" % (tool.get_name()))
    # Ex: Will use tool 'libtesseract'
    
    langs = tool.get_available_languages()
    print("Available languages: %s" % ", ".join(langs))
    lang = langs[0]
    print("Will use lang '%s'" % (lang))
    # Ex: Will use lang 'fra'
    # Note that languages are NOT sorted in any way. Please refer
    # to the system locale settings for the default language
    # to use.
    for img in _convert_pdf2jpg("some_pdf.pdf"):
        txt = tool.image_to_string(img,
                                   lang=lang,
                                   builder=pyocr.builders.TextBuilder())
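
Since `txt` in the loop above holds one page at a time, here is a rough sketch of collecting every page and writing one `.txt` file per PDF, which is closer to the few-thousand-scans case in the question. The function `ocr_folder` and its `page_texts` hook are hypothetical names introduced for illustration.

```python
from pathlib import Path


def ocr_folder(folder, page_texts):
    """Write one .txt file next to each PDF found in `folder`.

    `page_texts` is any callable that takes a PDF path and yields the
    text of each page (a hypothetical hook for this sketch).
    """
    written = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Strip each page and skip pages that produced no text.
        pages = [t.strip() for t in page_texts(str(pdf))]
        out = pdf.with_suffix(".txt")
        out.write_text("\n".join(p for p in pages if p) + "\n",
                       encoding="utf-8")
        written.append(out)
    return written
```

With the names from the script above, `page_texts` could be a small generator function that loops over `_convert_pdf2jpg(path)` and yields `tool.image_to_string(img, lang=lang, builder=pyocr.builders.TextBuilder())` for each page.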