python python-3.x ocr python-tesseract

Error while performing OCR using pytesseract


I want to use pytesseract. This is my code:

import pytesseract
from pdf2image import convert_from_path

PDF_file = 'file.pdf'
text = ''
pages = convert_from_path(PDF_file, 500)
pageText = pytesseract.image_to_string(pages[0])

As a result, I get this error:

Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 409, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\user\Desktop\projects\pdfparser\pdftest.py", line 13, in <module>
    pages = convert_from_path(PDF_file, 500)
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 89, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 430, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?


Solution

  • As several comments already pointed out, the error message

    pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

    tells you precisely what went wrong: Poppler is not installed, or it is not on your PATH. Please refer to the pdf2image README for installation instructions.

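    You see, pdf2image is only a wrapper around Poppler's command-line utilities, such as pdftoppm and pdfinfo (the call that fails in your traceback). Most Linux distributions ship them by default, so you usually do not need to bother with it there, but Windows does not: you have to install Poppler yourself and either add the folder containing pdfinfo.exe to PATH or pass that folder to convert_from_path via the poppler_path argument.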
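
To make the fix concrete, here is a minimal sketch that checks for Poppler before calling pdf2image and forwards an explicit install folder through `poppler_path`. The helper names `find_poppler` and `ocr_first_page`, and the `poppler_dir` argument, are hypothetical and only for illustration; the folder you pass depends on where you unpacked Poppler on your machine.

```python
import shutil
from typing import Optional


def find_poppler(poppler_dir: Optional[str] = None) -> Optional[str]:
    """Return the full path of the pdfinfo executable, or None if it is missing.

    poppler_dir is a hypothetical install folder (the one that contains
    pdfinfo / pdfinfo.exe). When it is None, the regular PATH is searched.
    """
    return shutil.which("pdfinfo", path=poppler_dir)


def ocr_first_page(pdf_file: str, poppler_dir: Optional[str] = None) -> str:
    """OCR the first page of pdf_file, failing early if Poppler cannot be found."""
    if find_poppler(poppler_dir) is None:
        raise RuntimeError(
            "pdfinfo not found: install Poppler and add it to PATH, "
            "or pass the folder that contains it as poppler_dir"
        )

    # Imported here so the Poppler check above works even before
    # pdf2image and pytesseract are installed.
    import pytesseract
    from pdf2image import convert_from_path

    # Convert only the first page; poppler_path=None means "use PATH".
    pages = convert_from_path(
        pdf_file,
        dpi=500,
        first_page=1,
        last_page=1,
        poppler_path=poppler_dir,
    )
    return pytesseract.image_to_string(pages[0])
```

With Poppler on PATH you would call `ocr_first_page('file.pdf')`; otherwise pass the folder as the second argument. Checking up front is a design choice: it turns the two chained tracebacks above into one short error that names the missing piece.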