Tags: python, python-3.x, pdf, python-tesseract

Converting PDF to text using Python and pytesseract


I am trying to convert many PDF files to text. The PDF files are organized in subdirectories within a directory, so there are three layers: directory --> subdirectories --> multiple PDF files in each subdirectory. I am using the following code, which raises this error: ValueError: too many values to unpack (expected 3). The code works when I convert files in a single directory, but not across multiple subdirectories.

It might be quite simple but I cannot get my head around it. Any help would be much appreciated. Thanks.

import pytesseract
from pdf2image import convert_from_path
import glob

pdfs = glob.glob(r"K:\pdf_files")

for pdf_path, dirs, files in pdfs:
    for file in files:
        pages = convert_from_path(os.path.join(pdf_path, file), 500)

        for pageNum,imgBlob in enumerate(pages):
            text = pytesseract.image_to_string(imgBlob,lang='eng')

            with open(f'{pdf_path}.txt', 'a') as the_file:
                the_file.write(text)

Solution

  • I solved the problem more simply by adding * wildcards to the glob pattern, so that it matches the PDF files in every subdirectory of the directory:

    import pytesseract
    from pdf2image import convert_from_path
    import glob
    
    # One * per level: match every PDF in each immediate subdirectory.
    pdfs = glob.glob(r"K:\pdf_files\*\*.pdf")
    
    for pdf_path in pdfs:
        # Render each page of the PDF as an image at 500 DPI.
        pages = convert_from_path(pdf_path, 500)
    
        for page in pages:
            text = pytesseract.image_to_string(page, lang='eng')
    
            # Append each page's text to <name>.pdf.txt next to the PDF.
            with open(f'{pdf_path}.txt', 'a') as the_file:
                the_file.write(text)
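
For context, the ValueError in the question most likely comes from glob.glob returning a flat list of path strings: the loop `for pdf_path, dirs, files in pdfs` then tries to unpack a single string into three names. That three-name unpacking matches what os.walk yields instead. The following is a minimal sketch of the os.walk variant, which also picks up PDFs nested deeper than one level. The helper names `find_pdfs` and `text_path_for` are hypothetical (they are not part of the original code), and the demo uses a temporary directory of empty files so that it runs without real PDFs or Tesseract.

```python
import os
import tempfile
from pathlib import Path


def find_pdfs(root):
    """Return sorted paths of every .pdf file under root, at any depth."""
    found = []
    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory,
    # which is the 3-tuple shape the question's loop tried to unpack.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def text_path_for(pdf_path):
    """Swap the .pdf suffix for .txt (hypothetical naming choice)."""
    return str(Path(pdf_path).with_suffix(".txt"))


# Demo on a throwaway tree: two subdirectories, two PDFs, one non-PDF file.
with tempfile.TemporaryDirectory() as root:
    for sub, name in [("a", "one.pdf"), ("b", "two.pdf"), ("b", "notes.txt")]:
        os.makedirs(os.path.join(root, sub), exist_ok=True)
        Path(root, sub, name).touch()
    demo_pdfs = [os.path.relpath(p, root) for p in find_pdfs(root)]

print(demo_pdfs)
```

In the OCR loop above, `for pdf_path in find_pdfs(r"K:\pdf_files"):` could stand in for the glob call, and `text_path_for(pdf_path)` for the output file name if `report.txt` is preferred over `report.pdf.txt`. Another option that stays with glob is `glob.glob(r"K:\pdf_files\**\*.pdf", recursive=True)`, where `**` matches any number of nested subdirectories.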