python ocr tesseract

Write the OCR text from each image to a separate text file corresponding to that image


I am reading a PDF file, converting each page to an image, and saving the images. Next, I need to run OCR on each image and write the recognized text to a new text file.

I know how to get all text from all images and dump it into one text file.

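import os

import pytesseract
from pdf2image import convert_from_path
from PIL import Image

pdf_dir = 'dir path'
os.chdir(pdf_dir)

for pdf_file in os.listdir(pdf_dir):
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)  # render each page at 300 DPI
        pdf_file = pdf_file[:-4]
        # enumerate gives each page its position directly
        for page_number, page in enumerate(pages):
            page.save("%s-page%d.jpg" % (pdf_file, page_number), "JPEG")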

img_dir = 'dir path'
os.chdir(img_dir)

docs = []

for img_file in os.listdir(img_dir):
    if img_file.endswith(".jpg"):
        texts = str(((pytesseract.image_to_string(Image.open(img_file)))))
        text = texts.replace('-\n', '')  
        print(texts)
        img_file = img_file[:-4]
        for text in texts:
            file = img_file + ".txt"
#          create the new file with "w+" as open it
            with open(file, "w+") as f:
                for texts in docs:
                # write each element in my_list to file
                    f.write("%s" % str(texts))
                    print(file)   

I need one text file per image, containing the text recognized in that image. The files currently being written are all empty, and I am not sure what is going wrong. Can someone help?


Solution

  • There's kind of a lot to unpack here:

    1. You're iterating over docs, which is an empty list, to write the text files. As a result, each text file is merely created (empty) and f.write is never executed.
    2. You assign text = texts.replace('-\n', '') but never use the result. The following for text in texts loop rebinds text, so inside that loop it is an item from the iterable texts rather than the result of the replace.
    3. Since texts is a str, each text in texts is a single character.
    4. You then reuse the name texts (already assigned above) as the loop variable over docs (which, again, is empty).

    2 and 4 aren't necessarily problematic, but probably are not good practice. 1 seems to be the main culprit for why you're producing empty text files. 3 seems to also be a logical error as you almost certainly do not want to write out individual characters to the file(s).
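
Points 1 and 3 can be reproduced without Tesseract at all. The snippet below is only an illustrative sketch: the sample string and the page0.txt file name are made up, and a temporary directory stands in for the image folder.

```python
import os
import tempfile

docs = []            # never appended to, exactly as in the question
texts = "Hi\nyou"    # stand-in for the OCR result (made-up sample text)

# Point 3: iterating over a str yields single characters, not lines or words.
chars = [text for text in texts]

# Point 1: "w+" creates the file, but the loop over the empty list never runs.
path = os.path.join(tempfile.mkdtemp(), "page0.txt")
writes = 0
with open(path, "w+") as f:
    for item in docs:
        f.write(str(item))
        writes += 1

print(chars)                           # ['H', 'i', '\n', 'y', 'o', 'u']
print(writes, os.path.getsize(path))   # 0 0
```

The file exists afterwards but has a size of zero, which matches the empty files described in the question.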

    So I think this is what you want, but it is untested:

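    for img_file in os.listdir(img_dir):
        if img_file.endswith(".jpg"):
            # image_to_string already returns a str, so no str() wrapper is needed
            texts = pytesseract.image_to_string(Image.open(img_file))
            # keep the result of replace() so hyphenated line breaks are actually re-joined
            texts = texts.replace('-\n', '')
            print(texts)
            # one output file per image: same base name with a .txt extension
            file = img_file[:-4] + ".txt"
            # "w" creates the file, or truncates it if it already exists
            with open(file, "w") as f:
                f.write(texts)
            print(file)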
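
If you would rather not depend on os.chdir and string slicing, the same idea can be sketched with pathlib. This is a hedged variant, not a tested drop-in: the function names write_text_per_image and tesseract_ocr are hypothetical, and the OCR step is passed in as a callable so the file-writing logic can be checked without Tesseract installed.

```python
from pathlib import Path
from typing import Callable, List


def tesseract_ocr(image_path: Path) -> str:
    """Run Tesseract on one image (needs pytesseract and Pillow installed)."""
    # Imported lazily so the rest of the sketch works without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def write_text_per_image(img_dir, ocr: Callable[[Path], str] = tesseract_ocr) -> List[Path]:
    """Write <name>.txt next to every <name>.jpg in img_dir; return the paths written."""
    written = []
    for img_path in sorted(Path(img_dir).glob("*.jpg")):
        # Re-join words that were hyphenated across a line break.
        text = ocr(img_path).replace("-\n", "")
        txt_path = img_path.with_suffix(".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling write_text_per_image('dir path') would use Tesseract by default; passing a different callable for ocr swaps in another OCR engine or a stub.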