python performance for-loop ocr python-tesseract

How can I optimize my Python loop for speed?

I wrote some code that uses OCR to extract text from screenshots of follower lists and then puts it into a data frame.

The reason I have to go through the hassle with "name" / "display name" and removing blank lines is that the initial text extraction looks something like this:

Screenname 1

name 1

Screenname 2

name 2

(and so on)

So I know the order in which each extraction will arrive. My code works well for 1-30 images, but with more than that it gets a bit slow. My goal is to run around 5-10k screenshots through it at once. I'm pretty new to programming, so any ideas or tips on how to optimize the speed would be much appreciated! Thank you all in advance :)


from PIL import Image
from pytesseract import pytesseract
import os
import pandas as pd
from itertools import chain

list_final = [""]
list_name = [""]
liste_anzeigename = [""]
list_raw = [""]
anzeigename = [""]
name = [""]
sort = [""]
f = r'/Users/PycharmProjects/pythonProject/images'
myconfig = r"--psm 4 --oem 3"

os.listdir(f)
for file in os.listdir(f):
    f_img = f+"/"+file
    img = Image.open(f_img)
    img = img.crop((240, 400, 800, 2400))
    img.save(f_img)

for file in os.listdir(f):
    f_img = f + "/" + file
    test = pytesseract.image_to_string(Image.open(f_img), config=myconfig)

    lines = test.split("\n")
    list_raw = [line for line in lines if line.strip() != ""]
    sort.append(list_raw)

    name = {list_raw[0], list_raw[2], list_raw[4],
            list_raw[6], list_raw[8], list_raw[10],
            list_raw[12], list_raw[14], list_raw[16]}
    list_name.append(name)

    anzeigename = {list_raw[1], list_raw[3], list_raw[5],
                   list_raw[7], list_raw[9], list_raw[11],
                   list_raw[13], list_raw[15], list_raw[17]}
    liste_anzeigename.append(anzeigename)

reihenfolge_name = list(chain.from_iterable(list_name))
index_anzeigename = list(chain.from_iterable(liste_anzeigename))
sortieren = list(chain.from_iterable(sort))

print(list_raw)
sort_name = sorted(reihenfolge_name, key=sortieren.index)
sort_anzeigename = sorted(index_anzeigename, key=sortieren.index)

final = pd.DataFrame(zip(sort_name, sort_anzeigename), columns=['name', 'anzeigename'])
print(final)

Solution

Use a multiprocessing.Pool.

Combine the code under the for-loops and put it into a function process_file. This function should accept a single argument: the name of a file to process.

Next, using os.listdir, create a list of files to process. Then create a Pool and use its map method to process the list:

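import multiprocessing as mp
import os

def process_file(path):
    # Your code goes here.
    return anzeigename  # Or whatever the result should be.


if __name__ == "__main__":
    f = r'/Users/PycharmProjects/pythonProject/images'
    # os.listdir returns bare file names, so join them with the directory.
    paths = [os.path.join(f, name) for name in os.listdir(f)]
    with mp.Pool() as p:
        liste_anzeigename = p.map(process_file, paths)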

This will run your code in parallel on as many cores as your CPU has. For an N-core CPU, this takes at best roughly 1/N of the time of the sequential version; starting the worker processes and passing results back adds some overhead.
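The body of process_file is left open above. As a rough sketch of what it could contain, assuming the OCR output always alternates screen name and display name as described in the question: the helper pair_names is a hypothetical name, and the crop box and Tesseract config are copied from the question. Slicing keeps the original order, so the sets and the sorted(..., key=sortieren.index) step (which gets slower as the list grows) would no longer be needed. Cropping in memory also avoids overwriting the screenshots on disk.

```python
def pair_names(ocr_text):
    """Turn OCR output into ordered (name, display_name) tuples."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in ocr_text.split("\n") if line.strip()]
    # Even positions hold screen names, odd positions hold display names.
    # zip() stops at the shorter slice, so a trailing unpaired line is ignored.
    return list(zip(lines[0::2], lines[1::2]))


def process_file(path, config="--psm 4 --oem 3"):
    """Crop one screenshot in memory, OCR it, and return its name pairs."""
    # Imported here so pair_names() stays usable without the OCR dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        cropped = img.crop((240, 400, 800, 2400))  # crop box from the question
        text = pytesseract.image_to_string(cropped, config=config)
    return pair_names(text)
```

Because pair_names accepts any number of lines, it does not rely on each screenshot containing exactly nine followers.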

Note that the return value of the worker function must be picklable, because it has to be sent from the worker process back to the parent process.
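To illustrate the point about return values, here is a minimal sketch in which the worker returns only plain lists of tuples of strings, which pickle without trouble. The worker body is a hypothetical stand-in for the real OCR work, and collect is an illustrative helper name, not part of multiprocessing.

```python
import multiprocessing as mp
import pickle
from itertools import chain


def process_file(path):
    """Stand-in worker: returns plain (name, display_name) tuples."""
    # Hypothetical placeholder for the real OCR work on `path`.
    return [(f"name from {path}", f"display name from {path}")]


def collect(paths, processes=None):
    """Run process_file over paths in a pool and flatten the per-file lists."""
    with mp.Pool(processes) as pool:
        per_file = pool.map(process_file, paths)  # results keep input order
    return list(chain.from_iterable(per_file))


if __name__ == "__main__":
    # Quick check that one result survives a pickle round trip.
    sample = process_file("a.png")
    assert pickle.loads(pickle.dumps(sample)) == sample

    rows = collect(["a.png", "b.png"])
    print(rows)
```

The flattened rows can then be turned into a data frame in the parent process, for example with pd.DataFrame(rows, columns=['name', 'anzeigename']).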